Roles & Responsibilities:
- Drive architectural and technical service improvements that cut across all areas, working with different teams.
- Call out major reliability risks and issues, use data to make informed decisions, and drive mitigation plans with different teams.
- Design and build features and tools to support performance and capacity planning.
- Monitor and continually improve the capacity of our production infrastructure in line with application performance.
- Identify and define SLOs, SLAs, and SLIs from a reliability perspective, and ensure an automation- and analytics-driven approach.
- Identify sources of diagnostic information and use them to provide solutions.
- Improve engineering standards, tooling, and processes, working closely with multiple teams.
- Challenge the status quo and lead through change with a bold, fail-fast mentality.
Expectations from the candidate:
- 2 to 3 years of experience
- Professional experience with any of the major cloud environments (Azure / AWS / Google Cloud)
- Experience ensuring the resilience of scalable cloud-native solutions
- Experience working with infrastructure-as-code and configuration management tools such as Terraform, Ansible, Puppet, or Chef
- Experience working with CI/CD and version control tools such as Jenkins, Git, or GitHub
- Experience working with SQL and NoSQL databases
- Exposure to centralized logging solutions such as Graylog, Splunk, or ELK
- Exposure to active monitoring solutions such as Grafana, New Relic, Datadog, or Prometheus
- Strong scripting skills (Bash, PowerShell)
- Strong knowledge of container technologies such as Docker and Kubernetes
- Knowledge of networking principles and understanding of IT security best practices (firewalls, load balancing, routing, and switching)
- Knowledge of horizontal and vertical scaling best practices
- Good interpersonal, communication and organizational skills
Must-have Skills:
- Understanding and working knowledge of Azure
- Knowledge of IaC with Terraform
- Knowledge of AKS (Azure Kubernetes Service)
Good-to-have Skills:
- Knowledge of networking principles and understanding of IT security best practices (firewalls, load balancing, routing, and switching)
- Exposure to active monitoring solutions such as Grafana, New Relic, Datadog, or Prometheus
- Knowledge of horizontal and vertical scaling best practices