Responsibilities
- Design, deploy, and manage high-availability, scalable cloud infrastructure using AWS services, with a focus on EKS, MongoDB, Cassandra, and Kafka.
- Develop and implement automation using Terraform, GitOps workflows with Argo CD, and other IaC tools to provision, configure, and maintain infrastructure.
- Ensure zero-downtime deployments through efficient CI/CD pipelines and automated rollbacks.
- Analyse, monitor, and improve infrastructure performance, scaling, and reliability to meet high availability and disaster recovery requirements.
- Create reusable integrations with third-party tools like CI/CD systems, monitoring solutions, and container registries to optimise and consolidate workflows.
- Troubleshoot and resolve infrastructure and deployment issues, ensuring rapid response and root cause analysis (RCA) for production incidents.
- Participate in an on-call rotation to provide 24/7 support for critical systems as required.
- Collaborate with cross-functional teams to implement best practices around observability, monitoring, and logging.
- Document infrastructure processes, configurations, and operational runbooks to support a knowledge-sharing culture.
Requirements
- 4-8 years of professional experience in DevOps or software engineering roles, with a focus on configuring, deploying, and maintaining Kubernetes in AWS.
- Strong proficiency in infrastructure as code (IaC) using Terraform, AWS CloudFormation, or similar tools.
- Experience with scripting and automation using languages such as Python.
- Experience with CI/CD pipelines and automation tools such as Concourse, Jenkins, or Ansible.
- Experience working on teams that have delivered observability and telemetry tools and practices, such as Prometheus, Grafana, the ELK stack, distributed tracing, and performance monitoring.
- Experience with cloud-native tools such as Istio, Argo CD, External Secrets Operator, KEDA, and Karpenter.
- Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
- Familiarity with the concepts of SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement), including the ability to define SLIs, SLOs, and error budgets.
- Excellent problem-solving skills and attention to detail.
- Experience with service mesh technologies like Istio.
- Familiarity with tools like External Secrets Operator for secrets management, or KEDA or Karpenter for scaling workloads.
- Certifications such as AWS Certified Solutions Architect or Certified Kubernetes Administrator (CKA).
This job was posted by Mansi Shah from Caizin.