Senior Cloud Engineer

4-8 Years

Job Description

Responsibilities

Design, deploy, and manage highly available, scalable cloud infrastructure on AWS, with a focus on EKS, MongoDB, Cassandra, and Kafka.
Develop and implement automation using Terraform, Argo CD, and other IaC and GitOps tooling to provision, configure, and maintain infrastructure.
Ensure zero-downtime deployments through efficient CI/CD pipelines and automated rollbacks.
Analyse, monitor, and improve infrastructure performance, scaling, and reliability to meet high availability and disaster recovery requirements.
Create reusable integrations with third-party tools like CI/CD systems, monitoring solutions, and container registries to optimise and consolidate workflows.
Troubleshoot and resolve infrastructure and deployment issues, ensuring rapid response and root cause analysis (RCA) for production incidents.
Participate in an on-call rotation to provide 24/7 support for critical systems as required.
Collaborate with cross-functional teams to implement best practices around observability, monitoring, and logging.
Document infrastructure processes, configurations, and operational runbooks to support a knowledge-sharing culture.

Requirements

4-8 years of professional experience in DevOps or software engineering roles, with a focus on configuring, deploying, and maintaining Kubernetes in AWS.
Strong proficiency in infrastructure as code (IaC) using Terraform, AWS CloudFormation, or similar tools.
Experience with scripting and automation using languages such as Python.
Experience with CI/CD pipelines and automation tools such as Concourse, Jenkins, or Ansible.
Experience on teams that have delivered observability and telemetry tools and practices, such as Prometheus, Grafana, the ELK stack, distributed tracing, and performance monitoring.
Experience with cloud-native tools such as Istio, Argo CD, External Secrets Operator, KEDA, and Karpenter.
Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
Working knowledge of SLI, SLO, and SLA concepts, including how to define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets.
Excellent problem-solving skills and attention to detail.
Experience with service mesh technologies like Istio.
Familiarity with tools like External Secrets Operator for secrets management, and KEDA or Karpenter for scaling workloads.
Certifications such as AWS Certified Solutions Architect or Certified Kubernetes Administrator (CKA).

This job was posted by Mansi Shah from Caizin.