Key Responsibilities
- Design, deploy, and manage infrastructure on AWS (EC2, VPC, ALB, IAM, Route53)
- Operate and maintain Kubernetes clusters (EKS and kubeadm) using Helm and ArgoCD
- Build, optimize, and maintain CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI)
- Automate infrastructure provisioning using Terraform, with modular and version-controlled setups
- Implement and operate observability systems using Prometheus, Grafana, and Loki or the ELK stack
- Manage production incidents, perform root cause analysis, and implement preventive actions
- Enforce security best practices with IAM, HashiCorp Vault, TLS, and access controls
- Collaborate with engineering teams to ensure deployment hygiene, cost efficiency, and system scalability
Technical Requirements
- Cloud & Infrastructure
  - AWS (EC2, VPC, IAM, ALB, CloudWatch, Route53), DNS, NAT, routing
- Containers & Orchestration
  - Kubernetes (EKS preferred), kubeadm, Helm, ArgoCD, GitOps workflows
- Infrastructure as Code & Automation
  - Terraform (modular, environment-specific), Bash scripting, YAML, JSON, basic Python
- CI/CD
  - GitHub Actions, GitLab CI, Jenkins
- Monitoring & Observability
  - Prometheus, Grafana, Loki, ELK stack, SLO/SLA implementation, latency/P99 tracking
- Security
  - IAM, Vault, network security groups, TLS, least-privilege access enforcement
Preferred Experience & Traits
- Prior experience operating production-grade systems and Kubernetes clusters
- Strong understanding of cloud networking, VPC/subnet design, and security configurations
- Ability to debug real-time incidents and proactively optimize system reliability
- Independent ownership mindset with strong collaboration and communication skills
- Exposure to hybrid or co-located infrastructure environments is a plus