Site Reliability Engineer (SRE): AWS, Kubernetes (EKS), CI/CD
Location: Remote
Experience: 4-6 years
Role Summary
We are looking for an SRE to ensure the reliability, scalability, and automation of our AWS and Kubernetes platforms. Responsibilities include CI/CD, infrastructure provisioning, monitoring, incident management, and collaboration with engineering teams to deliver secure, high-availability systems.
Key Responsibilities
- Reliability & Ops: Manage availability, performance, SLIs/SLOs, and incident response; reduce MTTR.
- Kubernetes (EKS): Operate EKS clusters, deployments (Helm/Kustomize), autoscaling, ingress, and security policies.
- CI/CD: Build and maintain pipelines (GitHub Actions/Jenkins/GitLab), enforce best practices, and manage releases.
- IaC & AWS: Provision infrastructure using Terraform/CloudFormation; manage AWS services (EKS, EC2, IAM, VPC, RDS, S3, etc.) and optimize costs.
- Observability: Implement monitoring, logging, and tracing (Prometheus, Grafana, CloudWatch).
- Security: Enforce IAM, WAF, encryption, and vulnerability management.
- Collaboration: Maintain runbooks and SOPs, and work with teams on design and deployment.
Must-Have Skills
- AWS (EKS, EC2, IAM, VPC, etc.)
- Kubernetes (production experience)
- CI/CD pipelines & release management
- Observability (Prometheus, Grafana)
- Terraform (or CloudFormation/CDK)
- Linux, networking, Python/Bash
Good-to-Have
- Service mesh (Istio/Linkerd), GitOps (Argo CD/Flux)
- Observability platforms such as Datadog/New Relic
- WAF tuning, DB basics (Postgres/MySQL)
- High-volume data systems experience