
Job Description: Site Reliability Engineering (SRE) Manager
Role Overview:
We are looking for an experienced SRE Manager to lead our Site Reliability Engineering team. The ideal candidate will have a strong background in DevOps practices, system reliability, and team leadership.
Key Responsibilities:
- Lead, mentor, and manage a team of SRE/DevOps engineers
- Define and implement SRE best practices (SLIs, SLOs, error budgets)
- Ensure system reliability, scalability, and performance
- Drive automation initiatives
- Collaborate with cross-functional teams
- Own CI/CD pipelines and release management
- Lead incident response and root cause analysis (RCA) processes
- Establish monitoring and observability frameworks
- Manage cloud infrastructure (AWS/Azure/GCP)
- Implement disaster recovery plans
Required Skills & Qualifications:
- 7+ years of experience in SRE/DevOps roles
- 3+ years of team management experience
- Experience with cloud platforms (AWS/Azure/GCP)
- Knowledge of CI/CD tools (Jenkins, GitLab CI)
- Experience with Docker and Kubernetes
- Scripting skills (Python, Bash)
- Knowledge of Terraform/CloudFormation
- Experience with monitoring tools (Prometheus, Grafana, ELK)
Preferred Qualifications:
- Experience with microservices
- Cloud certifications
- Strong problem-solving skills
Key Competencies:
- Leadership
- Communication
- Ownership
- Stakeholder management
Good to Have:
- Experience in e-commerce platforms
- Knowledge of chaos engineering
Job ID: 147206893
Skills:
Kubernetes, Docker, Terraform, Cloud Services, CI/CD, Monitoring