We are looking for a passionate and detail-oriented Site Reliability Engineer (SRE) with 3 to 6 years of experience to help build and maintain highly reliable, scalable, and performant systems. In this role, you will work at the intersection of development and operations, driving automation, improving system resilience, and ensuring seamless user experiences.
Key Responsibilities
- Ensure high availability, reliability, and performance of applications and infrastructure
- Build and maintain scalable systems and automation frameworks
- Define and monitor SLIs, SLOs, and SLAs
- Implement effective monitoring, alerting, and incident response mechanisms
- Troubleshoot production issues and conduct root cause analysis (RCA)
- Collaborate with development teams to improve system design and resilience
- Automate repetitive tasks and reduce manual interventions
- Manage cloud infrastructure (AWS / Azure / GCP)
- Drive continuous improvement in system reliability and operational efficiency
Required Skills
- Strong experience in Linux/Unix systems
- Hands-on experience with cloud platforms (AWS, Azure, or GCP)
- Proficiency in monitoring tools (Prometheus, Grafana, ELK stack)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Knowledge of CI/CD tools (Jenkins, GitHub Actions, GitLab CI)
- Strong scripting skills (Python, Bash, or Go)
- Understanding of networking, distributed systems, and system architecture
- Experience in incident management and production support
Preferred Skills
- Experience with Infrastructure as Code (Terraform, CloudFormation)
- Familiarity with chaos engineering and performance testing
- Exposure to DevSecOps practices
- Knowledge of microservices architecture
- Certifications in AWS / Kubernetes / SRE practices are a plus
Education
- Bachelor's degree in Computer Science, IT, or a related field
Technology: DevOps
Job Type: Full Time
Job Location: Bangalore, Gurgaon, Hyderabad, Mumbai
Work Mode: Onsite
Experience: 3 to 6 Years
Work Shift: India