- 12 to 14 years of experience in Site Reliability Engineering, DevOps, or a related field, with at least 3 years in a senior or architect-level role.
- Strong expertise in system architecture, distributed systems, cloud computing (e.g., AWS, Azure, GCP), containerization (e.g., Docker, Kubernetes), and infrastructure as code (e.g., Terraform, Ansible).
- Proficiency in one or more programming/scripting languages (e.g., Python, Groovy, Shell, PowerShell, or similar).
- Strong background in DevOps practices and cloud technologies, with a focus on ensuring the scalability, reliability, and security of cloud infrastructure.
- Experience with monitoring and observability tools (e.g., Dynatrace, Prometheus, Grafana, ELK stack, Datadog).
- Experience integrating SRE practices with backend technologies such as databases and messaging systems. Strong understanding of software engineering principles and practices.
- Deep understanding of incident management, root cause analysis, and post-incident review processes.
- Involvement in setting strategic direction for SRE practices, leading technical initiatives, and promoting a culture of excellence in site reliability engineering.
- Excellent problem-solving and communication skills, and the ability to work collaboratively in a fast-paced, dynamic environment.
- Proven ability to lead technical projects, influence cross-functional teams, and drive change.
- Excellent verbal and written communication skills, with the ability to articulate complex technical concepts to both technical and non-technical audiences.
- Certifications in relevant technologies (e.g., Cloud Certified DevOps Architect, Cloud Operations Support Architect).
Key Responsibilities:
- Architecting Systems: Design and architect highly available, scalable, and resilient systems to meet the demands of our growing user base and evolving business needs.
- Reliability Engineering: Develop and implement strategies to improve system reliability, including incident management, monitoring, and automated remediation.
- Performance Optimization: Identify and address performance bottlenecks, optimize system performance, and ensure efficient resource utilization.
- Collaboration: Partner with development teams, product managers, and other stakeholders to integrate SRE practices into the development lifecycle and ensure alignment with business objectives.
- Automation: Drive automation initiatives to reduce manual intervention, increase efficiency, and improve system reliability.
- Incident Management: Lead post-incident reviews and root cause analysis, and develop strategies for preventing future incidents.
- Best Practices: Establish and enforce best practices for system design, monitoring, and incident management.
- Mentorship: Provide guidance and mentorship to junior SREs and engineering teams on SRE principles and practices.
Qualifications:
- Experience: 8+ years of experience in Site Reliability Engineering, DevOps, or a related field, with at least 3 years in a senior or architect-level role.
- Technical Skills:
  - Programming: Proficiency in one or more programming languages (e.g., Python, Go, Java, or similar).
  - Monitoring Tools: Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, Datadog).
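- Incident Response: Deep understanding of incident management, root cause analysis, and post-incident review processes.
- Leadership: Proven ability to lead technical projects, influence cross-functional teams, and drive change.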
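- Communication: Excellent verbal and written communication skills, with the ability to articulate complex technical concepts to both technical and non-technical audiences.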
Preferred Qualifications:
- Certifications: Relevant certifications (e.g., AWS Certified Solutions Architect, Google Professional Cloud Architect) are a plus.
- Experience: Previous experience in high-growth or high-availability environments.