Job Title: SRE Practice Lead
Skills: Site Reliability Engineering (SRE), Reliability Engineering, Engineering‑led Operations, Reliability‑first operating model, Error Budgets, SLIs / SLOs / SLAs, MTTR / MTTD, Incident Management, RCA
Experience: 18+ years
Location: Greater Noida
Role Overview: We are seeking an accomplished SRE Practice Leader to drive the next wave of transformation in reliability engineering, modern operations, and AI-augmented engineering practices. The ideal candidate brings deep expertise in building resilient, scalable platforms while also shaping enterprise-wide engineering standards and influencing transformation programs. This role will champion the shift from traditional operations to SRE-driven, automation-first, AI-enabled modern managed services, helping customers adopt progressive operating models designed for speed, reliability, and efficiency. The SRE Practice Lead will not only lead technical strategy but also elevate our engineering capabilities, contribute to next-generation frameworks and accelerators, and play a pivotal role in guiding customers through modernization journeys.
SRE Leadership:
- Define, develop, and scale SRE strategies, frameworks, and best practices across large, complex environments
- Drive customer transition from traditional AMS toward engineering-led, reliability-first, automation-driven operating models, including AI-led SRE implementations
- Architect and build highly available, resilient, scalable, and self-healing systems across distributed and cloud-native environments
- Establish automation-first approaches for provisioning, configuration management, deployment pipelines, and operational workflows
- Lead the implementation of advanced observability—metrics, logs, traces, APM—and modern alerting practices supporting proactive reliability
- Apply cloud-native technologies (Kubernetes, containers, serverless, service mesh) to build high-performance, decoupled architectures
- Integrate intelligent automation, AI/ML insights, and automated incident workflows to improve MTTR and reduce manual toil
- Optimize cloud resource utilization using data-driven approaches—enabling cost efficiency, elasticity, and predictive scaling
Enterprise Architecture:
- Partner closely with enterprise architecture teams to embed SRE principles into core technology strategies, platforms, and operating models
- Drive standardization by defining SRE reference architectures, engineering guidelines, runbooks, and reusable patterns
- Ensure SRE frameworks integrate seamlessly with existing systems, business domains, and modernization roadmaps
- Evaluate emerging technologies and guide their adoption within engineering and operations ecosystems
Customer Engagement & Presales:
- Participate in presales, showcasing engineering depth through solution proposals, demos, benchmarks, and proofs of concept
- Advise customers through assessments, maturity roadmaps, and tailored SRE modernization strategies
- Articulate the business value of reliability engineering, observability, and automation in the context of large-scale transformation programs
Teamwork & Collaboration:
- Lead, mentor, and coach engineering teams to develop deep SRE competencies across automation, observability, performance engineering, and cloud-native practices
- Foster strong relationships with clients and internal stakeholders
- Collaborate with cross-functional teams across development, architecture, platform engineering, DevOps, and security to deliver unified outcomes
- Mentor and guide junior team members
Desired Skills:
- Minimum of 15 years of experience in Site Reliability Engineering or a related field
- Strong expertise across AWS, Azure, and/or GCP, including design of multi-cloud, hybrid, and distributed architectures
- Modern DevOps mindset, using best-of-breed open-source and leading Infrastructure as Code and SCM tools, for example Terraform and Ansible
- Experience administering high-availability, high-performance environments and managing large-scale, traffic-intensive applications
- Hands-on experience with Docker and Kubernetes and the corresponding cloud-provider managed services
- Excellent understanding of scalability processes and techniques
- Proven ability to work remotely from any location with teams of various sizes, in the same or different time zones, while remaining highly motivated, productive, and organized
- Monitoring and logging: experience with tools such as Prometheus, Grafana, ELK Stack, Splunk, or similar
- Strong problem-solving skills and experience with incident management and root cause analysis
- Knowledge of performance tuning and optimization techniques for various systems and applications
- Strong documentation skills, with the ability to create clear and concise technical documentation