Site Reliability Engineer Practice Leader

Niit Technologies

Noida, India

18-20 Years

Job Description

Job Title: SRE Practice Lead

Skills: Site Reliability Engineering (SRE), Reliability Engineering, Engineering‑led Operations, Reliability‑first operating model, Error Budgets, SLIs / SLOs / SLAs, MTTR / MTTD, Incident Management, RCA

Experience: 18+ years

Location: Greater Noida

Role Overview: We are seeking an accomplished SRE Practice Leader to drive the next wave of transformation in reliability engineering, modern operations, and AI-augmented engineering practices. The ideal candidate brings deep expertise in building resilient, scalable platforms while also shaping enterprise-wide engineering standards and influencing transformation programs. This role will champion the shift from traditional operations to SRE-driven, automation-first, AI-enabled modern managed services, helping customers adopt progressive operating models designed for speed, reliability, and efficiency. The SRE Practice Lead will not only lead technical strategy but also elevate our engineering capabilities, contribute to next-generation frameworks and accelerators, and play a pivotal role in guiding customers through modernization journeys.

SRE Leadership

Define, develop, and scale SRE strategies, frameworks, and best practices across large, complex environments
Drive customer transition from traditional AMS toward engineering-led, reliability-first, automation-driven operating models, including AI-led SRE implementations
Architect and build highly available, resilient, scalable, and self-healing systems across distributed and cloud-native environments
Establish automation-first approaches for provisioning, configuration management, deployment pipelines, and operational workflows
Lead the implementation of advanced observability—metrics, logs, traces, APM—and modern alerting practices supporting proactive reliability
Apply cloud-native technologies (Kubernetes, containers, serverless, service mesh) to build high-performance, decoupled architectures
Integrate intelligent automation, AI/ML insights, and automated incident workflows to improve MTTR and reduce manual toil
Optimize cloud resource utilization using data-driven approaches—enabling cost efficiency, elasticity, and predictive scaling

Enterprise Architecture

Partner closely with enterprise architecture teams to embed SRE principles into core technology strategies, platforms, and operating models
Drive standardization by defining SRE reference architectures, engineering guidelines, runbooks, and reusable patterns
Ensure SRE frameworks integrate seamlessly with existing systems, business domains, and modernization roadmaps
Evaluate emerging technologies and guide their adoption within engineering and operations ecosystems

Customer Engagement & Presales

Participate in presales, showcasing engineering depth through solution proposals, demos, benchmarks, and proofs of concept
Advise customers through assessments, maturity roadmaps, and tailored SRE modernization strategies
Articulate the business value of reliability engineering, observability, and automation in the context of large-scale transformation programs

Teamwork & Collaboration

Lead, mentor, and coach engineering teams to develop deep SRE competencies across automation, observability, performance engineering, and cloud-native practices
Foster strong relationships with clients and internal stakeholders
Collaborate with cross-functional teams across development, architecture, platform engineering, DevOps, and security to deliver unified outcomes
Mentor and guide junior team members

Desired Skills

Minimum of 15 years of experience in Site Reliability Engineering or a related field
Strong expertise across AWS, Azure, and/or GCP, including design of multi-cloud, hybrid, and distributed architectures
Modern DevOps mindset, using best-of-breed open-source and leading Infrastructure as Code and SCM tools, for example Terraform and Ansible
Experience administering high-availability, high-performance environments and managing large-scale, traffic-intensive applications
Hands-on experience with Docker and Kubernetes and the corresponding provider-managed services
Excellent understanding of scalability processes and techniques
Proven ability to work remotely from anywhere with teams of various sizes, in the same or different time zones, while remaining highly motivated, productive, and organized
Monitoring and logging: experience with tools such as Prometheus, Grafana, ELK Stack, Splunk, or similar
Strong problem-solving skills and experience with incident management and root cause analysis
Knowledge of performance tuning and optimization techniques for various systems and applications
Strong documentation skills, with the ability to create clear and concise technical documentation