Site Reliability Engineer

Confidential

Noida, India

18-20 Years

Posted 19 hours ago

Job Description

SRE Practice Leader

Location: Greater Noida

Experience: 18+ years

Role Overview

We are seeking an accomplished SRE Practice Leader to drive the next wave of transformation in reliability engineering, modern operations, and AI-augmented engineering practices. The ideal candidate brings deep expertise in building resilient, scalable platforms while also shaping enterprise-wide engineering standards and influencing transformation programs.

This role will champion the shift from traditional operations to SRE-driven, automation-first, AI-enabled modern managed services, helping customers adopt progressive operating models designed for speed, reliability, and efficiency.

The SRE Practice Leader will not only lead technical strategy but also elevate our engineering capabilities, contribute to next-generation frameworks and accelerators, and play a pivotal role in guiding customers through modernization journeys.

Responsibilities

Define, develop, and scale SRE strategies, frameworks, and best practices across large, complex environments.
Drive customer transition from traditional managed services toward engineering-led, reliability-first, automation-driven operating models, including AI-led SRE implementations.
Architect and build highly available, resilient, scalable, and self-healing systems across distributed and cloud-native environments.
Establish automation-first approaches for provisioning, configuration management, deployment pipelines, and operational workflows.
Lead the implementation of advanced observability, including metrics, logs, traces, APM, and modern alerting practices that support proactive reliability.
Apply cloud-native technologies such as Kubernetes, containers, serverless, and service mesh to build high-performance, decoupled architectures.
Integrate intelligent automation, AI/ML insights, and automated incident workflows to improve MTTR and reduce manual toil.
Optimize cloud resource utilization using data-driven approaches to enable cost efficiency, elasticity, and predictive scaling.
Partner closely with enterprise architecture teams to embed SRE principles into core technology strategies, platforms, and operating models.
Drive standardization by defining SRE reference architectures, engineering guidelines, runbooks, and reusable patterns.
Ensure SRE frameworks integrate seamlessly with existing systems, business domains, and modernization roadmaps.
Evaluate emerging technologies and guide their adoption within engineering and operations ecosystems.
Participate in presales, showcasing engineering depth through solution proposals, demos, benchmarks, and proofs of concept.
Advise customers through assessments, maturity roadmaps, and tailored SRE modernization strategies.
Articulate the business value of reliability engineering, observability, and automation in the context of large-scale transformation programs.
Lead, mentor, and coach engineering teams to develop deep SRE competencies across automation, observability, performance engineering, and cloud-native practices.
Foster strong relationships with clients and internal stakeholders.
Collaborate with cross-functional teams across development, architecture, platform engineering, DevOps, and security to deliver unified outcomes.
Mentor and guide junior team members.

Requirements

Minimum of 15 years of experience in Site Reliability Engineering or a related field.
Strong expertise across AWS, Azure, and/or GCP, including design of multi-cloud, hybrid, and distributed architectures.
Modern DevOps mindset, with hands-on use of best-of-breed open-source and leading Infrastructure as Code and SCM tools such as Terraform and Ansible.
Experience administering high-availability, high-performance environments and managing large-scale, traffic-intensive applications.
Hands-on experience with Docker and Kubernetes, including the managed container and Kubernetes services offered by cloud providers.
Excellent understanding of scalability processes and techniques.
Proven ability to work remotely from anywhere with teams of various sizes, in the same or different time zones, while remaining highly motivated, productive, and organized.
Monitoring and logging experience with tools such as Prometheus, Grafana, the ELK Stack, or Splunk.
Strong problem-solving skills and experience with incident management and root cause analysis.
Knowledge of performance tuning and optimization techniques for various systems and applications.
Strong documentation skills and the ability to create clear and concise technical documentation.