Senior Site Reliability Engineer

Saarthee

Bengaluru, India

6-8 Years

Posted 2 days ago

Job Description

Position Summary:

We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This role is highly hands-on and focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, preferably on Google Cloud Platform (GCP) or alternatively on AWS.

Job Responsibilities:

Design, implement, and operate highly available and resilient Kubernetes-based systems.
Define, monitor, and enforce SLIs, SLOs, and error budgets to ensure service reliability.
Lead incident response, root cause analysis (RCA), and postmortems, driving continuous improvement.
Architect and manage observability platforms for metrics, logging, tracing, and alerting.
Work hands-on with Prometheus, Alertmanager, OpenTelemetry, Grafana, and Loki / ELK / OpenSearch.
Implement cloud-native monitoring and logging, with preference for GCP Cloud Monitoring & Logging.
Establish actionable alerting standards to reduce noise and improve response effectiveness.
Build and manage cloud infrastructure on GCP (preferred) or AWS.
Operate and scale Kubernetes clusters (GKE preferred) and deploy services using Helm.
Manage containerized workloads using Docker.
Develop automation and internal tooling using Python to improve reliability and observability.
Integrate CI/CD pipelines with reliability and monitoring checks.
Mentor junior engineers, influence architectural decisions, and collaborate across engineering teams.

Required Skills and Qualifications:

6+ years of experience as a DevOps Engineer or SRE, or in a related software engineering role, supporting production-grade systems.
Strong hands-on experience with cloud infrastructure on GCP (preferred) or AWS.
Proven expertise in operating Kubernetes-based platforms in production environments (GKE preferred).
Solid experience designing and maintaining highly available and resilient systems using SRE best practices.
Hands-on knowledge of SLIs, SLOs, error budgets, and reliability engineering principles.
Strong experience with observability and monitoring tools, including Prometheus, Grafana, Alertmanager, OpenTelemetry, and log platforms such as Loki / ELK / OpenSearch.
Demonstrated experience in incident management, on-call support, root cause analysis, and postmortems.
Proficiency in automation and tooling using Python, with additional scripting experience in Shell or Groovy.
Experience integrating CI/CD pipelines (Jenkins, GitHub) with deployment, monitoring, and reliability checks.
Strong understanding of microservices architectures, distributed systems, and containerized workloads.
Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
Good knowledge of cloud networking, security fundamentals, and access controls.
Strong analytical and problem-solving skills with a proactive operational mindset.
Excellent communication skills and the ability to collaborate effectively with cross-functional engineering teams.