Senior Site Reliability Engineer

Headout

Bengaluru, India

2-5 Years

Posted 17 days ago

Job Description

The Role

As a Senior Site Reliability Engineer, you will own and operate cloud-native infrastructure and Kubernetes platforms that power customer-facing services at scale. You will design and optimize CI/CD workflows, improve deployment reliability, and drive observability, incident management, and performance improvements across the organization. You will build platform tooling to improve developer velocity, enforce security guardrails, and standardize best practices. This role expects strong ownership, architectural thinking, and mentorship of junior engineers.

What makes this role special

Full Platform Exposure: Work across DevOps, infrastructure, observability, performance, and reliability
Architecture Ownership: Influence platform and tooling decisions using benchmarks and metrics
High Impact: Build systems that reduce deployment TAT, improve p99s, and scale across teams
Flexibility: Freedom to work across stacks, tools, and evolving platforms

What skills & experience do you need

2-5 years of experience operating customer-facing services at scale
Strong hands-on experience with Kubernetes cluster operations and workload optimization
Experience with service mesh and distributed tracing tools (e.g., Istio, Jaeger)
Comfortable with at least one cloud provider (AWS preferred; GCP or Azure acceptable)
Hands-on experience with monitoring and alerting stacks (Prometheus, Grafana, Thanos, Datadog, New Relic)
Proven experience designing robust CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
Proficiency in Infrastructure as Code (Terraform or Pulumi)
Strong programming skills in Python, Go, or Java/Kotlin, plus shell scripting
Experience with databases such as MySQL and MongoDB, including application and query profiling
Solid understanding of security best practices and compliance
High-ownership mindset with the ability to proactively identify and resolve platform issues