Senior Site Reliability Engineer

Snapmint

Gurugram, India

4-6 Years

Posted 3 days ago

Job Description

Senior Site Reliability Engineer (SRE)

Summary

We are looking for a Senior Site Reliability Engineer (SRE) to build and operate scalable, reliable, and secure platform infrastructure. The ideal candidate will drive automation, observability, incident management, and cloud-native best practices to improve system reliability and operational excellence across distributed systems.

Roles & Responsibilities

Define and manage SLIs, SLOs, and error budgets for critical services
Design and enhance monitoring, logging, alerting, and tracing capabilities
Automate operational processes and improve platform efficiency
Participate in incident response, root cause analysis (RCA), and postmortem reviews
Support production environments through on-call rotations and reliability initiatives
Improve system performance, scalability, availability, and capacity planning
Collaborate with engineering teams to enhance application resiliency and operational readiness
Drive adoption of Infrastructure as Code (IaC) and CI/CD best practices
Maintain highly available, fault-tolerant, and secure cloud infrastructure

Skills

Strong Linux/Unix administration and debugging skills
Proficiency in Python and Bash/shell scripting and automation
Expertise in observability and monitoring tools such as Grafana, Prometheus, ELK, and New Relic
Strong expertise in AWS and cloud infrastructure management
Strong experience with log analysis and monitoring using ELK
Strong incident management, communication, and operational excellence mindset
Hands-on experience with Kubernetes, Docker, and container orchestration
Experience with Terraform and Infrastructure as Code practices
Strong understanding of networking, DNS, load balancing, and distributed systems
Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, or ArgoCD