Site Reliability Engineer

spot your leaders & consulting

Pune, India

9-15 Years

Posted 3 days ago

Job Description: Site Reliability Engineer (SRE)

(Notice period: immediate, or a maximum of 30 days)

Total experience: 9-15 years
Experience with, or exposure to, Chaos Engineering or Resilience Testing is required.
Hands-on experience with Python/Bash
Hands-on experience with Ansible, GitLab, CI/CD, GitLab Pages, Jenkins, and Terraform
Hands-on experience with Azure

Required Skills & Experience

Strong experience in Core SRE practices, including reliability engineering, incident management, and automation.
Proven hands-on experience in Performance Engineering / Performance Testing for large-scale distributed systems.
Deep understanding and implementation experience with SLI / SLO / Error Budget frameworks.
Hands-on experience with containerization and orchestration (Docker, Kubernetes).
Strong background in monitoring, observability, and logging, with tools such as Prometheus, Grafana, Datadog, Splunk, and the ELK Stack.
Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
Proficiency in scripting and automation using Python, Bash, Terraform, and Ansible.
Strong troubleshooting skills across application, infrastructure, and network layers.
Experience designing and running incident response and post-mortem reviews.
Ownership mindset with accountability for service reliability and customer outcomes.
Excellent communication, collaboration, and stakeholder management skills.

Nice to Have (SRE+ Skills)

Experience with Keptn or similar tools for automated SLO-based quality gates and continuous delivery.
Programming experience in Java, especially for debugging, performance profiling, or building automation tools.
Familiarity with chaos engineering practices and tools.
Experience working in banking, payments, or capital markets domains.
Knowledge of security best practices and regulatory compliance in enterprise environments.

Responsibilities

What You Will Be Doing

Core SRE & Reliability Engineering

Design, implement, and operate highly available, resilient, and scalable systems aligned with SRE best practices.
Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance reliability and delivery velocity.
Build and maintain service health dashboards to provide real-time visibility into platform stability and customer experience.
Reduce toil through extensive automation of operational workflows, alerts, and remediation activities.

Monitoring, Observability & Service Health

Design and maintain end-to-end monitoring and observability solutions covering infrastructure, applications, APIs, and user journeys.
Implement advanced alerting strategies to reduce noise and improve mean time to detect (MTTD) and mean time to resolution (MTTR).
Leverage metrics, logs, and traces to drive root cause analysis and proactive incident prevention.
Enable reliability reporting for stakeholders using SLO compliance and service health metrics.

Performance Engineering & Testing

Lead performance engineering initiatives, including load testing, stress testing, endurance testing, and capacity validation.
Identify performance bottlenecks across application, middleware, database, and infrastructure layers.
Conduct capacity planning and performance tuning to support business growth and peak traffic scenarios.
Partner with development and QA teams to embed performance testing into CI/CD pipelines.

Incident Management & Operations

Lead and participate in incident response activities, including triage, mitigation, recovery, and post-incident reviews.
Drive blameless post-mortems and ensure corrective actions are tracked to completion.
Participate in on-call rotations, providing 24x7 support for critical production systems.
Continuously improve operational readiness and resilience.

Automation, CI/CD & Cloud Operations

Design and manage deployment pipelines, configuration management, and environment consistency across lower and production environments.
Implement Infrastructure as Code (IaC) practices for repeatable and secure cloud provisioning.
Collaborate with DevOps teams to improve deployment reliability, rollback mechanisms, and release safety.
Develop and test disaster recovery plans, backup strategies, and failover mechanisms.

Collaboration & Governance

Work closely with Development, QA, DevOps, Security, and Product teams to align on reliability and performance goals.
Ensure platforms meet security, compliance, and regulatory requirements common in financial services.
Act as a reliability and performance advocate throughout the SDLC.

More Info

Job Type: Permanent Job
Industry: Other
Function: Site Reliability Engineering
Employment Type: Full time

About Company

spot your leaders & consulting

Job Source: www.linkedin.com

Job ID: 148223233

Last Updated: 23-05-2026 07:56:31 PM
