Senior Site Reliability Engineer

acquirex

Pune, India

10-12 Years

Posted 2 days ago

Job Description

Role: Senior Site Reliability Engineer (SRE) / DevOps Engineer

Location: Viman Nagar, Pune (Work From Office)

Experience: 10+ Years

Timings: 3:00 PM – 12:00 AM (Monday to Friday)

About The Role

On-Call Rotation Required (24/7 Production Support)

We are seeking a highly experienced Senior Site Reliability Engineer (SRE) / DevOps Engineer responsible for ensuring the reliability, scalability, security, observability, and performance of mission-critical production systems. This role combines strong DevOps automation expertise with true SRE ownership, including incident management, on-call participation, reliability engineering, observability, root cause analysis, and proactive system improvements.

The ideal candidate will have deep expertise in Microsoft Azure, Kubernetes, OpenTelemetry, Golden Signals monitoring, and modern SRE practices. They should be capable of balancing incident response with long-term engineering initiatives that improve reliability, reduce operational toil, and strengthen system resilience.

Key Responsibilities

1. Incident Response & On-Call Ownership

Participate in 24/7 on-call rotation for production systems.
Rapidly diagnose, mitigate, and resolve high-severity production incidents.
Lead Root Cause Analysis (RCA) and post-mortem documentation.
Implement corrective and preventive actions to avoid recurrence.
Drive improvements in Mean Time to Recovery (MTTR).
Maintain and improve adherence to SLAs and SLOs.

2. Reliability Engineering & SRE Practices

Implement and champion Site Reliability Engineering (SRE) best practices.
Define, measure, and improve SLIs, SLOs, SLAs, and Error Budgets.
Engineer solutions to eliminate repetitive operational work (toil reduction).
Conduct reliability reviews and capacity planning exercises.
Improve system redundancy, failover strategies, and disaster recovery readiness.
Continuously improve service availability, latency, and operational excellence.

3. Cloud & Infrastructure Engineering

Design, manage, and optimize infrastructure on Microsoft Azure (Mandatory).
Manage Azure Virtual Machines, Networking, Storage, IAM, Azure Monitor, and cloud-native services.
Experience with AWS services such as EC2, S3, RDS, IAM, VPC, and CloudWatch is preferred.
Exposure to Google Cloud Platform (GCP) is a plus.
Administer and optimize Kubernetes clusters and containerized workloads.
Manage Helm deployments and Kubernetes-based applications.
Implement Infrastructure as Code (Terraform preferred).
Support Git-based workflows using GitHub, GitLab, or Azure Repos.

4. Monitoring, Observability & Performance Engineering

Design and implement observability solutions using OpenTelemetry, Prometheus, Grafana, Datadog, CloudWatch, and Azure Monitor.
Build and maintain distributed tracing frameworks using OpenTelemetry.
Establish monitoring based on the four Golden Signals: latency, traffic, errors, and saturation.
Design symptom-based alerting focused on user impact.
Analyze performance bottlenecks and optimize application and infrastructure performance.
Improve logging, tracing, and monitoring strategies across distributed systems.

5. AI & Cloud-Native Workloads (Good to Have)

Support deployment and operations of Azure AI Services and AI Foundry solutions.
Assist in infrastructure design for RAG (Retrieval-Augmented Generation) workloads.
Ensure scalability, reliability, and observability of AI/ML systems in production.

6. Security & Compliance

Apply cloud security best practices including IAM, network segmentation, and secrets management.
Support vulnerability remediation and security initiatives.
Collaborate with development and security teams to meet compliance requirements.

Required Technical Skills

Core Engineering

Strong scripting/programming skills in Python and Bash.
Good knowledge of Go (Golang) is a plus.
Deep understanding of Linux systems administration.
Strong networking fundamentals (DNS, TCP/IP, Load Balancing, SSL/TLS).
Experience managing highly available production environments.

Cloud & Infrastructure

Microsoft Azure (Mandatory)
Kubernetes and container orchestration
Terraform (Infrastructure as Code)
Helm
GitHub / GitLab / Azure Repos

Monitoring & Observability

OpenTelemetry
Prometheus
Grafana
Datadog
Azure Monitor
CloudWatch
Distributed Tracing
Metrics, Logs, and Observability Best Practices

SRE & Reliability

Incident Response
On-Call Operations
Root Cause Analysis (RCA)
SLI / SLO / SLA Management
Error Budgets
Golden Signals Monitoring
Capacity Planning
Toil Reduction
Reliability Engineering

Preferred Qualifications

Experience with distributed systems architecture.
Exposure to OpenSearch / ELK Stack.
Experience supporting AI/ML workloads in cloud environments.
Familiarity with Azure AI Services and AI Foundry.
Experience building observability frameworks using OpenTelemetry.
Strong understanding of modern SRE methodologies and operational excellence.

What We're Looking For

Ownership mindset with strong accountability.
Calm and effective decision-making during production incidents.
Strong debugging and analytical problem-solving skills.
Deep understanding of SRE principles, observability, and reliability engineering.
Ability to balance incident response with long-term reliability improvements.
Excellent collaboration and communication skills.
Passion for automation, scalability, and continuous improvement.

More Info

Job Type: Permanent Job

Industry: Other

Function: Site Reliability Engineering

Employment Type: Full time

About Company

acquirex

Job ID: 149337245

Last Updated: 20-06-2026 05:45:33 PM