Senior Site Reliability Engineer

10-12 Years
  • Posted 2 days ago

Job Description

Role: Senior Site Reliability Engineer (SRE) / DevOps Engineer

Location: Viman Nagar, Pune (Work From Office)

Experience: 10+ Years

Timings: 3:00 PM – 12:00 AM (Monday - Friday)

On-Call: Rotation required (24/7 production support)

About The Role

We are seeking a highly experienced Senior Site Reliability Engineer (SRE) / DevOps Engineer responsible for ensuring the reliability, scalability, security, observability, and performance of mission-critical production systems. This role combines strong DevOps automation expertise with true SRE ownership, including incident management, on-call participation, reliability engineering, observability, root cause analysis, and proactive system improvements.

The ideal candidate will have deep expertise in Microsoft Azure, Kubernetes, OpenTelemetry, Golden Signals monitoring, and modern SRE practices. They should be capable of balancing incident response with long-term engineering initiatives that improve reliability, reduce operational toil, and strengthen system resilience.

Key Responsibilities

1. Incident Response & On-Call Ownership

  • Participate in 24/7 on-call rotation for production systems.
  • Rapidly diagnose, mitigate, and resolve high-severity production incidents.
  • Lead Root Cause Analysis (RCA) and post-mortem documentation.
  • Implement corrective and preventive actions to avoid recurrence.
  • Drive improvements in Mean Time to Recovery (MTTR).
  • Maintain and improve SLAs, SLOs, and reliability objectives.

2. Reliability Engineering & SRE Practices


  • Implement and champion Site Reliability Engineering (SRE) best practices.
  • Define, measure, and improve SLIs, SLOs, SLAs, and Error Budgets.
  • Engineer solutions to eliminate repetitive operational work (toil reduction).
  • Conduct reliability reviews and capacity planning exercises.
  • Improve system redundancy, failover strategies, and disaster recovery readiness.
  • Continuously improve service availability, latency, and operational excellence.

3. Cloud & Infrastructure Engineering


  • Design, manage, and optimize infrastructure on Microsoft Azure (Mandatory).
  • Manage Azure Virtual Machines, Networking, Storage, IAM, Azure Monitor, and cloud-native services.
  • Experience with AWS services such as EC2, S3, RDS, IAM, VPC, and CloudWatch is preferred.
  • Exposure to Google Cloud Platform (GCP) is a plus.
  • Administer and optimize Kubernetes clusters and containerized workloads.
  • Manage Helm deployments and Kubernetes-based applications.
  • Implement Infrastructure as Code (Terraform preferred).
  • Support Git-based workflows using GitHub, GitLab, or Azure Repos.

4. Monitoring, Observability & Performance Engineering


  • Design and implement observability solutions using OpenTelemetry, Prometheus, Grafana, Datadog, CloudWatch, and Azure Monitor.
  • Build and maintain distributed tracing frameworks using OpenTelemetry.
  • Establish monitoring based on the four Golden Signals: latency, traffic, errors, and saturation.
  • Design symptom-based alerting focused on user impact.
  • Analyze performance bottlenecks and optimize application and infrastructure performance.
  • Improve logging, tracing, and monitoring strategies across distributed systems.

5. AI & Cloud-Native Workloads (Good to Have)


  • Support deployment and operations of Azure AI Services and AI Foundry solutions.
  • Assist in infrastructure design for RAG (Retrieval-Augmented Generation) workloads.
  • Ensure scalability, reliability, and observability of AI/ML systems in production.

6. Security & Compliance


  • Apply cloud security best practices including IAM, network segmentation, and secrets management.
  • Support vulnerability remediation and security initiatives.
  • Collaborate with development and security teams to meet compliance requirements.

Required Technical Skills


Core Engineering

  • Strong scripting/programming skills in Python and Bash.
  • Good knowledge of Go (Golang) is a plus.
  • Deep understanding of Linux systems administration.
  • Strong networking fundamentals (DNS, TCP/IP, Load Balancing, SSL/TLS).
  • Experience managing highly available production environments.

Cloud & Infrastructure


  • Microsoft Azure (Mandatory)
  • Kubernetes and container orchestration
  • Terraform (Infrastructure as Code)
  • Helm
  • GitHub / GitLab / Azure Repos

Monitoring & Observability


  • OpenTelemetry
  • Prometheus
  • Grafana
  • Datadog
  • Azure Monitor
  • CloudWatch
  • Distributed Tracing
  • Metrics, Logs, and Observability Best Practices

SRE & Reliability


  • Incident Response
  • On-Call Operations
  • Root Cause Analysis (RCA)
  • SLI / SLO / SLA Management
  • Error Budgets
  • Golden Signals Monitoring
  • Capacity Planning
  • Toil Reduction
  • Reliability Engineering

Preferred Qualifications


  • Experience with distributed systems architecture.
  • Exposure to OpenSearch / ELK Stack.
  • Experience supporting AI/ML workloads in cloud environments.
  • Familiarity with Azure AI Services and AI Foundry.
  • Experience building observability frameworks using OpenTelemetry.
  • Strong understanding of modern SRE methodologies and operational excellence.

What We're Looking For


  • Ownership mindset with strong accountability.
  • Calm and effective decision-making during production incidents.
  • Strong debugging and analytical problem-solving skills.
  • Deep understanding of SRE principles, observability, and reliability engineering.
  • Ability to balance incident response with long-term reliability improvements.
  • Excellent collaboration and communication skills.
  • Passion for automation, scalability, and continuous improvement.

Job ID: 149337245
