Role: Senior Site Reliability Engineer (SRE) / DevOps Engineer
Location: Viman Nagar, Pune (Work From Office)
Experience: 10+ Years
Timings: 3:00 PM – 12:00 AM, Monday to Friday
On-Call: Rotation Required (24/7 Production Support)
About the Role
We are seeking a highly experienced Senior Site Reliability Engineer (SRE) / DevOps Engineer to ensure the reliability, scalability, security, observability, and performance of mission-critical production systems. The role combines strong DevOps automation expertise with hands-on SRE ownership: incident management, on-call participation, reliability engineering, root cause analysis, and proactive system improvements.
The ideal candidate will have deep expertise in Microsoft Azure, Kubernetes, OpenTelemetry, Golden Signals monitoring, and modern SRE practices. They should be capable of balancing incident response with long-term engineering initiatives that improve reliability, reduce operational toil, and strengthen system resilience.
Key Responsibilities
1. Incident Response & On-Call Ownership
- Participate in 24/7 on-call rotation for production systems.
- Rapidly diagnose, mitigate, and resolve high-severity production incidents.
- Lead Root Cause Analysis (RCA) and post-mortem documentation.
- Implement corrective and preventive actions to avoid recurrence.
- Drive improvements in Mean Time to Recovery (MTTR).
- Maintain and improve adherence to SLAs, SLOs, and other reliability objectives.
2. Reliability Engineering & SRE Practices
- Implement and champion Site Reliability Engineering (SRE) best practices.
- Define, measure, and improve SLIs, SLOs, SLAs, and Error Budgets.
- Engineer solutions to eliminate repetitive operational work (toil reduction).
- Conduct reliability reviews and capacity planning exercises.
- Improve system redundancy, failover strategies, and disaster recovery readiness.
- Continuously improve service availability, latency, and operational excellence.
3. Cloud & Infrastructure Engineering
- Design, manage, and optimize infrastructure on Microsoft Azure (Mandatory).
- Manage Azure Virtual Machines, Networking, Storage, IAM, Azure Monitor, and cloud-native services.
- Experience with AWS services such as EC2, S3, RDS, IAM, VPC, and CloudWatch is preferred.
- Exposure to Google Cloud Platform (GCP) is a plus.
- Administer and optimize Kubernetes clusters and containerized workloads.
- Manage Helm deployments and Kubernetes-based applications.
- Implement Infrastructure as Code (Terraform preferred).
- Support Git-based workflows using GitHub, GitLab, or Azure Repos.
4. Monitoring, Observability & Performance Engineering
- Design and implement observability solutions using OpenTelemetry, Prometheus, Grafana, Datadog, CloudWatch, and Azure Monitor.
- Build and maintain distributed tracing frameworks using OpenTelemetry.
- Establish monitoring based on the four Golden Signals:
  - Latency
  - Traffic
  - Errors
  - Saturation
- Design symptom-based alerting focused on user impact.
- Analyze performance bottlenecks and optimize application and infrastructure performance.
- Improve logging, tracing, and monitoring strategies across distributed systems.
5. AI & Cloud-Native Workloads (Good to Have)
- Support deployment and operations of Azure AI Services and AI Foundry solutions.
- Assist in infrastructure design for RAG (Retrieval-Augmented Generation) workloads.
- Ensure scalability, reliability, and observability of AI/ML systems in production.
6. Security & Compliance
- Apply cloud security best practices including IAM, network segmentation, and secrets management.
- Support vulnerability remediation and security initiatives.
- Collaborate with development and security teams to meet compliance requirements.
Required Technical Skills
Core Engineering
- Strong scripting/programming skills in Python and Bash.
- Working knowledge of Go (Golang) is a plus.
- Deep understanding of Linux systems administration.
- Strong networking fundamentals (DNS, TCP/IP, Load Balancing, SSL/TLS).
- Experience managing highly available production environments.
Cloud & Infrastructure
- Microsoft Azure (Mandatory)
- Kubernetes and container orchestration
- Terraform (Infrastructure as Code)
- Helm
- GitHub / GitLab / Azure Repos
Monitoring & Observability
- OpenTelemetry
- Prometheus
- Grafana
- Datadog
- Azure Monitor
- CloudWatch
- Distributed Tracing
- Metrics, Logs, and Observability Best Practices
SRE & Reliability
- Incident Response
- On-Call Operations
- Root Cause Analysis (RCA)
- SLI / SLO / SLA Management
- Error Budgets
- Golden Signals Monitoring
- Capacity Planning
- Toil Reduction
- Reliability Engineering
Preferred Qualifications
- Experience with distributed systems architecture.
- Exposure to OpenSearch / ELK Stack.
- Experience supporting AI/ML workloads in cloud environments.
- Familiarity with Azure AI Services and AI Foundry.
- Experience building observability frameworks using OpenTelemetry.
- Strong understanding of modern SRE methodologies and operational excellence.
What We're Looking For
- Ownership mindset with strong accountability.
- Calm and effective decision-making during production incidents.
- Strong debugging and analytical problem-solving skills.
- Deep understanding of SRE principles, observability, and reliability engineering.
- Ability to balance incident response with long-term reliability improvements.
- Excellent collaboration and communication skills.
- Passion for automation, scalability, and continuous improvement.