Senior Site Reliability Engineer (SRE) / DevOps Engineer

umanist na

Pune, India

10-12 Years

Posted 20 hours ago

Job Description

Additional Important Note For Applicants

Currently, only immediate joiners (who have already completed their notice period) or candidates serving a notice period of up to 30 days will be considered for this opportunity.
Candidates with longer notice periods may not be considered at this stage due to urgent project requirements.

Important Note for Applicants

Kindly read the job description carefully before applying. Please apply only if your experience, technical skills, and notice period align with the mandatory requirements stated in this posting. Profiles that do not meet the core criteria may be rejected during screening, which costs both sides unnecessary time and effort. We appreciate your understanding and cooperation.

Job Title: Senior Site Reliability Engineer (SRE) / DevOps Engineer

Experience:

10+ Years

Location:

Pune (Viman Nagar) – In Office

Shift Timings

3:00 PM – 12:00 AM

On-Call Requirement

24/7 Production Support Rotation

Job Overview

We are looking for an experienced Senior Site Reliability Engineer (SRE) / DevOps Engineer to ensure the reliability, scalability, security, and performance of large-scale production systems across cloud environments.

The ideal candidate should have strong expertise in DevOps automation, infrastructure engineering, incident management, Kubernetes, cloud platforms, observability, and production reliability practices. This role requires balancing hands-on operational ownership with long-term engineering improvements to strengthen system resilience and reduce operational overhead.

Key Responsibilities

Incident Response & Production Support

Participate in 24/7 on-call rotation for production systems
Diagnose, troubleshoot, and resolve high-severity production incidents
Lead Root Cause Analysis (RCA) and post-mortem reviews
Implement corrective and preventive measures to avoid recurring issues
Maintain and improve SLAs and SLOs, and reduce MTTR

Reliability Engineering & System Hardening

Design and implement reliability improvements to enhance availability and scalability
Automate repetitive operational tasks and reduce infrastructure toil
Improve redundancy, failover mechanisms, and disaster recovery processes
Monitor and optimize key SRE metrics including latency, error rates, and system capacity

Cloud Infrastructure & Platform Engineering

Manage and optimize infrastructure across:

AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)
Google Cloud Platform (Compute Engine, Cloud Storage, Cloud SQL, IAM, VPC)
Microsoft Azure (Virtual Machines, Networking, Storage, Azure Monitor)

Additional Responsibilities

Administer and optimize Kubernetes clusters
Manage Helm deployments and containerized workloads
Implement Infrastructure as Code using Terraform

Monitoring, Observability & Performance Optimization

Design and implement observability and monitoring systems
Build user-impact-driven alerting mechanisms
Work with tools such as:

Prometheus
Grafana
Datadog
AWS CloudWatch
Azure Monitor

Improve logging, tracing, and monitoring practices
Analyze bottlenecks and optimize system performance

AI & Cloud-Native Infrastructure (Good to Have)

Support deployment of AI services on cloud platforms
Assist with infrastructure for RAG (Retrieval-Augmented Generation) workloads
Ensure scalability and reliability of AI/ML systems in production environments

Security & Compliance

Apply cloud security best practices including IAM, secrets management, and network segmentation
Collaborate on vulnerability remediation initiatives
Support infrastructure compliance and security requirements

Required Skills & Experience

Core Engineering Skills

Strong scripting/programming skills in:

Python
Bash
Go (Good to Have)

Strong Linux administration and networking fundamentals
Experience managing high-availability production systems

Cloud & Infrastructure

Hands-on experience with at least one major cloud platform:

AWS
Azure
GCP

Strong Kubernetes and container orchestration experience
Infrastructure as Code using Terraform
Git-based workflows (GitHub, GitLab, Azure Repos)

Monitoring & Observability

Experience with:

Prometheus
Grafana
Datadog
Similar observability platforms

Strong understanding of:

SLIs
SLOs
SLAs

Preferred Qualifications

Experience supporting AI/ML workloads in cloud environments
Familiarity with distributed systems architecture
Exposure to OpenSearch / ELK Stack
Experience reducing operational toil through automation
Basic knowledge of C# / .NET environments

What We're Looking For

Strong ownership mindset and accountability
Calm and structured approach during production incidents
Excellent debugging and analytical problem-solving skills
Ability to balance operational support with long-term engineering improvements
Strong collaboration skills with development and engineering teams

Skills: Terraform, AWS, Azure, Infrastructure as Code, Kubernetes, GCP, Grafana, Prometheus, Python, Go, Linux administration, monitoring & observability, Bash, Datadog