umanist na

Senior Site Reliability Engineer (SRE) / DevOps Engineer

10-12 Years

Job Description

Additional Important Note For Applicants

  • Currently, only immediate joiners (who have already completed their notice period) or candidates serving a notice period of up to 30 days will be considered for this opportunity.
  • Candidates with longer notice periods may not be considered at this stage due to urgent project requirements.

Important Note for Applicants

Kindly read the job description carefully before applying. Please apply only if your experience, technical skills, and notice period align with the mandatory requirements stated in this posting. Profiles that do not meet the core criteria may be rejected during screening, which costs both sides unnecessary time and effort. We appreciate your understanding and cooperation.

Job Title: Senior Site Reliability Engineer (SRE) / DevOps Engineer

Experience:

10+ Years

Location:

Pune (Viman Nagar) – In Office

Shift Timings

3:00 PM – 12:00 AM

On-Call Requirement

24/7 Production Support Rotation

Job Overview

We are looking for an experienced Senior Site Reliability Engineer (SRE) / DevOps Engineer to ensure the reliability, scalability, security, and performance of large-scale production systems across cloud environments.

The ideal candidate should have strong expertise in DevOps automation, infrastructure engineering, incident management, Kubernetes, cloud platforms, observability, and production reliability practices. This role requires balancing hands-on operational ownership with long-term engineering improvements to strengthen system resilience and reduce operational overhead.

Key Responsibilities

Incident Response & Production Support

  • Participate in 24/7 on-call rotation for production systems
  • Diagnose, troubleshoot, and resolve high-severity production incidents
  • Lead Root Cause Analysis (RCA) and post-mortem reviews
  • Implement corrective and preventive measures to avoid recurring issues
  • Maintain and improve SLAs and SLOs, and reduce MTTR

Reliability Engineering & System Hardening

  • Design and implement reliability improvements to enhance availability and scalability
  • Automate repetitive operational tasks and reduce infrastructure toil
  • Improve redundancy, failover mechanisms, and disaster recovery processes
  • Monitor and optimize key SRE metrics including latency, error rates, and system capacity

Cloud Infrastructure & Platform Engineering

Manage and optimize infrastructure across:

  • AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)
  • Google Cloud Platform (Compute Engine, Cloud Storage, Cloud SQL, IAM, VPC)
  • Microsoft Azure (Virtual Machines, Networking, Storage, Azure Monitor)

Additional Responsibilities

  • Administer and optimize Kubernetes clusters
  • Manage Helm deployments and containerized workloads
  • Implement Infrastructure as Code using Terraform

Monitoring, Observability & Performance Optimization

  • Design and implement observability and monitoring systems
  • Build user-impact-driven alerting mechanisms
  • Work with tools such as:
    • Prometheus
    • Grafana
    • Datadog
    • AWS CloudWatch
    • Azure Monitor
  • Improve logging, tracing, and monitoring practices
  • Analyze bottlenecks and optimize system performance

AI & Cloud-Native Infrastructure (Good to Have)

  • Support deployment of AI services on cloud platforms
  • Assist with infrastructure for RAG (Retrieval-Augmented Generation) workloads
  • Ensure scalability and reliability of AI/ML systems in production environments

Security & Compliance

  • Apply cloud security best practices including IAM, secrets management, and network segmentation
  • Collaborate on vulnerability remediation initiatives
  • Support infrastructure compliance and security requirements

Required Skills & Experience

Core Engineering Skills

  • Strong scripting/programming skills in:
    • Python
    • Bash
    • Go (Good to Have)
  • Strong Linux administration and networking fundamentals
  • Experience managing high-availability production systems

Cloud & Infrastructure

  • Hands-on experience with at least one major cloud platform:
    • AWS
    • Azure
    • GCP
  • Strong Kubernetes and container orchestration experience
  • Infrastructure as Code using Terraform
  • Git-based workflows (GitHub, GitLab, Azure Repos)

Monitoring & Observability

Experience with:

  • Prometheus
  • Grafana
  • Datadog
  • Similar observability platforms

Strong understanding of:

  • SLIs
  • SLOs
  • SLAs

Preferred Qualifications

  • Experience supporting AI/ML workloads in cloud environments
  • Familiarity with distributed systems architecture
  • Exposure to OpenSearch / ELK Stack
  • Experience reducing operational toil through automation
  • Basic knowledge of C# / .NET environments

What We're Looking For

  • Strong ownership mindset and accountability
  • Calm and structured approach during production incidents
  • Excellent debugging and analytical problem-solving skills
  • Ability to balance operational support with long-term engineering improvements
  • Strong collaboration skills with development and engineering teams

Skills: terraform, aws, azure, infrastructure as code, kubernetes, gcp, grafana, prometheus, python, go, linux administration, monitoring & observability, bash, datadog

Job ID: 148482687