Site Reliability Engineer

LTM

Hyderabad, India

3-5 Years

Posted 14 hours ago

Job Description

Location: Hyderabad, L&T Metro, floors 1-9, 11 & 12

Job Title: Jr. Site Reliability Engineer (SRE) Azure Storage

Role Overview

We are seeking a Site Reliability Engineer (SRE) to support Azure Storage deployments and operations across public, sovereign, and preproduction environments. The role focuses on deployment reliability, incident response, infrastructure health, automation, and data-driven operational insights.

Key Responsibilities

Reliability, Deployments & Operations

Execute Azure Storage (Classic / XPF / Direct Drive) tenant and infrastructure deployments across public, sovereign, and preproduction environments.
Monitor and maintain server uptime, tenant stability, and overall environment health.
Track and reduce offline capacity and long-running (long-tail) deployments to improve deployment completion times.
Manage end-to-end release tracking for storage components and ensure deployment compliance.

Incident Management & Troubleshooting

Acknowledge, triage, and resolve deployment-related incidents and operational alerts.
Apply technical mitigations (including node recovery) to unblock critical deployments.
Lead Severity 2 bridge calls, coordinating with engineering, partner, and vendor teams through resolution.
Manually create and manage Incident Communication Management (ICM) records when required.

Root Cause Analysis & Stability Improvements

Perform root cause analysis (RCA) for hardware, infrastructure, and release-related failures.
Analyze recurring deployment faults and failure trends; file defects with actionable remediation details.
Investigate and correct incorrect fault-bucket assignments to improve diagnostic accuracy.
Collect and analyze hardware logs; deliver structured reports to engineering and vendor teams.

Process, Automation & Documentation

Identify and drive automation opportunities for repetitive or high-risk operational tasks.
Develop, maintain, and publish SOPs, TSGs, troubleshooting playbooks, and KB articles.
Improve workflows through automation, procedural updates, and process optimizations.

Reporting & Stakeholder Communication

Publish daily operational status reports and defect summaries.
Deliver weekly dashboards and quality reports covering deployment health, reliability metrics, and SLO adherence.
Provide regular status updates to stakeholders and participate in daily syncs with on-call teams.

Required Skills & Experience

Strong experience in Azure cloud operations, SRE, or large-scale infrastructure support.
Hands-on experience with incident triage, RCA, and production support.
Solid understanding of storage systems, hardware failures, and deployment pipelines.
Experience working in 24x7 on-call / shift-based operational environments.

Good-to-Have / Preferred Skills

Hyper-V: Virtualization troubleshooting and host-level diagnostics.
Azure DevOps: CI/CD pipelines, release tracking, automation, and operational workflows.
Kusto / Azure Data Explorer (ADX): Writing KQL queries for operational insights; building dashboards for deployment health, defects, capacity, and reliability metrics.
Experience with automation scripting (PowerShell, Python, or similar).

Work Model

Hybrid: 3 days work from office, 2 days work from home.

Note:

Resources are multi-parked, so allocation will be on an FCFS basis only.
CI Blocking is valid for 3 days. If no CI feedback is received within 3 days, the resource will automatically be made available for other requirements.
If there is no response on proposed profiles within 3 days, the RR will be marked on hold.