LTM

Site Reliability Engineer

  • Posted 14 hours ago

Job Description

Location: Hyderabad, L&T Metro, Floors 1-9, 11 & 12

Job Title: Jr. Site Reliability Engineer (SRE) Azure Storage

Role Overview

We are seeking a Site Reliability Engineer (SRE) to support Azure Storage deployments and operations across public, sovereign, and pre-production environments. The role focuses on deployment reliability, incident response, infrastructure health, automation, and data-driven operational insights.

Key Responsibilities

Reliability, Deployments & Operations

  • Execute Azure Storage (Classic / XPF / Direct Drive) tenant and infrastructure deployments across public, sovereign, and pre-production environments.
  • Monitor and maintain server uptime, tenant stability, and overall environment health.
  • Track and reduce offline capacity and long-running (long-tail) deployments to improve deployment completion times.
  • Manage end-to-end release tracking for storage components and ensure deployment compliance.

Incident Management & Troubleshooting

  • Acknowledge, triage, and resolve deployment-related incidents and operational alerts.
  • Apply technical mitigations (including node recovery) to unblock critical deployments.
  • Lead Severity-2 bridge calls, coordinating with engineering, partner, and vendor teams through resolution.
  • Manually create and manage Incident Communication Management (ICM) records when required.

Root Cause Analysis & Stability Improvements

  • Perform root cause analysis (RCA) for hardware, infrastructure, and release-related failures.
  • Analyze recurring deployment faults and failure trends; file defects with actionable remediation details.
  • Investigate and fix incorrect fault-bucket assignments to improve diagnostic accuracy.
  • Collect and analyze hardware logs; deliver structured reports to engineering and vendor teams.

Process, Automation & Documentation

  • Identify and drive automation opportunities for repetitive or high-risk operational tasks.
  • Develop, maintain, and publish SOPs, TSGs, troubleshooting playbooks, and KB articles.
  • Improve workflows through automation, procedural updates, and process optimizations.

Reporting & Stakeholder Communication

  • Publish daily operational status reports and defect summaries.
  • Deliver weekly dashboards and quality reports covering deployment health, reliability metrics, and SLO adherence.
  • Provide regular status updates to stakeholders and participate in daily syncs with on-call teams.

Required Skills & Experience

  • Strong experience in Azure cloud operations, SRE, or large-scale infrastructure support.
  • Hands-on experience with incident triage, RCA, and production support.
  • Solid understanding of storage systems, hardware failures, and deployment pipelines.
  • Experience working in 24x7 on-call / shift-based operational environments.

Good-to-Have / Preferred Skills

  • Hyper-V: Virtualization troubleshooting and host-level diagnostics.
  • Azure DevOps: CI/CD pipelines, release tracking, automation, and operational workflows.
  • Kusto / Azure Data Explorer (ADX):
      • Writing KQL queries for operational insights.
      • Building dashboards for deployment health, defects, capacity, and reliability metrics.
  • Experience with automation scripting (PowerShell, Python, or similar).

Work Model

  • Hybrid: 3 days work from office, 2 days work from home.

Note:

  1. Resources are multi-parked, so allocation will be on an FCFS (first come, first served) basis only.
  2. CI blocking is valid for 3 days; if no CI feedback is received within 3 days, the resource will automatically be made available for other requirements.
  3. If there is no response on proposed profiles within 3 days, the RR will be marked on hold.

Job ID: 144219221
