apto solutions - executive search & consultants

Site Reliability Engineering Manager

  • Posted 2 months ago

Job Description

Job Title: SRE Manager

Location: Hyderabad & Ahmedabad

Experience Required:

10+ years total experience, with 3+ years in a leadership role in SRE or Cloud Operations.

Technical Knowledge and Skills:

Mandatory:

• Deep understanding of Kubernetes, GKE, Prometheus, Terraform

• Cloud: Advanced GCP administration

• CI/CD: Jenkins, Argo CD, GitHub Actions

• Incident Management: Full lifecycle, tools like OpsGenie

Nice to Have:

• Knowledge of service mesh and observability stacks

• Strong scripting skills (Python, Bash)

• BigQuery/Dataflow exposure for telemetry

Scope:

• Build and lead a team of SREs

• Standardize practices for reliability, alerting, and response

• Engage with Engineering and Product leaders

Roles and Responsibilities:

  • Establish and lead the implementation of organizational reliability strategies, aligning SLAs, SLOs, and Error Budgets with business goals and customer expectations.
  • Develop and institutionalize incident response frameworks, including escalation policies, on-call scheduling, service ownership mapping, and RCA process governance.
  • Lead technical reviews for infrastructure reliability design, high-availability architectures, and resiliency patterns across distributed cloud services.
  • Champion observability and monitoring culture by standardizing tooling, alert definitions, dashboard templates, and telemetry data schemas across all product teams.
  • Drive continuous improvement through operational maturity assessments, toil elimination initiatives, and SRE OKRs aligned with product objectives.
  • Collaborate with cloud engineering and platform teams to introduce self-healing systems, capacity-aware autoscaling, and latency-optimized service mesh patterns.
  • Act as the principal escalation point for reliability-related concerns and ensure incident retrospectives lead to measurable improvements in uptime and MTTR.
  • Own runbook standardization, capacity planning, failure mode analysis, and production readiness reviews for new feature launches.
  • Mentor and develop a high-performing SRE team, fostering a proactive ownership culture, encouraging cross-functional knowledge sharing, and establishing technical career pathways.
  • Collaborate with leadership, delivery, and customer stakeholders to define reliability goals, track performance, and demonstrate ROI on SRE investments.

More Info


Job ID: 113112021
