spot your leaders & consulting

Site Reliability Engineer

9-15 Years
  • Posted 3 days ago

Job Description

Job Description: Site Reliability Engineer (SRE)

(Notice period: immediate, or a maximum of 30 days)

  • Total experience: 9-15 years.
  • Experience with or exposure to Chaos Engineering or Resilience Testing is required.
  • Hands-on experience with Python and Bash.
  • Hands-on experience with Ansible, GitLab, CI/CD, GitLab Pages, Jenkins, and Terraform.
  • Hands-on experience with Azure.

Required Skills & Experience

  • Strong experience in Core SRE practices, including reliability engineering, incident management, and automation.
  • Proven hands-on experience in Performance Engineering / Performance Testing for large-scale distributed systems.
  • Deep understanding and implementation experience with SLI / SLO / Error Budget frameworks.
  • Hands-on experience with containerization and orchestration (Docker, Kubernetes).
  • Strong background in monitoring, observability, and logging, with tools such as Prometheus, Grafana, Datadog, Splunk, and the ELK Stack.
  • Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
  • Proficiency in scripting and automation using Python, Bash, Terraform, Ansible.
  • Strong troubleshooting skills across application, infrastructure, and network layers.
  • Experience designing and running incident response and post-mortem reviews.
  • Ownership mindset with accountability for service reliability and customer outcomes.
  • Excellent communication, collaboration, and stakeholder management skills.

Nice to Have (SRE+ Skills)

  • Experience with Keptn or similar tools for automated SLO-based quality gates and continuous delivery.
  • Programming experience in Java, especially for debugging, performance profiling, or building automation tools.
  • Familiarity with chaos engineering practices and tools.
  • Experience working in banking, payments, or capital markets domains.
  • Knowledge of security best practices and regulatory compliance in enterprise environments.

Responsibilities

What You Will Be Doing

Core SRE & Reliability Engineering

  • Design, implement, and operate highly available, resilient, and scalable systems aligned with SRE best practices.
  • Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance reliability and delivery velocity.
  • Build and maintain service health dashboards to provide real-time visibility into platform stability and customer experience.
  • Reduce toil through extensive automation of operational workflows, alerts, and remediation activities.

Monitoring, Observability & Service Health

  • Design and maintain end-to-end monitoring and observability solutions covering infrastructure, applications, APIs, and user journeys.
  • Implement advanced alerting strategies to reduce noise and improve mean time to detect (MTTD) and mean time to resolution (MTTR).
  • Leverage metrics, logs, and traces to drive root cause analysis and proactive incident prevention.
  • Enable reliability reporting for stakeholders using SLO compliance and service health metrics.

Performance Engineering & Testing

  • Lead performance engineering initiatives, including load testing, stress testing, endurance testing, and capacity validation.
  • Identify performance bottlenecks across application, middleware, database, and infrastructure layers.
  • Conduct capacity planning and performance tuning to support business growth and peak traffic scenarios.
  • Partner with development and QA teams to embed performance testing into CI/CD pipelines.

Incident Management & Operations

  • Lead and participate in incident response activities, including triage, mitigation, recovery, and post-incident reviews.
  • Drive blameless post-mortems and ensure corrective actions are tracked to completion.
  • Participate in on-call rotations, providing 24x7 support for critical production systems.
  • Continuously improve operational readiness and resilience.

Automation, CI/CD & Cloud Operations

  • Design and manage deployment pipelines, configuration management, and environment consistency across lower and production environments.
  • Implement Infrastructure as Code (IaC) practices for repeatable and secure cloud provisioning.
  • Collaborate with DevOps teams to improve deployment reliability, rollback mechanisms, and release safety.
  • Develop and test disaster recovery plans, backup strategies, and failover mechanisms.

Collaboration & Governance

  • Work closely with Development, QA, DevOps, Security, and Product teams to align on reliability and performance goals.
  • Ensure platforms meet security, compliance, and regulatory requirements common in financial services.
  • Act as a reliability and performance advocate throughout the SDLC.

More Info

Job ID: 148223233
