Site Reliability Engineer

IBM

Bengaluru, India

2-5 Years

Posted 5 hours ago

Job Description

Introduction

IBM is seeking an experienced Site Reliability Engineer (SRE) to play a critical role in ensuring the reliability, availability, and performance of the IBM Quantum platform. In this role, you will collaborate closely with our quantum services development teams to design, build, monitor, and scale systems that power one of the world's most advanced quantum computing platforms.

As an SRE within IBM Quantum, you are part of the frontline ensuring seamless operations, rapid recovery, and user trust through operational excellence. Every day brings new engineering challenges, ranging from incident response to building automation, enhancing observability, and driving system-wide reliability improvements. You will support researchers, developers, and enterprise users exploring the future of computing while applying modern SRE principles to a deep-tech environment.

This role is ideal for true SRE practitioners: engineers who have done SRE as their primary job.

Your Role And Responsibilities

Ensure high availability, resilience, and scalability of IBM Quantum platforms and services.
Lead incident response, participate in war room activities, and drive post-incident reviews and corrective actions.
Collaborate with development teams to debug, deploy, and maintain quantum workloads and backend services.
Establish, refine, and maintain observability across logs, metrics, traces, and alerting systems.
Design and build internal tools, automations, and operational workflows to improve efficiency and reduce toil.
Champion operational ownership, ensuring every quantum job runs reliably with full traceability.
Drive platform-wide improvements using operational insights, incident learnings, and reliability patterns.

Preferred Education

Master's Degree

Required Technical And Professional Expertise

2-5 years of proven professional experience specifically as a Site Reliability Engineer.
Strong systems-thinking ability to correlate logs, traces, metrics, and code across distributed workloads.
Hands-on experience with incident management, production operations, and on-call responsibilities.
Experience with modern observability tools (Grafana, Sysdig, Jaeger, etc.).
Familiarity with Kubernetes, Linux internals, and programming in Python or Go.
Ability to work across development, infrastructure, and platform teams.
Ability to transform incident learnings into automation, fixes, or architectural improvements.
Understanding of SLI/SLO/SLA frameworks and reliability metrics.

Preferred Technical And Professional Experience