Infrastructure Engineering Associate Advisor
Position Overview
The Pharmacy Benefit Services+ Technology organization is seeking a Site Reliability Engineer (SRE) – Automation, Self‑Healing & AI/AIOps to join our team. This Band 4 Contributor role is a senior, hands‑on position responsible for driving enterprise reliability outcomes, reducing operational toil, and enabling scalable SRE adoption across both legacy platforms and modern cloud‑native systems.
In this role, you will lead the design and implementation of intelligent, automated, and AI‑assisted reliability solutions that ensure systems are resilient, observable, self‑healing, and continuously improving. You will operate at the intersection of software engineering, operations, automation, and AI, influencing how teams design, deploy, and operate production systems.
A core focus of this role is building automation‑first and agentic SRE capabilities, including:
- Self‑healing workflows that automatically detect, diagnose, and remediate failures
- AI‑driven operational intelligence (AIOps) for anomaly detection, alert correlation, incident triage, and guided remediation
- Standardized SRE enablement platforms (SLO automation, reliability scorecards, FMEA workflows) that can be adopted at scale with minimal friction

You will collaborate closely with application teams, platform engineering, DevOps, infrastructure, QE, and IT leadership to embed reliability into the SDLC and runtime operations.
Your contributions will directly support:
- Improved system availability and resilience through proactive reliability engineering and automation
- Reduced incidents and faster MTTR via self‑healing and AI‑assisted operations
- Higher developer productivity by eliminating manual operational toil
- Faster, safer releases by integrating SRE controls into CI/CD pipelines
- Measurable reliability improvements, including reductions in MTTD/MTTR, decreased incident frequency, improved SLO compliance, and healthier error‑budget consumption through automation and AI‑enabled operations

Responsibilities
Core SRE & Reliability Engineering
- Define, implement, and operationalize SRE best practices including SLIs, SLOs, error budgets, reliability reviews, and operational readiness standards across multiple teams and platforms.
- Act as a senior reliability engineer for mission‑critical systems, influencing architectural decisions to improve availability, scalability, and fault tolerance.
- Lead blameless incident response, root‑cause analysis, and post‑incident reviews, ensuring systemic fixes and automation are prioritized.
Self‑Healing Automation
- Design and implement self‑healing systems that:
  - Automatically detect failures using telemetry and signals
  - Diagnose probable root causes using rules and AI/ML models
  - Execute automated remediation actions (restart, scale, reroute, rollback, configuration correction)
- Build event‑driven automation workflows integrated with monitoring, CI/CD, and infrastructure platforms to reduce human intervention.
- Develop and maintain automated runbooks and remediation pipelines that evolve based on historical incidents and outcomes.
- Extend self‑healing automation to change and release workflows, including automated rollback, safeguarded change execution, and AI‑assisted change‑risk evaluation to reduce change‑related incidents.
- Continuously identify and eliminate operational toil by replacing repetitive manual work with automation, self‑service tooling, and intelligent remediation.
AI / Agentic SRE (AIOps)
- Apply AI and ML techniques to improve operational intelligence, including:
  - Anomaly detection across metrics, logs, and traces
  - Alert deduplication, correlation, and noise reduction
  - Intelligent incident summarization and impact analysis
  - Predictive failure detection and capacity‑risk forecasting
- Implement agentic AI patterns where autonomous or semi‑autonomous agents:
  - Continuously monitor system health
  - Propose or execute remediation actions
  - Learn from past incidents and operator feedback
- Establish continuous learning feedback loops to tune AI models and agent behavior based on false positives, incident outcomes, and operator review.
- Partner with platform and security teams to ensure responsible, secure, and compliant use of AI in production operations.
Observability & Telemetry
- Establish observability‑by‑design standards for applications and platforms (metrics, logs, traces, events).
- Improve signal quality and alerting strategies to focus on user and business impact, not infrastructure noise.
- Build and maintain reliability dashboards and scorecards that provide real‑time and historical insights into service health.
Resilience, Performance & Validation
- Drive resilience and fault‑tolerance validation using chaos engineering and controlled failure injection.
- Partner with performance and platform teams to ensure systems meet performance, scalability, and recovery objectives.
- Promote safe testing practices for legacy‑integrated systems (e.g., service virtualization where direct backend calls pose risk).

CI/CD & Platform Enablement
- Embed SRE controls into CI/CD pipelines, including:
  - SLO validation gates
  - Automated canary analysis
  - Release health checks and rollback triggers
- Build reusable SRE platforms, templates, and onboarding kits that enable teams to adopt reliability practices with minimal manual effort.
- Mentor engineers and act as a technical leader for SRE adoption across the organization.

Qualifications
Required Skills:
- Site Reliability Engineering: Deep hands‑on experience with SLOs, error budgets, incident management, and production operations.
- Automation & Software Engineering: Strong development skills in Python, Go, Java, or similar, with the ability to build production‑grade automation and services.
- Self‑Healing Systems: Proven experience designing and implementing automated remediation and closed‑loop recovery workflows.
- AI / AIOps: Experience applying AI/ML to operations, such as anomaly detection, alert correlation, predictive analysis, or intelligent remediation.
- Observability: Expertise with platforms such as Dynatrace, Prometheus, Grafana, Splunk, AppDynamics, or equivalent.
- Cloud & Distributed Systems: Strong understanding of AWS / Azure / GCP, microservices, and Kubernetes / OpenShift.
- CI/CD & DevOps: Experience integrating reliability checks and automation into delivery pipelines.
- Infrastructure as Code: Terraform, CloudFormation, or similar.
- Legacy + Modern Engineering: Ability to support and modernize reliability practices across monoliths, batch jobs, messaging, and mainframe‑integrated systems.
- Leadership & Influence: Ability to lead through influence, mentor others, and drive adoption across multiple teams.

Required Experience & Education:
- Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
- 7+ years of experience in SRE, DevOps, platform engineering, or production software engineering roles.
- Demonstrated success delivering enterprise‑scale automation, self‑healing, and reliability improvements.

Desired Experience:
- Experience building or contributing to enterprise SRE enablement platforms (SLO automation, reliability scorecards, FMEA workflows).
- Hands‑on experience with chaos engineering and resilience testing in production‑like environments.
- Familiarity with ServiceNow / CMDB / service modeling to support operational readiness and dependency visibility.
- Experience applying Generative AI for operational use cases such as runbook generation, incident summarization, and knowledge retrieval.
- Demonstrated delivery of quantifiable reliability improvements (e.g., MTTR reduction, incident volume reduction, improved SLO adherence).
- Experience mentoring engineers and shaping an automation‑first, reliability‑driven culture.

Location & Hours of Work
- Full-time position, working 40 hours per week. Expected overlap with US hours as appropriate.
- Primarily based in the Innovation Hub in Hyderabad, India, in a hybrid working model (3 days WFO and 2 days WAH).