Infrastructure Associate Advisor

evernorth health services

Hyderabad, India

7-9 Years

Posted a day ago

Job Description

Infrastructure Engineering Associate Advisor

Position Overview

The Pharmacy Benefit Services+ Technology organization is seeking a Site Reliability Engineer (SRE) – Automation, Self‑Healing & AI/AIOps to join our team. This Band 4 Contributor role is a senior, hands‑on position responsible for driving enterprise reliability outcomes, reducing operational toil, and enabling scalable SRE adoption across both legacy platforms and modern cloud‑native systems.

In this role, you will lead the design and implementation of intelligent, automated, and AI‑assisted reliability solutions that ensure systems are resilient, observable, self‑healing, and continuously improving. You will operate at the intersection of software engineering, operations, automation, and AI, influencing how teams design, deploy, and operate production systems.

A core focus of this role is building automation‑first and agentic SRE capabilities, including:

Self‑healing workflows that automatically detect, diagnose, and remediate failures
AI‑driven operational intelligence (AIOps) for anomaly detection, alert correlation, incident triage, and guided remediation
Standardized SRE enablement platforms (SLO automation, reliability scorecards, FMEA workflows) that can be adopted at scale with minimal friction

You will collaborate closely with application teams, platform engineering, DevOps, infrastructure, QE, and IT leadership to embed reliability into the SDLC and runtime operations.

Your contributions will directly support:

Improved system availability and resilience through proactive reliability engineering and automation
Reduced incidents and faster MTTR via self‑healing and AI‑assisted operations
Higher developer productivity by eliminating manual operational toil
Faster, safer releases by integrating SRE controls into CI/CD pipelines
Measurable reliability improvements, including reductions in MTTD/MTTR, decreased incident frequency, improved SLO compliance, and healthier error‑budget consumption through automation and AI‑enabled operations

Responsibilities

Core SRE & Reliability Engineering

Define, implement, and operationalize SRE best practices including SLIs, SLOs, error budgets, reliability reviews, and operational readiness standards across multiple teams and platforms.
Act as a senior reliability engineer for mission‑critical systems, influencing architectural decisions to improve availability, scalability, and fault tolerance.
Lead blameless incident response, root‑cause analysis, and post‑incident reviews, ensuring systemic fixes and automation are prioritized.

Self‑Healing Automation

Design and implement self‑healing systems that:
  Automatically detect failures using telemetry and signals
  Diagnose probable root causes using rules and AI/ML models
  Execute automated remediation actions (restart, scale, reroute, rollback, configuration correction)
Build event‑driven automation workflows integrated with monitoring, CI/CD, and infrastructure platforms to reduce human intervention.
Develop and maintain automated runbooks and remediation pipelines that evolve based on historical incidents and outcomes.
Extend self‑healing automation to change and release workflows, including automated rollback, safeguarded change execution, and AI‑assisted change‑risk evaluation to reduce change‑related incidents.
Continuously identify and eliminate operational toil by replacing repetitive manual work with automation, self‑service tooling, and intelligent remediation.

AI / Agentic SRE (AIOps)

Apply AI and ML techniques to improve operational intelligence, including:
  Anomaly detection across metrics, logs, and traces
  Alert deduplication, correlation, and noise reduction
  Intelligent incident summarization and impact analysis
  Predictive failure detection and capacity‑risk forecasting
Implement agentic AI patterns where autonomous or semi‑autonomous agents:
  Continuously monitor system health
  Propose or execute remediation actions
  Learn from past incidents and operator feedback
Establish continuous learning feedback loops to tune AI models and agent behavior based on false positives, incident outcomes, and operator review.
Partner with platform and security teams to ensure responsible, secure, and compliant use of AI in production operations.

Observability & Telemetry

Establish observability‑by‑design standards for applications and platforms (metrics, logs, traces, events).
Improve signal quality and alerting strategies to focus on user and business impact, not infrastructure noise.
Build and maintain reliability dashboards and scorecards that provide real‑time and historical insights into service health.

Resilience, Performance & Validation

Drive resilience and fault‑tolerance validation using chaos engineering and controlled failure injection.
Partner with performance and platform teams to ensure systems meet performance, scalability, and recovery objectives.
Promote safe testing practices for legacy‑integrated systems (e.g., service virtualization where direct backend calls pose risk).

CI/CD & Platform Enablement

Embed SRE controls into CI/CD pipelines, including:
  SLO validation gates
  Automated canary analysis
  Release health checks and rollback triggers
Build reusable SRE platforms, templates, and onboarding kits that enable teams to adopt reliability practices with minimal manual effort.
Mentor engineers and act as a technical leader for SRE adoption across the organization.

Qualifications

Required Skills:

Site Reliability Engineering: Deep hands‑on experience with SLOs, error budgets, incident management, and production operations.
Automation & Software Engineering: Strong development skills in Python, Go, Java, or similar, with the ability to build production‑grade automation and services.
Self‑Healing Systems: Proven experience designing and implementing automated remediation and closed‑loop recovery workflows.
AI / AIOps: Experience applying AI/ML to operations, such as anomaly detection, alert correlation, predictive analysis, or intelligent remediation.
Observability: Expertise with platforms such as Dynatrace, Prometheus, Grafana, Splunk, AppDynamics, or equivalent.
Cloud & Distributed Systems: Strong understanding of AWS / Azure / GCP, microservices, and Kubernetes / OpenShift.
CI/CD & DevOps: Experience integrating reliability checks and automation into delivery pipelines.
Infrastructure as Code: Terraform, CloudFormation, or similar.
Legacy + Modern Engineering: Ability to support and modernize reliability practices across monoliths, batch jobs, messaging, and mainframe‑integrated systems.
Leadership & Influence: Ability to lead through influence, mentor others, and drive adoption across multiple teams.

Required Experience & Education:

Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent experience).
7+ years of experience in SRE, DevOps, platform engineering, or production software engineering roles.
Demonstrated success delivering enterprise‑scale automation, self‑healing, and reliability improvements.

Desired Experience:

Experience building or contributing to enterprise SRE enablement platforms (SLO automation, reliability scorecards, FMEA workflows).
Hands‑on experience with chaos engineering and resilience testing in production‑like environments.
Familiarity with ServiceNow / CMDB / service modeling to support operational readiness and dependency visibility.
Experience applying Generative AI for operational use cases such as runbook generation, incident summarization, and knowledge retrieval.
Demonstrated delivery of quantifiable reliability improvements (e.g., MTTR reduction, incident volume reduction, improved SLO adherence).
Experience mentoring engineers and shaping an automation‑first, reliability‑driven culture.

Location & Hours of Work

Full-time position, working 40 hours per week. Expected overlap with US hours as appropriate.
Primarily based in the Innovation Hub in Hyderabad, India in a hybrid working model (3 days WFO and 2 days WAH).

More Info

Job Type:

Permanent Job

Industry:

Other

Function:

Information Technology

Employment Type:

Full time

About Company

evernorth health services

Job Source: www.linkedin.com

Job ID: 148441771

Last Updated: 28-05-2026 05:20:16 PM
