Senior Site Reliability Engineer

Dunnhumby

Gurugram, India

6-8 Years

Job Description

dunnhumby is the global leader in Customer Data Science, partnering with the world's most ambitious retailers and brands to put the customer at the heart of every decision. We combine deep insight, advanced technology, and close collaboration to help our clients grow, innovate, and deliver measurable value for their customers.

dunnhumby employs nearly 2,500 experts in offices throughout Europe, Asia, Africa, and the Americas, working for transformative, iconic brands such as Tesco, Coca-Cola, Nestlé, Unilever and Metro.

Tesco Media is building a world-class, self-serve B2B advertising platform that enables retailers and brands to plan, activate, and measure omnichannel retail media campaigns.

Retail Media is transforming how advertisers connect with consumers through personalized and targeted campaigns across retailers' digital and physical touchpoints. Retail Media Measurement plays a pivotal role in ensuring the effectiveness of these campaigns, driving value for advertisers, retailers, and consumers alike.

We are seeking a Senior Site Reliability Engineer to drive observability, monitoring, and reliability across distributed systems, ensuring strong alignment with service health and customer impact.

You will lead incident response, root cause analysis, and post-incident improvements while partnering with teams to meet SLAs and operational goals.

In this role, you will champion automation, Infrastructure as Code, and best practices, while mentoring engineers and enhancing overall platform resilience and performance.

Key Responsibilities

Design and implement monitoring and observability strategies across services and infrastructure.
Develop dashboards, alerts, and metrics to improve system visibility and reliability.
Define alerting standards to ensure alerts represent real service impact.
Lead troubleshooting efforts during complex incidents and support root cause analysis.
Improve monitoring coverage for critical services and platform components.
Partner with service owners to align monitoring with SLAs, KPIs, and operational goals.
Lead post-incident reviews and drive improvements in monitoring and detection.
Automate operational workflows and reliability processes.
Mentor engineers and promote operational excellence within the team.
Collaborate with engineering, operations, and product teams to improve platform reliability.
Maintain infrastructure provisioning practices using Terraform and Infrastructure as Code.

Required Experience

6-8 years of experience in Site Reliability Engineering, platform operations, or infrastructure reliability.
Proven experience leading technical initiatives or guiding reliability practices within engineering teams.
Strong experience managing production environments and large-scale distributed systems.
Experience implementing Infrastructure as Code using Terraform.
Experience working with cloud platforms such as GCP or Azure.
Experience supporting high-availability systems in 24/7 operational environments.
Strong communication and stakeholder management skills.

Preferred Experience

Experience with observability platforms such as Grafana, Prometheus, Splunk, or New Relic.
Experience supporting media, streaming, or SaaS platforms at scale.
Exposure to advanced monitoring practices such as predictive monitoring or AIOps.

What you can expect from us

We won't just meet your expectations. We'll defy them. So you'll enjoy the comprehensive rewards package you'd expect from a leading technology company. But also, a degree of personal flexibility you might not expect. Plus, thoughtful perks, like flexible working hours and your birthday off.

You'll also benefit from an investment in cutting-edge technology that reflects our global ambition. But with a nimble, small-business feel that gives you the freedom to play, experiment and learn.

And we don't just talk about diversity and inclusion. We live it every day, with thriving networks including dh Gender Equality Network, dh Proud, dh Family, dh One, dh Enabled and dh Thrive as the living proof. We want everyone to have the opportunity to shine and perform at their best throughout our recruitment process. Please let us know how we can make this process work best for you.

Our approach to Flexible Working

At dunnhumby, we value and respect difference and are committed to building an inclusive culture by creating an environment where you can balance a successful career with your commitments and interests outside of work.

We believe that you will do your best at work if you have a work-life balance. Some roles lend themselves to flexible options more than others, so if this is important to you, please raise it with your recruiter, as we are open to discussing agile working opportunities during the hiring process.