About Business Unit:
At the core of all that Epsilon does is a team that sets the foundation of our IT infrastructure. The team drives innovation and efficiency through pioneering technology across Epsilon's platforms and business verticals. From being the first point of contact for infrastructure needs to final deployment, the team provides end-to-end solutions for our client-facing platforms. ETS supports all aspects of revenue-generating platforms for Epsilon and sets the architectural direction for our enterprise deployments. By adopting the newest technologies, such as Cloud, Automation, and Artificial Intelligence, the team is at the forefront of redefining our digital business and capturing new opportunities.
Why we are looking for you:
- You will oversee operational activities and proactively identify and implement service improvements within the Operations Centre.
- You should have good knowledge of Kubernetes, with hands-on experience using platforms such as Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS).
- You understand and have experience setting up enterprise observability and alerting, and managing incident command at scale.
- You should have a deep understanding of Linux, networking protocols (TCP/IP, DNS), and distributed systems.
- You should have strong skills with monitoring tools (e.g., OpsRamp, ThousandEyes, Grafana, Prometheus) and in establishing observability practices.
- You should have cross-functional collaboration skills to align with product and engineering partners.
- You should define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- You should act as an escalation point for critical production outages.
What you will enjoy in this role:
- Leading a critical Operations Command Center supporting enterprise-scale infrastructure and services
- Developing internal processes and procedures, based on industry best practices, that enhance operational effectiveness and build and maintain an operational centre of excellence
- Working closely with global clients and internal partners in a high-visibility role
- Proactively identifying service improvements and operational efficiencies
- Driving continuous service improvement and operational excellence initiatives
- Operating in a fast-paced, dynamic environment that values accountability and leadership
Responsibilities
- Lead, mentor, and upskill a high‑performing DevOps / SRE‑oriented operations team, fostering strong engineering ownership
- Partner with engineering teams to improve CI/CD reliability, release safety, and change automation.
- Own the observability strategy across metrics, logs, and traces to provide real‑time and predictive insights into system health.
- Ensure effective use and optimization of monitoring, alerting, and log analytics platforms (e.g., ELK, Splunk, Zabbix, OpsRamp).
- Continuously tune alerts to minimize noise and improve signal quality, enabling faster detection and resolution.
- Ensure consistent observability of platforms and services to maintain optimal client product uptime.
- Ensure incident, problem, and change processes are lightweight, automated, and outcome‑focused, aligned with ITIL where appropriate.
- Drive service continuity, resilience testing, and disaster recovery readiness.
- Maintain regular contact with peer teams to keep operations running smoothly.
- Participate in root cause analysis (RCA) of issues and address any monitoring or process gaps identified during major incidents.
- Enable Operations teams through coaching and hands-on guidance to build capabilities and achieve strategic goals.
- Ensure all incidents and requests are processed using documented processes and that SLA/OLA targets are met.
- Maintain consistent ticket quality through regular quality assessments.
- Present management-level reports through weekly and monthly summaries of incidents, tickets, and projects, along with any challenges observed.
- Evaluate production change proposals and ensure smooth, well-governed implementation through deliberate, risk-aware decisions.
- Ensure service continuity plans are compatible with operational delivery centre operations and are tested regularly.
- Review and analyse all Operations management toolsets for best-of-breed environmental event management.
- Coordinate with clients to prepare operational runbooks.
- Maintain operational logs and journals on all events, reported issues, warnings, alerts, and alarms, recording and classifying all messages.
- Maintain operational documentation, processes, and management and diagnostic tools, ensuring that services are maintained at the agreed levels.
- Ensure that maintenance tasks are completed as per procedural documentation for scheduled BAU tasks and client-specific infrastructure.
Qualifications
- Bachelor's degree in Engineering, Computer Science, IT, or an equivalent field
- 12 to 15 years of related experience
- Strong verbal/written communication
- Command Centre/NOC/SOC experience
- Familiarity with application lifecycle and IT Service Management concepts.