About Ascendion
Ascendion is a leader in AI-powered software engineering, helping businesses innovate faster, smarter, and with greater impact. We partner with Global 2000 clients across North America, the UK, Europe, and APAC to solve complex challenges in data, experience design, software product engineering, and workforce transformation. Powered by expert engineers, thousands of AI agents, and our Engineering to the Power of AI™ (EngineeringAI) method, we deliver measurable outcomes that build trust, unlock value, and accelerate growth. Learn more at ascendion.com.
Ascendion | Engineering to elevate life
We have a culture built on opportunity, inclusion, and a spirit of partnership. Come, change the world with us:
- Build the coolest tech for the world's leading brands
- Solve complex problems – and learn new skills
- Experience the power of transforming digital engineering for Fortune 500 clients
- Master your craft with leading training programs and hands-on experience
Experience a community of change makers!
Join a culture of high-performing innovators with endless ideas and a passion for tech. Our culture is the fabric of our company, and it is what makes us unique and diverse. The way we share ideas, learning, experiences, successes, and joy allows everyone to be their best at Ascendion.
About The Role
We are seeking an experienced Service Delivery Manager (SDM) with a strong Application Support background and deep exposure to Site Reliability Engineering (SRE) principles. The role is responsible for end-to-end service delivery of large-scale, mission-critical applications, ensuring high availability, reliability, and performance through proactive monitoring, observability, incident management, and continuous improvement.
The ideal candidate brings hands-on understanding of production environments, leads 24/7 support operations, drives SLO/SLA adherence, and applies SRE best practices to reduce MTTR and MTTD, improve system resilience, and enhance customer experience.
Key Responsibilities
Service Delivery & Operations
- Own end-to-end service delivery for multiple critical applications, ensuring high availability, stability, and performance across production environments.
- Lead 24/7 application support operations (L2/L3), including on-call rotations, incident bridges, escalations, and stakeholder communications.
- Act as the primary escalation point for major incidents, driving resolution until closure and RCA sign-off.
- Ensure consistent adherence to SLAs, OLAs, and KPIs, with continuous tracking and reporting.
SRE & Reliability Engineering
- Apply SRE principles to application support, including:
  - Defining and tracking SLIs, SLOs, and error budgets
  - Improving reliability, scalability, and fault tolerance
- Drive initiatives to reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) through automation, improved alerting, and runbooks.
- Partner with engineering teams to eliminate toil and improve operational efficiency.
Monitoring, Observability & AIOps
- Lead the design and adoption of monitoring and observability platforms using tools such as Prometheus, Grafana, ELK, Dynatrace, and OpenTelemetry (OTel).
- Ensure end-to-end visibility across infrastructure, applications, and business transactions.
- Implement proactive monitoring, intelligent alerting, and anomaly detection to prevent incidents before business impact.
- Drive adoption of AIOps and automated incident management (auto-remediation, automated runbooks where applicable).
Incident, Problem & Change Management
- Lead Major Incident Management (MIM) including war rooms, stakeholder updates, and executive reporting.
- Ensure timely and high-quality Root Cause Analysis (RCA) with preventive action plans.
- Govern problem management to identify recurring issues and drive long-term fixes.
- Oversee change and release management, ensuring minimal risk to production systems.
Stakeholder & Vendor Management
- Act as a trusted partner for business stakeholders, engineering teams, vendors, and clients.
- Communicate service health, risks, and improvement plans to senior leadership.
- Manage third-party vendors and support partners, ensuring contract compliance and service quality.
Continuous Improvement & Transformation
- Drive operational excellence initiatives to improve uptime, performance, and customer satisfaction.
- Lead transformation programs involving cloud migration, DevOps, and SRE adoption.
- Identify automation opportunities to reduce manual effort and operational cost.
- Support DR planning, testing, and compliance with RTO/RPO requirements.
Experience
Required Experience & Skills
- 15+ years of experience in Application Support / Production Operations, with 5+ years in a Service Delivery Manager / SRE / Operations Leadership role.
- Proven experience managing large application portfolios (50+ applications) in enterprise environments.
- Strong background in banking, financial services, or large regulated enterprises preferred.
Technical & SRE Skills
- Deep understanding of SRE concepts: SLIs, SLOs, SLAs, error budgets, toil reduction.
- Hands-on experience with monitoring and observability tools (Grafana, Prometheus, ELK, Dynatrace, OTel).
- Solid understanding of cloud platforms (AWS/Azure/GCP), Kubernetes, and hybrid environments.
- Strong knowledge of incident management, RCA, and ITIL/ITSM processes.
- Experience working with CI/CD pipelines and DevOps-driven support models.
Leadership & Soft Skills
- Strong people management experience leading large, distributed support/SRE teams.
- Excellent stakeholder management and executive communication skills.
- Ability to lead under pressure during major incidents.
- Data-driven mindset with a focus on measurable reliability and performance outcomes.
- Change agent mindset with the ability to drive cultural adoption of SRE practices.
Success Metrics
- Reduction in MTTR and MTTD
- Improved SLA/SLO compliance
- Increased system availability and reliability (99.9%+)
- Reduced incident recurrence and operational toil
- Improved customer and stakeholder satisfaction