
We are looking for a Product Manager to own and evolve our Service Assurance
Platform — the system that ensures every service is observable, reliable, and
supported at scale.
This platform sits at the core of our operations, enabling real-time visibility, incident
response, change governance, and customer communication across all services and
regions.
The focus of this role is to improve reliability, reduce operational friction, and
increase customer trust by making the platform consistent, measurable, and
scalable.
What You Will Own
You will own the end-to-end product strategy and evolution of the Service Assurance
platform, including:
Observability & Monitoring
A unified system for logs, metrics, and traces with alerting, retention, and data access
capabilities across all services.
Service Health & SLA Reporting
Accurate, real-time visibility into service status, uptime, and SLA performance for every
product and region.
Incident Management & Remediation
Standardized workflows to detect, triage, and resolve incidents, supported by
automation, runbooks, and full auditability.
Change Governance
Consistent processes to test, approve, and roll out changes safely, minimizing risk to
production systems.
Customer Communication & Support
Reliable delivery of alerts, maintenance updates, and incident notifications, along with
structured case management and escalation handling.
Documentation & Knowledge
A centralized, versioned, and searchable documentation system covering all services
and APIs.
Key Responsibilities
• Define and drive the platform roadmap focused on reliability and operational
excellence
• Establish standard workflows and practices across:
o Observability
o Incident management
o Change management
• Improve key operational outcomes:
o Faster detection and resolution of incidents
o Higher service availability and SLA compliance
o Better customer communication and transparency
• Define and track platform-level metrics (e.g., time to detect, time to resolve,
uptime)
• Ensure the platform scales consistently across services and regions
• Partner closely with engineering, SRE, and support teams to drive adoption and
execution
• Identify gaps and eliminate fragmentation across tools and processes
What We're Looking For
Required
• 6–10+ years of product management or equivalent experience
• Experience working on reliability, observability, or operational platforms
• Strong understanding of:
o Distributed systems
o Monitoring and alerting
o Incident response workflows
• Experience in at least two of:
o Observability platforms (logs, metrics, tracing)
o Incident or change management systems
o Customer support or case management platforms
• Ability to work closely with engineering and SRE teams
• Strong systems thinking and operational mindset
Preferred
• Experience at a cloud provider or large-scale SaaS platform
• Familiarity with SRE practices (SLOs, SLIs, error budgets)
• Exposure to observability tools (Prometheus, Grafana, OpenTelemetry, ELK)
• Experience with ITSM platforms (e.g., ServiceNow, Jira Service Management)
• Technical background (engineering or equivalent)
What Success Looks Like
Job ID: 145692815