Professional Services - Service Practice
Full-time
Remote (India / US / EU)
Experience: 5–8 years
About Lyzr
Lyzr.ai's agentic AI platform powers intelligent, autonomous workflows for enterprise clients. Production Support Engineers are the front line that keeps those workflows healthy — triaging incidents, resolving tickets, digging into logs, and escalating the right issues to the right teams before clients feel the pain.
This role suits someone who thrives in a fast-paced technical environment, takes ownership seriously, and genuinely enjoys the detective work of diagnosing why something broke in production. You will work within a global follow-the-sun support model, reporting to the Production Support Lead.
What You'll Do
Incident command & escalation
- Own the full incident lifecycle — detection, triage, war-room coordination, resolution, and post-mortem — for P1/P2 issues across all production tenants.
- Act as the primary escalation point for Production Support Engineers; make the call on severity reclassification and client communication timing.
- Drive RCA completion within SLA windows and ensure corrective actions are tracked to closure in Jira/Confluence.
- Maintain and continuously improve the P1 runbook library, escalation trees, and on-call rotation schedules.
Team leadership & operations
- Manage and mentor a team of 3–6 Production Support Engineers; run weekly 1:1s, set KPIs, and own the performance review cycle.
- Build and optimise the shift rota for 24x7x365 follow-the-sun coverage across India, EU, and US time zones.
- Define and track operational metrics: MTTR, SLA attainment by priority tier, re-open rate, and backlog aging.
- Partner with Engineering and Platform teams to advocate for supportability improvements, observability tooling, and bug-fix prioritisation.
Client & commercial accountability
- Serve as the named support contact for strategic accounts during critical incidents; provide executive-level written updates under pressure.
- Review monthly SLA performance reports with client stakeholders; identify systemic patterns and propose proactive remediation.
- Contribute to SLA definition in new SOWs, ensuring commitments are operationally deliverable.
- Support the renewal and expansion process by demonstrating support maturity and service quality data.
Process & tooling
- Own the support toolchain: ticketing (Jira Service Management or equivalent), monitoring dashboards, alerting rules, and on-call tooling (PagerDuty / OpsGenie).
- Establish knowledge management practices — internal runbooks, known-error database, and a tiered FAQ — to reduce repeat escalations to Engineering.
- Define and enforce severity classification criteria and ticket hygiene standards across the team.
What You Bring
- Experience: 5–8 years in production/application support; 2+ years in a lead or senior role
- Domain: SaaS / AI / ML platform support; ideally agentic or LLM-based systems
- Incident management: ITIL Foundation or equivalent; proven P1 incident commander
- Tooling: Jira SM, PagerDuty / OpsGenie, Datadog / Grafana, Confluence
- Leadership: Direct team management experience; mentoring junior engineers
- Communication: Executive-level written updates under high-pressure conditions
Additionally, You Will Have
- Hands-on familiarity with cloud infrastructure (AWS / GCP / Azure) and container environments (Kubernetes, Docker).
- Ability to read logs, traces, and basic Python/SQL to independently diagnose issues before engaging Engineering.
- Bonus: experience supporting multi-tenant SaaS at scale, or prior work with AI/ML pipelines in production.
- Bonus: familiarity with enterprise client SLA frameworks — P1/P2/P3 tiering, OLA/UC structures.