Professional Services – Support Practice
Full-time
Remote (India / US / EU)
Experience: 2–5 years
About Lyzr
Lyzr.ai's agentic AI platform powers intelligent, autonomous workflows for enterprise clients. Production Support Engineers are the front line that keeps those workflows healthy — triaging incidents, resolving tickets, digging into logs, and escalating the right issues to the right teams before clients feel the pain.
This role suits someone who thrives in a fast-paced technical environment, takes ownership seriously, and genuinely enjoys the detective work of diagnosing why something broke in production. You will work within a global follow-the-sun support model, reporting to the Production Support Lead.
What You'll Do
Incident response & triage
- Monitor production dashboards and alerts; acknowledge, classify (P1–P3), and triage incoming incidents within SLA response windows.
- Perform first-level diagnosis using logs, traces, and monitoring tools (Datadog / Grafana / CloudWatch) to isolate the root cause or rule out environmental issues.
- Execute approved runbook steps to resolve known issues independently; escalate novel or high-severity issues to the Lead with a clear diagnostic summary.
- Maintain accurate, time-stamped ticket updates throughout the incident lifecycle so clients and internal stakeholders always have visibility.
Service request fulfilment
- Handle client service requests (configuration changes, access provisioning, agent re-deployments, and data queries) within approved change-management guardrails.
- Validate and document completed requests, ensuring audit trails are maintained in the ticketing system.
- Identify recurring requests that could be automated or self-served, and flag them to the Lead for process improvement.
Monitoring & proactive health checks
- Run scheduled health checks on production agent pipelines, API integrations, and data connectors; raise pre-emptive alerts for degradation trends.
- Maintain and update monitoring dashboards; propose new alert thresholds based on observed patterns.
- Participate in post-mortems and contribute findings to the known-error database and runbooks.
Knowledge & collaboration
- Document solutions to new issues in the internal knowledge base; keep existing runbooks accurate and up to date.
- Collaborate with Engineering, Platform, and Customer Success teams during handoffs, providing clear reproduction steps and log artefacts.
- Participate in the shift-based on-call rotation; you are expected to be available for P1 escalations during your assigned windows.
What You Bring
Experience: 2–5 years in application / production support or a NOC environment
Domain: SaaS or cloud-hosted platform support; AI/ML familiarity a strong plus
Technical: Log analysis, API debugging, SQL queries, basic Python / shell scripting
Monitoring: Datadog, Grafana, CloudWatch, or equivalent observability tools
Ticketing: Jira Service Management, ServiceNow, or Zendesk
Cloud basics: AWS / GCP / Azure fundamentals; Docker / Kubernetes awareness
Additionally, You Will Have
- A methodical, structured approach to troubleshooting — you document what you tried, not just what worked.
- Clear written communication: ticket updates, client-facing messages, and handover notes that leave no ambiguity.
- Comfort working across time zones and collaborating asynchronously with distributed teams.
- Bonus: exposure to LLM-based or agentic AI systems, prompt engineering, or RAG pipelines in production.
- Bonus: ITIL Foundation certification or equivalent incident management training.