lyzr ai

Production Support Engineer

  • Posted 6 hours ago

Job Description

Professional Services – Support Practice

Full-time

  • Remote (India / US / EU)
  • Experience: 2–5 years

About Lyzr

Lyzr.ai's agentic AI platform powers intelligent, autonomous workflows for enterprise clients. Production Support Engineers are the front line that keeps those workflows healthy — triaging incidents, resolving tickets, digging into logs, and escalating the right issues to the right teams before clients feel the pain.

This role suits someone who thrives in a fast-paced technical environment, takes ownership seriously, and genuinely enjoys the detective work of diagnosing why something broke in production. You will work within a global follow-the-sun support model, reporting to the Production Support Lead.

What You'll Do

Incident response & triage

  • Monitor production dashboards and alerts; acknowledge, classify (P1–P3), and triage incoming incidents within SLA response windows.
  • Perform first-level diagnosis using logs, traces, and monitoring tools (Datadog / Grafana / CloudWatch) to isolate the root cause or rule out environmental issues.
  • Execute approved runbook steps to resolve known issues independently; escalate novel or high-severity issues to the Lead with a clear diagnostic summary.
  • Maintain accurate, time-stamped ticket updates throughout the incident lifecycle so clients and internal stakeholders always have visibility.

Service request fulfilment

  • Handle client service requests (configuration changes, access provisioning, agent re-deployments, and data queries), all within approved change-management guardrails.
  • Validate and document completed requests, ensuring audit trails are maintained in the ticketing system.
  • Identify recurring requests that could be automated or self-served, and flag them to the Lead for process improvement.

Monitoring & proactive health checks

  • Run scheduled health checks on production agent pipelines, API integrations, and data connectors; raise pre-emptive alerts for degradation trends.
  • Maintain and update monitoring dashboards; propose new alert thresholds based on observed patterns.
  • Participate in post-mortems and contribute findings to the known-error database and runbooks.

Knowledge & collaboration

  • Document solutions to new issues in the internal knowledge base; keep existing runbooks accurate and up to date.
  • Collaborate with Engineering, Platform, and Customer Success teams during handoffs, providing clear reproduction steps and log artefacts.
  • Participate in the shift-based on-call rotation; you are expected to be available for P1 escalations during your assigned windows.

What You Bring

Experience: 2–5 years in application / production support or a NOC environment

Domain: SaaS or cloud-hosted platform support; AI/ML familiarity a strong plus

Technical: Log analysis, API debugging, SQL queries, basic Python / shell scripting

Monitoring: Datadog, Grafana, CloudWatch, or equivalent observability tools

Ticketing: Jira Service Management, ServiceNow, or Zendesk

Cloud basics: AWS / GCP / Azure fundamentals; Docker / Kubernetes awareness

Additionally, You Will Have

  • A methodical, structured approach to troubleshooting — you document what you tried, not just what worked.
  • Clear written communication: ticket updates, client-facing messages, and handover notes that leave no ambiguity.
  • Comfort working across time zones and collaborating asynchronously with distributed teams.
  • Bonus: exposure to LLM-based or agentic AI systems, prompt engineering, or RAG pipelines in production.
  • Bonus: ITIL Foundation certification or equivalent incident management training.

Job ID: 146430269
