Description
About the role:
The L3 Production Support Engineer is a backend-focused full-stack incident SME responsible for owning complex production incidents, driving root cause analysis, and implementing systemic improvements for the agentic on-call management platform. This role bridges incident command, deep backend engineering, and targeted frontend troubleshooting to ensure platform reliability at scale.
What You'll Do
Incident Management & Leadership
- Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution
- Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI
- Establish and refine incident response playbooks, runbooks, and escalation procedures
- Participate in on-call rotation as primary/secondary responder with accountability for critical systems
Backend & Infrastructure Expertise
- Perform deep production troubleshooting: log analysis, distributed tracing, metric correlation, and profiling under pressure
- Diagnose and fix complex issues across microservices: scheduling engine, LLM orchestration, notification pipeline, and integrations
- Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints
- Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale
- Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents
Full-Stack Incident Resolution
- Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)
- Debug and ship targeted React fixes when UI is the fastest path to incident resolution
- Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling
- Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes
Observability & Continuous Improvement
- Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform
- Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge
- Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices
- Drive blameless post-mortems and systemic risk reduction across the platform
On your first day, we'll expect you to have:
Backend (Primary Focus)
- 5-8+ years in backend engineering with strong hands-on experience in Python/FastAPI or equivalent
- Deep knowledge of async APIs, background jobs, message queues (Celery, RabbitMQ, or similar), and distributed scheduling
- Production-grade database skills: PostgreSQL query optimization, locking, migrations, and performance tuning
- Redis expertise: caching patterns, rate limiting, streams, and pub/sub for real-time systems
- Strong observability and on-call mindset: designing alerts, understanding SLOs/SLIs, error budgets, and Sev definitions
- Proficiency with Kubernetes, Docker, container orchestration, and CI/CD pipelines (Jenkins, Bitbucket, GitHub Actions)
- Understanding of cloud infrastructure (Azure preferred) and networking fundamentals
LLM & Agentic Systems
- Solid grasp of LLM orchestration concepts: prompt engineering, tool-calling, context windows, rate limits, and vendor-specific behavior
- Experience with LLM failure modes: hallucinations, token limits, timeout patterns, and cost/latency tradeoffs
- Knowledge of agent frameworks (LangGraph or similar) and how they compose across microservices
- Ability to debug LLM-driven flows: tracing prompts, understanding retry/backoff behavior, and validating tool outputs
Full-Stack (Secondary But Required)
- 2-3+ years hands-on with React and TypeScript in production environments
- Competency reading and modifying existing React code: components, hooks, routing, state management (Redux/Context)
- Browser debugging skills: DevTools, React DevTools, network throttling, and performance profiling
- Ability to implement targeted UI fixes: form validation, error handling, API error display, and minor UX hardening
- Familiarity with frontend build pipelines: Webpack/Vite, environment configs, feature flags, and deployment strategies
Logging, Metrics & Troubleshooting
- Expert-level log parsing and correlation across services using structured logging (JSON, correlation IDs)
- Proficiency with observability platforms (Prometheus, Grafana, Datadog, New Relic, or similar)
- Ability to construct and execute production queries under incident time pressure
- Strong scripting skills (bash/Python) for diagnostics, automation, and custom monitoring
Required Soft Skills
- Incident command maturity: composure under pressure, clear communication, and decisiveness during critical outages
- Technical depth with breadth: deep backend knowledge plus sufficient full-stack awareness to own end-to-end incidents
- Mentorship mindset: capable of raising L2 engineers through code review, pairing, and RCA participation
- Documentation discipline: ability to capture runbooks, architecture decisions, and lessons learned clearly
- Cross-functional collaboration: working effectively with dev, SRE, platform, and business teams during incidents
Experience Requirements
- Minimum 6-10 years in backend/platform/SRE roles, with at least 3 years in production support, incident response, or on-call engineering
- Proven track record leading Sev-1/Sev-2 incidents in distributed, multi-service systems
- Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools)
- Comfortable with continuous on-call rotation and on-demand availability for critical incidents
(ref:hirist.tech)