Calfus Inc.

Calfus Inc - L3 Production Support Engineer

6-10 Years
  • Posted a month ago

Job Description

About the role:

The L3 Production Support Engineer is a backend-focused full-stack incident SME responsible for owning complex production incidents, driving root cause analysis, and implementing systemic improvements for the agentic on-call management platform. This role bridges incident command, deep backend engineering, and targeted frontend troubleshooting to ensure platform reliability at scale.

What You'll Do

Incident Management & Leadership

  • Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution
  • Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI
  • Establish and refine incident response playbooks, runbooks, and escalation procedures
  • Participate in on-call rotation as primary/secondary responder with accountability for critical systems

Backend & Infrastructure Expertise

  • Perform deep production troubleshooting: log analysis, distributed tracing, metric correlation, and profiling under pressure
  • Diagnose and fix complex issues across microservices: scheduling engine, LLM orchestration, notification pipeline, and integrations
  • Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints
  • Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale
  • Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents

Full-Stack Incident Resolution

  • Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)
  • Debug and ship targeted React fixes when UI is the fastest path to incident resolution
  • Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling
  • Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes

Observability & Continuous Improvement

  • Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform
  • Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge
  • Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices
  • Drive blameless post-mortems and systemic risk reduction across the platform

On your first day, we'll expect you to have:

Backend (Primary Focus)

  • 5-8+ years in backend engineering with strong hands-on experience in Python/FastAPI or equivalent
  • Deep knowledge of async APIs, background jobs, message queues (Celery, RabbitMQ, or similar), and distributed scheduling
  • Production-grade database skills: PostgreSQL query optimization, locking, migrations, and performance tuning
  • Redis expertise: caching patterns, rate limiting, streams, and pub/sub for real-time systems
  • Strong observability and on-call mindset: designing alerts, understanding SLOs/SLIs, error budgets, and Sev definitions
  • Proficiency with Kubernetes, Docker, container orchestration, and CI/CD pipelines (Jenkins, Bitbucket, GitHub Actions)
  • Understanding of cloud infrastructure (Azure preferred) and networking fundamentals

LLM & Agentic Systems

  • Solid grasp of LLM orchestration concepts: prompt engineering, tool-calling, context windows, rate limits, and vendor-specific behavior
  • Experience with LLM failure modes: hallucinations, token limits, timeout patterns, and cost/latency tradeoffs
  • Knowledge of agent frameworks (LangGraph or similar) and how they compose across microservices
  • Ability to debug LLM-driven flows: tracing prompts, understanding retry/backoff behavior, and validating tool outputs

Full-Stack (Secondary But Required)

  • 2-3+ years hands-on with React and TypeScript in production environments
  • Competency reading and modifying existing React code: components, hooks, routing, state management (Redux/Context)
  • Browser debugging skills: DevTools, React DevTools, network throttling, and performance profiling
  • Ability to implement targeted UI fixes: form validation, error handling, API error display, and minor UX hardening
  • Familiarity with frontend build pipelines: Webpack/Vite, environment configs, feature flags, and deployment strategies

Logging, Metrics & Troubleshooting

  • Expert-level log parsing and correlation across services using structured logging (JSON, correlation IDs)
  • Proficiency with observability platforms (Prometheus, Grafana, Datadog, New Relic, or similar)
  • Ability to construct and execute production queries under incident time pressure
  • Strong scripting skills (bash/Python) for diagnostics, automation, and custom monitoring

Required Soft Skills

  • Incident command maturity: composure under pressure, clear communication, and decisive action during critical outages
  • Technical depth with breadth: deep backend knowledge plus sufficient full-stack awareness to own end-to-end incidents
  • Mentorship mindset: capable of raising L2 engineers through code review, pairing, and RCA participation
  • Documentation discipline: ability to capture runbooks, architecture decisions, and lessons learned clearly
  • Cross-functional collaboration: working effectively with dev, SRE, platform, and business teams during incidents

Experience Requirements

  • Minimum 6-10 years in backend/platform/SRE roles, with at least 3 years in production support, incident response, or on-call engineering
  • Proven track record leading Sev-1/Sev-2 incidents in distributed, multi-service systems
  • Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools)
  • Comfortable with continuous on-call rotation and on-demand availability for critical incidents

(ref:hirist.tech)

Job ID: 142899837