Calfus Inc - L3 Production Support Engineer

Calfus Inc.

Pune, India

6-10 Years

Job Description

About the role:

The L3 Production Support Engineer is a backend-focused full-stack incident SME responsible for owning complex production incidents, driving root cause analysis, and implementing systemic improvements for the agentic on-call management platform. This role bridges incident command, deep backend engineering, and targeted frontend troubleshooting to ensure platform reliability at scale.

What You'll Do

Incident Management & Leadership

Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution
Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI
Establish and refine incident response playbooks, runbooks, and escalation procedures
Participate in on-call rotation as primary/secondary responder with accountability for critical systems

Backend & Infrastructure Expertise

Perform deep production troubleshooting: log analysis, distributed tracing, metric correlation, and profiling under pressure
Diagnose and fix complex issues across microservices: scheduling engine, LLM orchestration, notification pipeline, and integrations
Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints
Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale
Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents

Full-Stack Incident Resolution

Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)
Debug and ship targeted React fixes when UI is the fastest path to incident resolution
Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling
Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes

Observability & Continuous Improvement

Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform
Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge
Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices
Drive blameless post-mortems and systemic risk reduction across the platform

On your first day, we'll expect you to have:

Backend (Primary Focus)

5-8+ years in backend engineering with strong hands-on experience in Python/FastAPI or equivalent
Deep knowledge of async APIs, background jobs, message queues (Celery, RabbitMQ, or similar), and distributed scheduling
Production-grade database skills: PostgreSQL query optimization, locking, migrations, and performance tuning
Redis expertise: caching patterns, rate limiting, streams, and pub/sub for real-time systems
Strong observability and on-call mindset: designing alerts, understanding SLOs/SLIs, error budgets, and Sev definitions
Proficiency with Kubernetes, Docker, container orchestration, and CI/CD pipelines (Jenkins, Bitbucket, GitHub Actions)
Understanding of cloud infrastructure (Azure preferred) and networking fundamentals

LLM & Agentic Systems

Solid grasp of LLM orchestration concepts: prompt engineering, tool-calling, context windows, rate limits, and vendor-specific behavior
Experience with LLM failure modes: hallucinations, token limits, timeout patterns, and cost/latency tradeoffs
Knowledge of agent frameworks (LangGraph or similar) and how they compose across microservices
Ability to debug LLM-driven flows: tracing prompts, understanding retry/backoff behavior, and validating tool outputs

Full-Stack (Secondary But Required)

2-3+ years hands-on with React and TypeScript in production environments
Competency reading and modifying existing React code: components, hooks, routing, state management (Redux/Context)
Browser debugging skills: DevTools, React DevTools, network throttling, and performance profiling
Ability to implement targeted UI fixes: form validation, error handling, API error display, and minor UX hardening
Familiarity with frontend build pipelines: Webpack/Vite, environment configs, feature flags, and deployment strategies

Logging, Metrics & Troubleshooting

Expert-level log parsing and correlation across services using structured logging (JSON, correlation IDs)
Proficiency with observability platforms (Prometheus, Grafana, Datadog, New Relic, or similar)
Ability to construct and execute production queries under incident time pressure
Strong scripting skills (Bash/Python) for diagnostics, automation, and custom monitoring

Required Soft Skills

Incident command maturity: composure under pressure, clear communication, and decisive action during critical outages
Technical depth with breadth: deep backend knowledge plus sufficient full-stack awareness to own end-to-end incidents
Mentorship mindset: capable of raising L2 engineers through code review, pairing, and RCA participation
Documentation discipline: ability to capture runbooks, architecture decisions, and lessons learned clearly
Cross-functional collaboration: working effectively with dev, SRE, platform, and business teams during incidents

Experience Requirements

6-10 years in backend/platform/SRE roles, with at least 3 years in production support, incident response, or on-call engineering
Proven track record leading Sev-1/Sev-2 incidents in distributed, multi-service systems
Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools)
Comfortable with continuous on-call rotation and on-demand availability for critical incidents

(ref:hirist.tech)