Responsibilities
- Deploy and operate LLMs on GPUs (NVIDIA, cloud or on-prem).
- Run and tune inference servers such as vLLM, TGI, SGLang, Triton, or equivalents.
- Make capacity planning decisions (a back-of-envelope sketch follows this list):
- how many GPUs are required for X RPS
- when to shard, batch, or queue
- how to balance latency vs throughput vs cost
- Design clean, production-grade APIs (FastAPI / gRPC / REST) that expose AI capabilities.
- Handle request validation, batching, timeouts, streaming responses, and multi-tenant isolation (a minimal endpoint sketch also follows this list).
- Abstract model details so downstream teams don't deal with GPU / model complexity.
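For illustration, here is the kind of back-of-envelope capacity reasoning this involves. Every number below (target RPS, tokens per request, per-GPU throughput, headroom) is an assumed placeholder, not a benchmark for any particular model or GPU.

```python
import math

# Back-of-envelope: GPUs needed for a target request rate.
# All figures are illustrative assumptions, not measurements.
TARGET_RPS = 20            # steady-state requests per second (assumed)
TOKENS_PER_REQUEST = 700   # prompt + completion tokens per request (assumed)
GPU_TOKENS_PER_SEC = 2500  # per-GPU generation throughput with batching (assumed)
HEADROOM = 0.7             # plan for ~70% utilization to absorb bursts

tokens_per_sec_needed = TARGET_RPS * TOKENS_PER_REQUEST             # 14,000 tok/s
effective_per_gpu = GPU_TOKENS_PER_SEC * HEADROOM                   # 1,750 tok/s
gpus_needed = math.ceil(tokens_per_sec_needed / effective_per_gpu)  # -> 8

print(f"~{gpus_needed} GPUs for {TARGET_RPS} RPS at {TOKENS_PER_REQUEST} tok/req")
```

The same arithmetic runs in reverse when deciding whether a latency, throughput, or cost target is better served by batching, sharding, or more replicas.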
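And a minimal sketch of what request validation, per-request timeouts, and streaming can look like behind a FastAPI endpoint. The route, the limits, and `generate_stream` are hypothetical placeholders for whatever serving backend (vLLM, TGI, Triton, ...) sits behind the API.

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

app = FastAPI()

class CompletionRequest(BaseModel):
    # Validation: reject empty or oversized prompts before they reach a GPU.
    prompt: str = Field(min_length=1, max_length=8000)
    max_tokens: int = Field(default=256, ge=1, le=2048)

async def generate_stream(req: CompletionRequest):
    # Placeholder for a call into the real inference backend.
    for token in ["hello", "from", "the", "model"]:
        await asyncio.sleep(0.05)
        yield token + " "

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    async def body():
        gen = generate_stream(req)
        try:
            while True:
                # Per-chunk timeout so a stuck generation can't hold the connection forever.
                yield await asyncio.wait_for(gen.__anext__(), timeout=30.0)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError:
            yield "\n[error: generation timed out]\n"

    return StreamingResponse(body(), media_type="text/plain")
```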
Design For High Concurrency And Asynchronous Execution
- Architect systems that do not assume synchronous execution.
- Use queues, workers, and async pipelines for long-running inference, multi-step AI workflows, and fan-out / fan-in patterns.
- Reason clearly about backpressure, retries, idempotency, and failure isolation (a minimal worker sketch follows this list).
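A minimal sketch of that queue-and-worker shape, assuming asyncio: the bounded queue provides backpressure, the job id keeps retried work idempotent, and a failed job is isolated instead of killing its worker. `run_inference` and the limits are illustrative stand-ins, not a real backend.

```python
import asyncio
import random

QUEUE_MAX = 100      # bounded queue -> backpressure instead of unbounded memory growth
MAX_RETRIES = 2

async def run_inference(job_id: str) -> str:
    await asyncio.sleep(0.1)                 # stand-in for a long-running model call
    if random.random() < 0.2:
        raise RuntimeError("transient failure")
    return f"result-for-{job_id}"

async def worker(queue: asyncio.Queue, results: dict):
    while True:
        job_id = await queue.get()
        try:
            for attempt in range(MAX_RETRIES + 1):
                try:
                    # Results are keyed by job_id, so a retried job overwrites
                    # its own slot rather than duplicating work downstream.
                    results[job_id] = await run_inference(job_id)
                    break
                except RuntimeError:
                    if attempt == MAX_RETRIES:
                        results[job_id] = "failed"   # isolate the failure; keep the worker alive
        finally:
            queue.task_done()

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX)
    results: dict = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(4)]
    for i in range(20):
        await queue.put(f"job-{i}")          # blocks when the queue is full (backpressure)
    await queue.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    print(results)

asyncio.run(main())
```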
Scale AI Systems Reliably
- Decide when to scale vertically vs horizontally.
- Understand GPU utilization, memory constraints, and contention.
- Implement:
- autoscaling logic
- rate limiting
- admission control
- graceful degradation under load (see the admission-control sketch after this section's lists)
- Build observability around:
- latency
- queue depth
- GPU utilization
- error modes specific to LLMs
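A sketch of what admission control and graceful degradation can look like at the API edge, again assuming FastAPI. `MAX_IN_FLIGHT` and the route are illustrative; a production version would meter queue depth and GPU metrics rather than a bare semaphore.

```python
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()

MAX_IN_FLIGHT = 8                    # assumed concurrency the GPU backend can sustain
inflight = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.post("/v1/generate")
async def generate(payload: dict):
    # Coarse admission check: if all slots are taken, shed load with a 429
    # instead of letting an unbounded queue build up behind the GPUs.
    # (A request can still briefly wait if a slot fills between the check
    # and the acquire; this is a sketch, not a tuned implementation.)
    if inflight.locked():
        raise HTTPException(status_code=429, detail="server busy, retry later")
    async with inflight:
        await asyncio.sleep(0.2)     # stand-in for the actual model call
        return {"status": "ok"}
```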
Collaborate With AI Researchers (Without Becoming One)
- Provide infra, abstractions, and guardrails so researchers can:
- swap models
- test fine-tuned variants
- ship improvements safely
- Translate research artifacts into production systems.
Requirements
- Strong backend engineering fundamentals (distributed systems, async execution).
- Proficiency in Python (Golang acceptable).
- Hands-on experience running GPU workloads in production.
- Experience designing APIs over compute-heavy systems.
- Comfort with Docker + Kubernetes (or equivalent orchestration).
- Practical understanding of queues, workers, and background processing.
- Ability to reason quantitatively about throughput, latency, and cost.
Strong Signals (These Matter More Than Buzzwords)
- You have personally:
- deployed a model on GPU
- debugged GPU memory / OOM issues
- handled inference latency regressions
- redesigned a synchronous system into an async one
- You can explain:
- how batching affects latency (a worked example follows this list)
- why GPU utilization is often low
- how to prevent a single slow request from killing throughput
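As a worked example of the batching trade-off, with purely illustrative timings: a larger batch adds a queueing window and slightly slower decode steps for each request, but multiplies aggregate token throughput.

```python
# Toy numbers only; real figures depend on the model, GPU, and server.
BATCH_WINDOW_MS = 50       # time the server waits to collect a batch (assumed)
STEP_MS_BATCH_1 = 30       # decode step time at batch size 1 (assumed)
STEP_MS_BATCH_8 = 45       # decode step time at batch size 8 (assumed)
TOKENS = 100               # tokens generated per request

# Batch size 1: no queueing delay, but each GPU step serves a single request.
latency_b1 = STEP_MS_BATCH_1 * TOKENS                     # 3,000 ms per request
throughput_b1 = 1000 / STEP_MS_BATCH_1                    # ~33 tok/s total

# Batch size 8: extra wait to fill the batch and slower steps,
# but 8 requests share every step -> much higher aggregate throughput.
latency_b8 = BATCH_WINDOW_MS + STEP_MS_BATCH_8 * TOKENS   # 4,550 ms per request
throughput_b8 = 8 * 1000 / STEP_MS_BATCH_8                # ~178 tok/s total

print(latency_b1, throughput_b1, latency_b8, throughput_b8)
```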
Nice To Have (But Not a Substitute)
- Familiarity with:
- LangChain / LlamaIndex (as orchestration layers, not magic)
- Vector DBs (Qdrant, Pinecone, Weaviate)
- CI/CD for ML or AI systems
- Experience in multi-tenant enterprise systems.
What You Should Have Done In The Past
- Built or operated production AI or ML systems, not just demos.
- Scaled a backend system where compute is the bottleneck.
- Designed systems that continue to work under load and partial failure.
- Worked closely with researchers or data scientists to productionize models.
Who Will NOT Be a Good Fit
- Engineers who have only used hosted APIs (OpenAI / Anthropic) and never run models themselves.
- Candidates whose experience is limited to prompt engineering or UI-level AI features.
- Engineers uncomfortable reasoning about infrastructure, queues, and capacity.