
AbleCredit - GenAI Infra for BFSI

AI Infrastructure & LLM Systems Engineer


Job Description

Responsibilities

  • Deploy and operate LLMs on GPUs (NVIDIA, cloud or on-prem).
  • Run and tune inference servers such as vLLM, TGI, SGLang, Triton, or equivalents.
  • Make capacity planning decisions: how many GPUs are needed for a given RPS target; when to shard, batch, or queue; and how to balance latency vs. throughput vs. cost.
  • Design clean, production-grade APIs (FastAPI / gRPC / REST) that expose AI capabilities; a sketch of such a layer follows this list.
  • Handle request validation, batching, timeouts, streaming responses, and multi-tenant isolation.
  • Abstract model details so downstream teams don't have to deal with GPU or model complexity.
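
To make the API bullets concrete, here is a minimal sketch of such a layer using FastAPI: Pydantic handles request validation, a per-token timeout bounds stuck requests, and the response streams back. The `generate_tokens` backend and all limits are illustrative placeholders, not a prescribed stack.

```python
# Minimal illustrative sketch; `generate_tokens` is a hypothetical stand-in
# for a call into an inference server (vLLM, TGI, etc.).
import asyncio
from typing import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

app = FastAPI()

class GenerateRequest(BaseModel):
    # Request validation: reject malformed payloads before they touch a GPU.
    prompt: str = Field(min_length=1, max_length=8192)
    max_tokens: int = Field(default=256, ge=1, le=4096)

async def generate_tokens(prompt: str, max_tokens: int) -> AsyncIterator[str]:
    """Placeholder for the real model backend."""
    for token in ("streamed", " ", "tokens"):
        await asyncio.sleep(0)
        yield token

@app.post("/v1/generate")
async def generate(req: GenerateRequest) -> StreamingResponse:
    async def stream() -> AsyncIterator[bytes]:
        gen = generate_tokens(req.prompt, req.max_tokens)
        while True:
            try:
                # Per-token timeout so one stuck request can't hang a client.
                token = await asyncio.wait_for(gen.__anext__(), timeout=30.0)
            except (StopAsyncIteration, asyncio.TimeoutError):
                return
            yield token.encode()
    return StreamingResponse(stream(), media_type="text/plain")
```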

Design For High Concurrency And Asynchronous Execution

  • Architect systems that do not assume synchronous execution.
  • Use queues, workers, and async pipelines for long-running inference, multi-step AI workflows, and fan-out / fan-in patterns.
  • Reason clearly about backpressure, retries, idempotency, and failure isolation (sketched below).
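
One common shape for this, sketched with plain asyncio (queue size, retry cap, and names are all invented for illustration): a bounded queue provides backpressure, workers cap retries, completed-job IDs give idempotency, and a failed job never takes a worker down with it.

```python
# Illustrative asyncio sketch; queue size and retry count are made up.
import asyncio

queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # bounded = backpressure
completed: set[str] = set()                        # idempotency record
MAX_RETRIES = 3

async def run_inference(payload: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a long-running model call
    return payload.upper()

async def worker(name: str) -> None:
    while True:
        job_id, payload = await queue.get()
        try:
            if job_id in completed:
                continue  # duplicate delivery: idempotent, skip re-running
            for attempt in range(MAX_RETRIES):
                try:
                    result = await run_inference(payload)
                    completed.add(job_id)
                    print(f"{name}: {job_id} -> {result}")
                    break
                except Exception:
                    # Failure isolation: a bad job never kills the worker.
                    if attempt == MAX_RETRIES - 1:
                        print(f"{name}: {job_id} failed permanently")
        finally:
            queue.task_done()

async def main() -> None:
    workers = [asyncio.create_task(worker(f"w{i}")) for i in range(4)]
    for i in range(10):
        # put() suspends the producer when the queue is full (backpressure).
        await queue.put((f"job-{i}", f"payload {i}"))
    await queue.join()
    for w in workers:
        w.cancel()

asyncio.run(main())
```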

Scale AI Systems Reliably

  • Decide when to scale vertically vs horizontally.
  • Understand GPU utilization, memory constraints, and contention.
  • Implement autoscaling logic, rate limiting, admission control, and graceful degradation under load (see the sketch after this list).
  • Build observability around latency, queue depth, GPU utilization, and error modes specific to LLMs.
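
A sketch of what admission control with graceful degradation can look like (all thresholds and the `infer` backend are hypothetical): a semaphore caps in-flight GPU work, and once the admitted-but-unfinished count is deep enough, new requests are shed fast with 503 instead of queuing unboundedly. The two counters here are exactly the numbers worth exporting as metrics.

```python
# Illustrative sketch; MAX_IN_FLIGHT / MAX_WAITING and `infer` are assumptions.
import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()

MAX_IN_FLIGHT = 8    # concurrent requests one GPU can serve acceptably
MAX_WAITING = 32     # beyond this depth, shed load instead of queueing

in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)
waiting = 0          # admitted-but-unfinished count; export as a metric

async def infer(prompt: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for the real GPU call
    return "completion"

@app.post("/v1/infer")
async def handler(prompt: str) -> dict:
    global waiting
    if waiting >= MAX_WAITING:
        # Graceful degradation: fail fast so latency stays bounded under load.
        raise HTTPException(status_code=503, detail="overloaded, retry later")
    waiting += 1
    try:
        async with in_flight:  # admission control: bounded GPU concurrency
            return {"completion": await infer(prompt)}
    finally:
        waiting -= 1
```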

Collaborate With AI Researchers (Without Becoming One)

  • Provide infra, abstractions, and guardrails so researchers can swap models, test fine-tuned variants, and ship improvements safely (a sketch follows this list).
  • Translate research artifacts into production systems.
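
One concrete form this abstraction often takes (shape, names, and paths assumed purely for illustration): callers request a stable alias, infra controls the mapping, so researchers swap checkpoints without touching downstream code.

```python
# Illustrative sketch: stable aliases decouple callers from checkpoints.
# All names and paths here are hypothetical.
MODEL_REGISTRY: dict[str, str] = {
    "default": "s3://models/prod-7b-v3",
    "candidate": "s3://models/finetune-experiment-12",
}

def resolve_model(alias: str) -> str:
    """Downstream teams ask for an alias; infra decides what it points to."""
    try:
        return MODEL_REGISTRY[alias]
    except KeyError:
        raise ValueError(f"unknown model alias: {alias!r}") from None

# Swapping the model behind "default" is a registry change, not a client change.
print(resolve_model("default"))
```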

Requirements

  • Strong backend engineering fundamentals (distributed systems, async execution).
  • Proficiency in Python (Golang acceptable).
  • Hands-on experience running GPU workloads in production.
  • Experience designing APIs over compute-heavy systems.
  • Comfort with Docker + Kubernetes (or equivalent orchestration).
  • Practical understanding of queues, workers, and background processing.
  • Ability to reason quantitatively about throughput, latency, and cost.
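
A back-of-envelope example of that quantitative reasoning (every number below is invented for illustration): GPU count falls out of the target request rate, tokens per request, and measured per-GPU throughput.

```python
# Capacity back-of-envelope; all figures are hypothetical examples.
import math

target_rps = 50             # target requests per second
tokens_per_request = 400    # average generated tokens per request
gpu_tokens_per_sec = 2_500  # measured decode throughput of one GPU

required_token_rate = target_rps * tokens_per_request  # 20,000 tokens/s
gpus_needed = math.ceil(required_token_rate / gpu_tokens_per_sec)
print(gpus_needed)  # 8, before adding headroom for spikes and failures
```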

Strong Signals (These Matter More Than Buzzwords)

  • You have personally deployed a model on a GPU, debugged GPU memory / OOM issues, handled inference latency regressions, and redesigned a synchronous system into an asynchronous one.
  • You can explain how batching affects latency (a toy example follows this list), why GPU utilization is often low, and how to prevent a single slow request from killing throughput.
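
On the batching question, a toy model shows the trade-off (both cost constants are invented, not measured): each forward pass has a fixed overhead that batching amortizes, so throughput climbs with batch size while per-request latency grows.

```python
# Toy batching model; both cost constants are illustrative, not measured.
FIXED_COST = 0.050     # seconds of fixed overhead per forward pass
PER_ITEM_COST = 0.005  # marginal seconds per request in the batch

for batch_size in (1, 8, 32):
    step_time = FIXED_COST + PER_ITEM_COST * batch_size
    throughput = batch_size / step_time  # requests per second
    # Latency shown is compute only; real systems also wait to fill the batch.
    print(f"batch={batch_size:>2}  latency={step_time:.3f}s  "
          f"throughput={throughput:.1f} rps")
```

In this toy run, going from batch size 1 to 32 raises throughput from about 18 to about 152 rps while latency roughly quadruples, which is precisely the tension this role is expected to manage.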

Nice To Have (But Not a Substitute)

  • Familiarity with LangChain / LlamaIndex (as orchestration layers, not magic), vector DBs (Qdrant, Pinecone, Weaviate), and CI/CD for ML or AI systems.
  • Experience in multi-tenant enterprise systems.

What You Should Have Done In The Past

  • Built or operated production AI or ML systems, not just demos.
  • Scaled a backend system where compute is the bottleneck.
  • Designed systems that continue to work under load and partial failure.
  • Worked closely with researchers or data scientists to productionize models.

Who Will NOT Be a Good Fit

  • Engineers who have only used hosted APIs (OpenAI / Anthropic) and never run models themselves.
  • Candidates whose experience is limited to prompt engineering or UI-level AI features.
  • Engineers uncomfortable reasoning about infrastructure, queues, and capacity.

This job was posted by Utkarsh Apoorva from AbleCredit.

More Info

Job ID: 137846631