Job Description
The role
You'll design, build, and productionize AI features end to end (model selection and fine-tuning, retrieval pipelines, evaluations, and the serving layer), working closely with product and platform teams. If you enjoy turning ambiguous problems into shipping systems, you'll feel at home here.
What you'll do
Own AI feature delivery from prototype to production (design, implement, evaluate, iterate).
Build RAG pipelines (chunking, embeddings, vector stores), prompt/program orchestration, and guardrails.
Fine-tune and/or distill models (open/closed source) for classification, generation, and tool-use.
Implement robust offline & online evals (unit evals, golden sets, regression tests, user-feedback loops); a minimal golden-set check is sketched after this list.
Ship reliable services: APIs, workers, model servers, and monitoring/observability (latency, cost, quality).
Partner with product/design to shape problem statements, success metrics, and experiment plans.
Champion engineering best practices (reviews, testing, docs, incident learnings).
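To make the evaluation bullet concrete, here is a minimal sketch of a golden-set regression check, not our actual harness: `answer` is a hypothetical stand-in for whatever pipeline is under test, and the JSONL path and pass threshold are illustrative.

```python
# Minimal golden-set regression check (illustrative, not our actual harness).
# `answer` is a hypothetical callable: question in, answer string out.
import json
from typing import Callable

def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def run_golden_set(answer: Callable[[str], str],
                   path: str = "golden_set.jsonl",
                   threshold: float = 0.9) -> float:
    """Score the pipeline on a JSONL golden set; fail the run below threshold."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    hits = sum(exact_match(answer(c["question"]), c["answer"]) for c in cases)
    score = hits / len(cases)
    assert score >= threshold, f"eval regression: {score:.2%} < {threshold:.0%}"
    return score
```

A check like this runs in CI against every prompt or retrieval change, which is what the "regression tests" bullet is getting at.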
Tech you might use here
Languages: Python, TypeScript/Node.
AI/ML: PyTorch, Hugging Face, OpenAI/Anthropic/other LLM APIs, vLLM/TensorRT-LLM, LangChain/LlamaIndex (pragmatically).
Data & Retrieval: Postgres, Redis, Milvus/pgvector/Weaviate, Kafka.
Infra: Docker, Kubernetes, CI/CD, Grafana/Prometheus, cloud (AWS/GCP).
Quality: Prompt/unit tests, offline eval harnesses, canary analysis, A/B testing.
We're looking for
3–7+ years of software engineering experience, with 1–3+ years in applied ML/LLM or search/retrieval.
Strong Python engineering (typing, testing, packaging) and service design (APIs, queues, retries, idempotency).
Hands-on with at least two of: RAG in prod, fine-tuning (LoRA/QLoRA), embeddings and ANN search (Annoy/HNSW), function/tool calling, or model serving at scale.
Practical evaluation mindset: create golden datasets, design metrics (accuracy, faithfulness, toxicity, latency, cost).
Product sense and ownership: you measure impact, not just model scores.
Clear communication and collaborative habits (PRs, design docs, incident notes).
Nice to have
Experience with multi-tenant architectures, RBAC/ABAC, and data governance.
Safety & reliability work (red-teaming, jailbreak defenses, PII handling).
Frontend familiarity (React) to iterate quickly on UX for AI features.
Prior startup experience or 0→1 product building.
What success looks like (first 90 days)
Ship a scoped AI feature into customer hands with an eval harness and dashboards.
Reduce either latency or cost of an existing pipeline by 20–30% without quality loss.
Add at least one reusable internal component (chunker, ranker, guardrail, eval set); a sample chunker is sketched below.
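As one example of such a component, a fixed-window chunker with overlap might look like this sketch (window sizes are placeholders; production chunking usually respects sentence or section boundaries).

```python
# Illustrative fixed-window chunker with overlap; sizes are placeholders.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```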
Interview process
Intro chat (30 min): role fit & expectations.
Technical deep-dive (60 min): systems + ML/LLM problem solving.
Practical exercise (take-home or pairing, 3–4 hrs): build a small RAG/eval pipeline.
Final loop (60–90 min): product & culture, past work, offer Q&A.
Example exercise (high-level brief)
Build a minimal retrieval-augmented QA service over a small doc set. Include: chunking strategy, embedding store, answer generation, and an eval set (10–20 Q/A) with simple metrics (EM/F1/faithfulness). Provide a short README with trade-offs and cost/latency numbers.
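For the EM/F1 metrics named in the brief, SQuAD-style token overlap is one reasonable baseline; the sketch below uses simplified normalization (lowercasing and whitespace splitting) and is only a starting point, not a required implementation.

```python
# SQuAD-style EM and token-level F1 with simplified normalization.
from collections import Counter

def _tokens(text: str) -> list[str]:
    return text.lower().split()

def exact_match(pred: str, ref: str) -> float:
    return float(_tokens(pred) == _tokens(ref))

def f1(pred: str, ref: str) -> float:
    p, r = _tokens(pred), _tokens(ref)
    common = Counter(p) & Counter(r)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

Faithfulness typically needs an LLM judge or NLI model, so it is left out of this sketch; pick whatever approach you can justify in the README.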
Benefits
Competitive compensation with performance bonuses.
Flexible hours (IST core overlap).
Learning stipend & modern hardware.
Fast path to ownership and visible impact.