We are looking for a senior AI/ML engineer who can take AI features from prototype to production with confidence. You will own the full lifecycle of our LLM-powered systems, from benchmarking model and pipeline performance to hardening the stack for scale to shipping it live to real users. This role sits at the intersection of applied LLM/GenAI work and MLOps and is critical to how quickly and reliably we can put new AI capabilities in front of customers. You will work closely with product, backend, and design to make sure what we ship is fast, accurate, cost-efficient, and observable in production.
Responsibilities
- Design, build, and ship LLM-powered features end-to-end, including RAG pipelines, agentic workflows, prompt orchestration, and fine-tuning where it makes sense.
- Define and run benchmarking frameworks for our AI applications: latency, throughput, accuracy, hallucination rate, cost per request, and quality regressions across model and prompt changes.
- Establish offline evals (golden sets, LLM-as-judge, human-in-the-loop) and online evals (A/B tests, shadow traffic, canary releases) before any model or prompt goes live.
- Take models and pipelines to production: containerise, deploy, autoscale, and monitor inference services with clear SLOs for latency, error rate, and cost.
- Build the MLOps backbone: CI/CD for models and prompts, versioning, feature stores where needed, observability (traces, metrics, and logs), and rollback paths.
- Optimise inference performance and cost: batching, caching, quantisation, distillation, model routing, and choosing the right managed vs self-hosted trade-offs.
- Partner with product to translate fuzzy product asks into measurable AI quality bars, and own the "is this good enough to ship?" decision with data behind it.
- Mentor other engineers on LLM best practices, evaluation rigour, and production readiness.
Requirements
- 3-6 years of engineering experience, with a meaningful portion spent shipping ML or AI systems to production (not just notebooks or POCs).
- Strong hands-on experience with LLMs and GenAI: at least one production system using OpenAI / Anthropic / open-source models, plus practical experience with RAG, embeddings, vector stores, and prompt engineering.
- Solid MLOps foundation: model serving (FastAPI, vLLM, Triton, SageMaker, or similar), containerisation (Docker, Kubernetes), and at least one cloud (AWS, GCP, or Azure).
- Demonstrated ability to benchmark systems rigorously: you can talk concretely about how you measured a model's quality and performance, what you optimised, and what you knowingly traded off.
- Strong Python skills; comfortable with PyTorch or TensorFlow and with frameworks like LangChain, LlamaIndex, or equivalents (or a clear point of view on why not to use them).
- Good engineering discipline: testing, code review, clear API design, and the instinct to add observability before it is needed.
- Comfortable owning the path to production: you have taken something live, watched it break, and fixed it, and you do not need a separate team to do that for you.
Bonus Points
- Experience fine-tuning or post-training open-source models (LoRA/QLoRA, DPO, RLHF).
- Experience working with multimodal models (image, video, or audio generation/understanding).
- Built or contributed to an internal eval harness or LLM observability tooling.
- Experience with high-QPS, low-latency inference at consumer scale.
- Open-source contributions or technical writing in the AI/ML space.
This job was posted by Dhanesh Sridhar from Zocket.