Lead the architecture, development, and deployment of scalable machine learning systems, focusing on real-time inference for LLMs serving multiple concurrent users.
Optimize inference pipelines using high-performance serving frameworks and runtimes such as vLLM, ONNX Runtime, Triton Inference Server, and TensorRT, as well as accelerator platforms like Groq, to minimize latency and cost (a minimal vLLM sketch appears below).
Design and implement agentic AI systems using frameworks such as LangChain and AutoGPT, and prompting patterns such as ReAct, for autonomous task orchestration (a framework-agnostic ReAct loop is sketched below).
Fine-tune, integrate, and deploy foundation models such as GPT, LLaMA, Claude, Mistral, and Falcon into intelligent applications (a parameter-efficient fine-tuning sketch appears below).
Develop and maintain robust MLOps workflows covering the full model lifecycle: training, deployment, monitoring, and versioning (one possible tracking setup is sketched below).
Collaborate with DevOps teams to implement scalable serving infrastructure leveraging containerization (Docker), orchestration (Kubernetes), and cloud platforms (AWS, GCP, Azure).
Implement retrieval-augmented generation (RAG) pipelines that integrate vector databases such as FAISS, Pinecone, or Weaviate (see the retrieval sketch below).
Build observability systems for LLMs that track prompt performance, latency, and user feedback (a lightweight logging sketch appears below).
Work cross-functionally with research, product, and operations teams to deliver production-grade AI systems handling real-world traffic patterns.
Stay current on emerging AI trends and hardware-acceleration techniques, and contribute to open-source or research initiatives where possible.
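For the inference-optimization item, a minimal vLLM sketch. The checkpoint name and prompts are placeholders; the latency and cost gains come from vLLM's continuous batching and PagedAttention, not from anything in this snippet.

```python
# Minimal vLLM batch-inference sketch; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumption: example checkpoint
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the benefits of continuous batching.",
    "Explain KV-cache paging in one paragraph.",
]

# vLLM schedules these requests with continuous batching + PagedAttention,
# which is where most of the latency and cost savings come from.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```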
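For the agentic item, a framework-agnostic ReAct loop rather than any specific LangChain or AutoGPT API. `call_llm` is a scripted stub standing in for a real chat-completion client, and both tools are hypothetical.

```python
import re

def call_llm(prompt: str) -> str:
    # Scripted stub so the sketch runs end to end; replace with a real model client.
    if "Observation:" in prompt:
        return " I now know the result.\nFinal Answer: 4"
    return " I should compute this.\nAction: calculator[2 + 2]"

TOOLS = {
    "search": lambda q: f"(search results for {q!r})",  # hypothetical tool
    "calculator": lambda expr: str(eval(expr)),         # demo only; never eval untrusted input
}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        # Ask the model for a Thought followed by an Action or a Final Answer.
        reply = call_llm(transcript + "\nThought:")
        transcript += f"\nThought:{reply}"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        action = re.search(r"Action: (\w+)\[(.*?)\]", reply)
        if action:
            tool, arg = action.groups()
            observation = TOOLS.get(tool, lambda _: "unknown tool")(arg)
            transcript += f"\nObservation: {observation}"
    return "no answer within step budget"

print(react("What is 2 + 2?"))  # -> 4
```

The loop alternates model reasoning with tool calls and feeds each observation back into the transcript, which is the core of the ReAct pattern regardless of framework.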
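For the fine-tuning item, a parameter-efficient sketch using Hugging Face transformers and peft (LoRA). The checkpoint, rank, and target modules are illustrative; API-only models such as Claude are tuned through their provider instead.

```python
# LoRA fine-tuning sketch; checkpoint and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # assumption: example open-weights model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters to the attention projections instead of
# updating all base weights.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total weights
# ...then run a standard Trainer / SFT loop over the task dataset.
```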
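For the MLOps item, one possible lifecycle setup using MLflow tracking and its model registry; the experiment name, metric values, and registered-model name are all hypothetical, and other stacks fit the same shape.

```python
import mlflow

mlflow.set_experiment("llm-finetune")  # assumption: example experiment name

with mlflow.start_run() as run:
    mlflow.log_param("base_model", "mistral-7b")
    mlflow.log_param("lora_rank", 8)
    mlflow.log_metric("eval_loss", 1.23)  # placeholder value

# Promote the run's logged model so serving can pin an exact version;
# assumes the run logged a model under the "model" artifact path.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "support-router")
```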
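For the RAG item, a minimal retrieval sketch with sentence-transformers and FAISS; the documents and embedding model are illustrative, and Pinecone or Weaviate would replace the local index with a hosted service call.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: example embedder
docs = [
    "vLLM uses continuous batching to raise GPU utilization.",
    "FAISS supports exact and approximate nearest-neighbor search.",
    "Kubernetes schedules model-serving pods across a cluster.",
]

# Normalized embeddings + inner product == cosine similarity.
vectors = encoder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

query = encoder.encode(["How does vLLM batch requests?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)

# The retrieved passages are prepended to the prompt before calling the LLM.
context = "\n".join(docs[i] for i in ids[0])
print(context)
```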
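For the observability item, a lightweight sketch that wraps the generation call to record per-prompt latency and size. The `generate` stub is a placeholder, and in production the log line would feed Prometheus, OpenTelemetry, or a dedicated LLM-observability backend rather than stdlib logging.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm.observability")

def observe(fn):
    """Record latency and prompt size for every generation call."""
    @wraps(fn)
    def wrapper(prompt_id: str, prompt: str, **kwargs):
        start = time.perf_counter()
        try:
            return fn(prompt_id, prompt, **kwargs)
        finally:
            latency_ms = (time.perf_counter() - start) * 1000
            log.info("prompt_id=%s latency_ms=%.1f prompt_chars=%d",
                     prompt_id, latency_ms, len(prompt))
    return wrapper

@observe
def generate(prompt_id: str, prompt: str) -> str:
    time.sleep(0.05)              # stand-in for a real model call
    return "stubbed completion"   # assumption: placeholder output

generate("demo-001", "Hello, world")
```

Tagging every call with a stable `prompt_id` is what lets latency and user feedback be joined back to specific prompt versions later.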