Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!
We are looking for a highly motivated Machine Learning Operations Engineer with 3-4 years of experience in building and deploying end-to-end ML products in production environments. The ideal candidate has a strong ML background in binary and multi-class classification, recommendation and chatbot applications, and deploying training/inference pipelines, with hands-on experience in CI/CD, monitoring, and Kubernetes deployments.
Key Responsibilities:
- Design, build, and deploy robust ML pipelines for training, fine-tuning, and inference of models (NLP-focused: NER, Classification).
- Develop and maintain CI/CD workflows for ML pipelines using Jenkins or similar tools, ensuring rapid and safe deployment to production.
- Implement model monitoring and alerting systems to track performance degradation and drift in real time.
- Collaborate with cross-functional teams to retrain models on trigger events and integrate feedback loops into the ML lifecycle.
- Deploy ML pipelines to Kubernetes clusters using Helm, and optimize them for scalable and resilient operations.
- Use MLflow, Kubeflow, and related tools for experiment tracking, model versioning, and reproducibility.
- Write clean, efficient, and scalable Python code using PyTorch, with CUDA for GPU acceleration.
- Tune and optimize the performance of LLM applications in production.
Required Skills:
- Strong programming experience in Python and PyTorch.
- Hands-on experience with CI/CD pipelines using Jenkins.
- Proficient with Kubernetes for deploying and managing ML workloads.
- Experience with model training, fine-tuning, and inference pipeline development.
- Working knowledge of model monitoring and alerting systems (performance drift, latency, accuracy drop).
- Experience with MLflow, Kubeflow, and model versioning best practices.
- Solid understanding of NER, Text Classification, and common NLP tasks.
- Familiarity with CUDA for training models on GPUs.
Good to Have:
- Experience with Generative AI systems in production.
- Prior experience building or deploying applications on GPU hardware such as NVIDIA L40S, H100, or H200.
- Familiarity with LangChain, LangGraph, and LangSmith for building LLM-powered agents and applications.