Key Responsibilities:
- Design, build, and maintain CI/CD pipelines for ML model training, validation, and deployment
- Automate and optimize ML workflows, including data ingestion, feature engineering, model training, and monitoring
- Deploy, monitor, and manage LLMs and other ML models in production (on-premises and/or cloud)
- Implement model versioning, reproducibility, and governance best practices
- Collaborate with data scientists, ML engineers, and software engineers to streamline the end-to-end ML lifecycle
- Ensure security, compliance, and scalability of ML/LLM infrastructure
- Troubleshoot and resolve issues related to ML model deployment and serving
- Evaluate and integrate new MLOps/LLMOps tools and technologies
- Mentor junior engineers and contribute to best practices documentation
Required Skills & Qualifications:
- 8+ years of experience in DevOps, with at least 3 years in MLOps/LLMOps
- Strong experience with cloud platforms (AWS, Azure, GCP) and with containerization and orchestration (Docker, Kubernetes)
- Proficiency with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, etc.)
- Hands-on experience deploying and managing a range of AI models (e.g., OpenAI, HuggingFace, custom models) for use in building solutions
- Experience with model serving tools such as TGI, vLLM, or BentoML
- Solid scripting and programming skills (Python, Bash, etc.)
- Familiarity with monitoring/logging tools (Prometheus, Grafana, ELK stack)
- Strong understanding of security and compliance in ML environments
Preferred Skills:
- Knowledge of model explainability, drift detection, and model monitoring
- Familiarity with data engineering tools (Spark, Kafka, etc.)
- Knowledge of data privacy, security, and compliance in AI systems
- Strong communication skills to effectively collaborate with various stakeholders
- Strong critical thinking and problem-solving skills
- Proven ability to lead and manage projects with cross-functional teams