Key Responsibilities
- Design, implement, and maintain end-to-end MLOps pipelines for model training, validation, deployment, and monitoring.
- Build and manage LLMOps pipelines for fine-tuning, evaluating, and deploying large language models (e.g., OpenAI, HuggingFace Transformers, custom LLMs).
- Use Kubeflow and Kubernetes to orchestrate reproducible, scalable ML/LLM workflows.
- Implement CI/CD pipelines for ML projects using GitHub Actions, Argo Workflows, or Jenkins.
- Automate infrastructure provisioning using Terraform, Helm, or similar IaC tools.
- Integrate model registry and artifact management with tools like MLflow, Weights & Biases, or DVC.
- Manage containerization with Docker and container orchestration via Kubernetes.
- Set up monitoring, logging, and alerting for production models using tools like Prometheus, Grafana, and the ELK Stack.
- Collaborate closely with Data Scientists and DevOps engineers to ensure seamless integration of models into production systems.
- Ensure model governance, reproducibility, auditability, and compliance with enterprise and legal standards.
- Conduct performance profiling, load testing, and cost optimization for LLM inference endpoints.
Required Skills and Experience
- Core MLOps/LLMOps Expertise
- 5+ years of hands-on experience in MLOps/DevOps for AI/ML.
- 2+ years working with LLMs in production (e.g., fine-tuning, inference optimization, safety evaluations).
- Strong experience with Kubeflow Pipelines, KServe, and MLflow.
- Deep knowledge of CI/CD pipelines with GitHub Actions, GitLab CI, or CircleCI.
- Expert-level proficiency with Kubernetes, Helm, and Terraform for container orchestration and infrastructure as code.
- Programming & Frameworks
- Proficient in Python, with experience in ML libraries such as scikit-learn, TensorFlow, PyTorch, and Hugging Face Transformers.
- Familiarity with FastAPI, Flask, or gRPC for building ML model APIs.
- Cloud & DevOps
- Hands-on with AWS, Azure, or GCP (preferred: EKS, S3, SageMaker, Vertex AI, Azure ML).
- Knowledge of model serving using Triton Inference Server, TorchServe, or ONNX Runtime.
- Monitoring & Logging
- Tools: Prometheus, Grafana, ELK, OpenTelemetry, Sentry.
- Experience with model drift detection and A/B testing in production environments.
Soft Skills
- Strong problem-solving and debugging skills.
- Ability to mentor junior engineers and collaborate with cross-functional teams.
- Clear communication, documentation, and Agile/Scrum proficiency.
Preferred Qualifications
- Experience with LLMOps platforms like Weights & Biases, TruEra, PromptLayer, or LangSmith.
- Experience with multi-tenant LLM serving or agentic systems (LangChain, Semantic Kernel).
- Prior exposure to Responsible AI practices (bias detection, explainability, fairness).