Key Responsibilities
- Design, implement, and maintain end-to-end MLOps pipelines for model training, validation, deployment, and monitoring.
- Build and manage LLMOps pipelines for fine-tuning, evaluating, and deploying large language models (e.g., OpenAI, HuggingFace Transformers, custom LLMs).
- Use Kubeflow and Kubernetes to orchestrate reproducible, scalable ML/LLM workflows.
- Implement CI/CD pipelines for ML projects using GitHub Actions, Argo Workflows, or Jenkins.
- Automate infrastructure provisioning using Terraform, Helm, or similar IaC tools.
- Integrate model registry and artifact management with tools like MLflow, Weights & Biases, or DVC.
- Manage containerization with Docker and container orchestration via Kubernetes.
- Set up monitoring, logging, and alerting for production models using tools like Prometheus, Grafana, and the ELK Stack.
- Collaborate closely with Data Scientists and DevOps engineers to ensure seamless integration of models into production systems.
- Ensure model governance, reproducibility, auditability, and compliance with enterprise and legal standards.
- Conduct performance profiling, load testing, and cost optimization for LLM inference endpoints.
Required Skills and Experience
- Core MLOps/LLMOps Expertise
- 5+ years of hands-on experience in MLOps/DevOps for AI/ML.
- 2+ years working with LLMs in production (e.g., fine-tuning, inference optimization, safety evaluations).
- Strong experience with Kubeflow Pipelines, KServe, and MLflow.
- Deep knowledge of CI/CD pipelines with GitHub Actions, GitLab CI, or CircleCI.
- Expert-level proficiency with Kubernetes, Helm, and Terraform for container orchestration and infrastructure as code.
- Programming & Frameworks
- Proficient in Python, with experience in ML libraries such as scikit-learn, TensorFlow, PyTorch, and Hugging Face Transformers.
- Familiarity with FastAPI, Flask, or gRPC for building ML model APIs.
- Cloud & DevOps
- Hands-on with AWS, Azure, or GCP (preferred: EKS, S3, SageMaker, Vertex AI, Azure ML).
- Knowledge of model serving using Triton Inference Server, TorchServe, or ONNX Runtime.
- Monitoring & Logging
- Tools: Prometheus, Grafana, ELK, OpenTelemetry, Sentry.
- Experience with model drift detection and A/B testing in production environments.
Soft Skills
- Strong problem-solving and debugging skills.
- Ability to mentor junior engineers and collaborate with cross-functional teams.
- Clear communication, documentation, and Agile/Scrum proficiency.
Preferred Qualifications
- Experience with LLMOps platforms like Weights & Biases, TruEra, PromptLayer, or LangSmith.
- Experience with multi-tenant LLM serving or agentic systems (LangChain, Semantic Kernel).
- Prior exposure to Responsible AI practices (bias detection, explainability, fairness).