Summary:
The Senior ML Ops Engineer is responsible for defining and owning the MLOps architecture for deep learning systems across the organization. This role involves designing and implementing end-to-end ML pipelines, building and maintaining CI/CD pipelines, and establishing model serving infrastructure. The engineer will ensure reliability, scalability, and reproducibility of ML workflows and experiments while managing model versioning and artifact storage.
Responsibilities:
- Define and own the overall MLOps architecture for deep learning systems across the organization.
- Design and implement end-to-end ML pipelines for data ingestion, training, validation, deployment, and monitoring.
- Build and maintain CI/CD pipelines for automated model training, evaluation, and deployment.
- Establish model serving infrastructure, including scalable and reliable real-time or batch inference pipelines.
- Implement model monitoring, data drift detection, performance observability, and alerting frameworks.
- Ensure reliability, scalability, and reproducibility of ML workflows and experiments.
- Manage model versioning, artifact storage, and experiment tracking.
- Design and maintain a data aggregation and data ingestion solution for large-scale vision datasets.
- Build data pipelines, feature stores, and dataset validation frameworks.
- Collaborate with algorithm and deep learning teams to transition R&D models into production-grade pipelines.
- Work with cloud platforms to deploy scalable ML systems.
- Build training and inference solutions using Azure ML, AWS SageMaker, or GCP Vertex AI.
- Implement containerized ML services using Docker and Kubernetes.
- Mentor junior engineers and guide teams as the technical authority for MLOps and ML lifecycle management.
- Collaborate closely with algorithm developers, CV engineers, data engineers, and platform teams.
Requirements:
- Bachelor's/Master's degree in Computer Science, Engineering, or a related field.
- 7 years of experience in MLOps, Computer Vision, and Python.
Required Skills:
- Strong understanding of ML workflow orchestration, lifecycle management, and platform design.
- Advanced Python skills and proficiency in C.
- Hands-on experience with PyTorch, TensorFlow, and scikit-learn.
- Experience with MLflow, Kubeflow, TFX, and DAG-based workflow engines.
- Experience designing data ingestion pipelines, dataset management systems, and feature stores.
- Hands-on experience with Azure ML, AWS SageMaker, or equivalent production ML platforms.
- Strong understanding of Docker, Kubernetes, and GitLab CI or GitHub Actions.
Preferred Skills:
- Demonstrated technical leadership on complex ML systems.
- Excellent problem-solving, communication, and collaboration skills.
- Ability to operate independently and drive architectural decisions.
#AditiConsulting
# 26-03493