As a Lead MLOps Engineer, you will play a pivotal role in building scalable, reliable machine learning infrastructure for enterprise-grade applications. We are looking for someone with a core data engineering background and strong exposure to MLOps practices, ideally with experience on large-scale data platforms. This is a hybrid role that blends big data engineering with end-to-end model lifecycle management, from development and deployment through monitoring and retraining. The ideal candidate brings hands-on experience with Databricks, PySpark, and the orchestration of production-grade ML pipelines, enabling efficient, resilient solutions in dynamic, data-driven environments.
Roles and Responsibilities:
- Design and implement distributed data processing pipelines using PySpark.
- Collaborate with business architects and stakeholders to design scalable data and ML workflows.
- Optimize performance of Spark applications through tuning, resource management, and caching strategies.
- Debug long-running Spark jobs using the Spark UI; address out-of-memory (OOM) errors, data skew, shuffle issues, and job retries (a PySpark skew-mitigation sketch follows this list).
- Manage model deployment workflows using tools like MLflow for experiment tracking, model versioning, and the model registry (a minimal MLflow sketch also follows this list).
- Build and maintain CI/CD pipelines for both data and ML workflows.
- Containerize applications using Docker and orchestrate using tools like Kubernetes.
- Monitor production models, manage retraining workflows, and keep dependencies under control.
- Contribute to clean, collaborative Git workflows with practices such as branching, rebasing, and PR reviews.
- Work across teams to ensure models are production-ready, scalable, and aligned with business goals.
- Develop and orchestrate big data workflows on Databricks.
- Build scalable data and ML solutions on at least one major cloud platform (Azure preferred).
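
The Spark tuning and debugging work described above often comes down to handling skewed joins. Below is a minimal PySpark sketch of one common mitigation, key salting; the table paths, the `user_id` join key, and the salt bucket count are hypothetical placeholders, and salting is only one option among several (AQE skew-join handling and broadcast joins are common alternatives).

```python
# Minimal PySpark sketch: mitigating join skew with key salting.
# Paths, the user_id key, and the bucket count are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("skew-mitigation-sketch")
    .config("spark.sql.shuffle.partitions", "400")  # tune to cluster/data size
    .getOrCreate()
)

events = spark.read.format("delta").load("/data/events")    # large, skewed side
users = spark.read.format("delta").load("/data/dim_users")  # dimension side

SALT_BUCKETS = 16

# Spread hot keys across partitions by appending a random salt.
salted_events = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the dimension side across every salt value so joins still match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_users = users.crossJoin(salts)

joined = salted_events.join(salted_users, on=["user_id", "salt"]).drop("salt")

joined.cache()  # keep only if the result is reused downstream
joined.count()  # materialize so the Spark UI shows the real shuffle profile
```

The `cache()` call is deliberate only under the assumption that the joined result is reused by later stages; if it is not, dropping it avoids unnecessary memory pressure.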
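For the MLflow responsibility, here is a minimal sketch of tracking a training run and registering the resulting model. The experiment path and registry name ("churn-classifier") are placeholders, and the exact registry API varies across MLflow versions and Databricks workspace configurations.

```python
# Minimal MLflow sketch: track a training run and register the model.
# The experiment path and model name are placeholders; the registry API
# differs slightly across MLflow versions.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_experiment("/Shared/churn-classifier")  # Databricks-style path

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

with mlflow.start_run() as run:
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

# Registering the logged model gives deployments a versioned artifact
# instead of an ad-hoc file path.
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="churn-classifier",
)
```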
Required Skills and Experience:
- Proficient in PySpark, with strong experience in Spark performance tuning and optimization.
- Strong expertise in Databricks for development, orchestration, and job monitoring.
- Working knowledge of MLflow or similar tools for model lifecycle management.
- Proficient in Python and SQL.
- Deep understanding of distributed data systems, job scheduling, and fault tolerance.
- Experience working with structured and semi-structured data formats such as Parquet, Delta Lake, and JSON.
- Familiarity with feature stores, model monitoring, drift detection, and automated retraining workflows (a drift-detection sketch follows this list).
- Strong command of Git and version control in multi-developer environments.
- Experience with CI/CD tools for data and ML pipelines.
- Knowledge of containerization (Docker) and orchestration (Kubernetes) is a plus.
- Experience with at least one major cloud platform (Azure preferred, or AWS/GCP).
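
As an illustration of the drift-detection requirement, here is a minimal sketch using the Population Stability Index (PSI) on model scores. The 0.2 alert threshold is a common rule of thumb rather than a standard, and the score distributions below are synthetic stand-ins for real training and production data.

```python
# Minimal drift-detection sketch: Population Stability Index (PSI)
# between a training baseline and current production scores.
# The 0.2 alert threshold is a common rule of thumb, not a standard.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Compare two score distributions; higher PSI means more drift."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)

    # Clip empty bins to avoid log(0) and division by zero.
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)

    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(42)
baseline_scores = rng.normal(0.5, 0.10, 10_000)  # stand-in for training scores
current_scores = rng.normal(0.6, 0.15, 10_000)   # stand-in for production scores

drift = psi(baseline_scores, current_scores)
if drift > 0.2:
    print(f"PSI={drift:.3f}: significant drift, consider triggering retraining")
else:
    print(f"PSI={drift:.3f}: distribution looks stable")
```

In a production setup this check would typically run on a schedule and feed the automated retraining workflow mentioned above rather than printing to stdout.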