Lead and manage a team of engineers to deploy and monitor machine learning models in production.
Work with data engineers to design data engineering pipelines and run robust ETL processes that ensure reliable, high-quality data for analytics and ML workloads.
Collaborate with cross-functional teams, including data science, engineering, and operations, to understand business requirements and translate them into scalable ML solutions.
Architect and implement end-to-end machine learning pipelines for model training, testing, deployment, and monitoring.
Establish best practices and standards for model versioning, deployment, and monitoring to ensure reliability, scalability, and performance.
Implement automated processes for model training, hyperparameter tuning, and model evaluation using tools such as Weights & Biases, MLflow, Kubeflow, or similar.
Design and implement infrastructure for scalable and efficient model serving and inference, leveraging technologies such as Kubernetes, Docker, and serverless computing.
Develop and maintain monitoring and alerting systems to detect model drift, performance degradation, and other issues in production.
Provide technical leadership and mentorship to team members, fostering their professional growth and development.
Stay current with emerging technologies and industry trends in machine learning engineering, and evaluate their potential impact on our processes and infrastructure.
Collaborate with product management to define requirements and priorities for machine learning model deployments and validation, ensuring alignment with business goals and objectives.
Implement monitoring and logging solutions to track model performance metrics, resource utilization, and system health, enabling proactive issue detection and resolution.
Lead efforts to optimize resource utilization and cost-effectiveness of machine learning infrastructure, including compute resources, storage, and data transfer.
Stay abreast of advancements in machine learning technologies, evaluating their applicability and potential impact on our AI Operations strategy and roadmap.
Foster a culture of innovation, collaboration, and continuous improvement within the AI Operations team, encouraging experimentation and learning from failures.
Qualifications
B.Tech / M.Tech in Computer Science, Electronics, or related fields
8+ years of experience
Skills
Machine Learning, Software Development
Research and Development, Technology Strategy, Global Project Management, Team Management, Mentoring, Risk Management
Desired Skills
Master's or Bachelor's degree in Computer Science, Engineering, or a related field.
8+ years of experience in software engineering, data engineering, or related roles, with at least 2 years in a managerial or leadership role.
Experience designing and maintaining scalable data engineering pipelines and running robust ETL processes to ensure reliable, high-quality data for analytics and ML workloads.
Previous experience in a leadership or management role, with a track record of successfully leading technical teams and delivering high-impact projects.
Experience with version control systems (e.g., Git) and collaboration tools (e.g., GitHub, GitLab) for managing code repositories and facilitating team collaboration.
Familiarity with infrastructure as code (IaC) tools such as Terraform or CloudFormation for provisioning and managing cloud resources.
Knowledge of software development methodologies (e.g., Agile, DevOps) and best practices for building scalable and reliable software systems.
Ability to effectively communicate technical concepts and solutions to non-technical stakeholders, including executives, product managers, and business users.
Strong proficiency in Python, Java, and related IDEs.
Awareness of machine learning concepts, algorithms, and frameworks (e.g., TensorFlow, PyTorch, scikit-learn).
Experience with cloud platforms and services (e.g., Azure, AWS, GCP) for building and deploying machine learning applications.
Proficiency in containerization technologies (e.g., Docker) and orchestration tools (e.g., Kubernetes).
Hands-on experience with MLOps tools and platforms such as Weights & Biases, MLflow, Kubeflow, TFX, or similar.
Experience with DevOps and DevSecOps tools and practices.
Strong problem-solving skills and ability to troubleshoot complex issues in production environments.
Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.