Job Responsibilities:
- Evaluate and source appropriate cloud infrastructure solutions for machine learning needs, ensuring cost-effectiveness and scalability based on project requirements.
- Automate and manage the deployment of machine learning models into production environments using tools such as Docker and Kubernetes, ensuring version control for models and datasets.
- Set up monitoring tools to track model performance and data drift, conduct regular maintenance, and implement updates for production models.
- Work closely with data scientists, software engineers, and stakeholders to align on project goals, facilitate knowledge sharing, and communicate findings and updates to cross-functional teams.
- Design, implement, and maintain scalable ML infrastructure, optimizing cloud and on-premise resources for training and inference.
- Document ML processes, pipelines, and best practices while preparing reports on model performance, resource utilization, and system issues.
- Provide training and support for team members on MLOps tools and methodologies, and stay updated on industry trends and emerging technologies.
- Diagnose and resolve issues related to model performance, infrastructure, and data quality, implementing solutions to enhance model robustness and reliability.
Education, Technical Skills & Other Critical Requirements:
- 6+ years of relevant experience in AI/analytics product and solution delivery.
- Bachelor's or master's degree in information technology, computer science, engineering, or a related field, or equivalent experience.
- Proficiency in frameworks such as TensorFlow, PyTorch, or Scikit-learn.
- Strong skills in Python and/or R; familiarity with Java, Scala, or Go is a plus.
- Experience with cloud services such as AWS, Azure, or Google Cloud Platform, particularly in ML services (e.g., AWS SageMaker, Azure ML).
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI), containerization (e.g., Docker), and orchestration (e.g., Kubernetes).
- Experience with databases (SQL and NoSQL), data pipelines, ETL processes, and ML pipeline orchestration (e.g., Airflow).
- Familiarity with monitoring and logging tools such as Prometheus, Grafana, or the ELK stack.
- Proficient in using Git for version control.
- Strong analytical and troubleshooting abilities to diagnose and resolve issues effectively.
- Good communication skills for working with cross-functional teams and conveying technical concepts to non-technical stakeholders.
- Ability to manage multiple projects and prioritize tasks in a fast-paced environment.