The ideal candidate will be responsible for maintaining product and industry knowledge. You will work in a team-oriented environment that accelerates operational efficiency.
Responsibilities
- Design, build, deploy, and maintain production-grade ML pipelines and workflows using AWS and Python, with a focus on reliability, scalability, and observability.
- Own and enhance the MLOps platform that automates the full ML model lifecycle, from data annotation and training to inference, monitoring, and feedback loops.
- Collaborate closely with Data Scientists to productionize models, including packaging, versioning, deployment strategies, and performance optimization.
- Contribute to Agentic AI initiatives, including evaluation and deployment of MCP servers and related infrastructure components.
- Implement monitoring, logging, alerting, and CI/CD best practices for ML systems to ensure production stability and rapid issue resolution.
- Troubleshoot complex pipeline, infrastructure, and inference issues, performing root cause analysis and driving long-term fixes.
- Stay current with evolving MLOps practices, cloud-native ML tooling, and emerging AI infrastructure trends, and proactively introduce improvements.
- Participate in design reviews, technical discussions, and planning meetings; clearly communicate progress, risks, and trade-offs to stakeholders.
- Mentor interns and junior engineers by providing technical guidance, code reviews, and best practices.
Qualifications
- 3-6 years of hands-on experience building and operating ML or data platforms, with a strong focus on MLOps or ML infrastructure.
- Strong practical experience with AWS services such as SageMaker, S3, EC2, Batch, Lambda, IAM, and monitoring tools.
- Proficiency in Python for building ML pipelines, automation, and infrastructure tooling.
- Solid understanding of the ML lifecycle, including training, evaluation, deployment, inference, and model monitoring.
- Experience with containerization (Docker) and familiarity with orchestration frameworks (e.g., Kubernetes or managed equivalents).
- Strong problem-solving skills and the ability to independently drive tasks in a fast-paced, evolving environment.
- Effective communication skills and experience collaborating across Data Science and Engineering teams.