Senior Cloud & ML Infrastructure Engineer
Location: Bengaluru (Bangalore), Hyderabad, Pune, Mumbai, Mohali, Panchkula, Delhi
Experience: 6-10+ Years
Night Shift - 9 pm to 6 am
About The Role
We're looking for a Senior Cloud & ML Infrastructure Engineer to lead the design,
scaling, and optimization of cloud-native machine learning infrastructure. This role is ideal for
someone passionate about solving complex platform engineering challenges on AWS, with
a focus on model orchestration, deployment automation, and production-grade reliability. You'll
architect ML systems at scale, provide guidance on infrastructure best practices, and work
cross-functionally to bridge DevOps, ML, and backend teams.
Key Responsibilities
- Architect and manage end-to-end ML infrastructure using SageMaker, AWS Step
Functions, Lambda, and ECR
- Design and implement multi-region, highly available AWS solutions for real-time
inference and batch processing
- Create and manage IaC blueprints for reproducible infrastructure using AWS CDK
- Establish CI/CD practices for ML model packaging, validation, and drift monitoring
- Oversee infrastructure security, including IAM policies, encryption at rest and in transit, and
compliance standards
- Monitor and optimize compute and storage costs, ensuring efficient resource usage at scale
- Collaborate on data lake and analytics integration
- Serve as a technical mentor and guide AWS adoption patterns across engineering
teams
Required Skills
- 6+ years designing and deploying cloud infrastructure on AWS at scale
- Proven experience building and maintaining ML pipelines with services like SageMaker,
ECS/EKS, or custom Docker pipelines
- Strong knowledge of networking, IAM, VPCs, and security best practices in AWS
- Deep experience with automation frameworks, IaC tools, and CI/CD strategies
- Advanced scripting proficiency in Python, Go, or Bash
- Familiarity with observability stacks (CloudWatch, Prometheus, Grafana)
Nice To Have
- Background in robotics infrastructure, including AWS IoT Core, AWS IoT Greengrass, or over-the-air (OTA)
deployments
- Experience designing systems for physical robot fleet telemetry, diagnostics, and control
- Familiarity with multi-stage production environments and robotic software rollout
processes
- Competence in frontend hosting for dashboards or API visualization
- Experience with real-time streaming, MQTT, or edge inference workflows
- Hands-on experience with ROS 2 (Robot Operating System) or similar robotics
frameworks, including launch file management, sensor data pipelines, and deployment
to embedded Linux devices
Skills: SageMaker, CI/CD, ROS, AWS, Lambda, AWS Step Functions, Cloud, Pipelines, ML