The DevOps Engineer/Support Lead will be responsible for managing, automating, and optimizing cloud infrastructure, CI/CD workflows, and machine learning operations. The role requires strong expertise in AWS, Kubernetes, Infrastructure-as-Code (IaC), and automation, along with the ability to collaborate with data scientists and engineers to operationalize ML models. The position combines cloud engineering, DevOps, and MLOps responsibilities to ensure scalable, reliable, and secure environments.
What you will do
- Manage and automate cloud infrastructure using Terraform across AWS and OCI environments.
- Develop, maintain, and optimize CI/CD pipelines using Jenkins and Bitbucket.
- Deploy, manage, and monitor Kubernetes clusters for scalable workloads.
- Work with data engineers and data scientists to deploy, monitor, and maintain machine learning models in production.
- Automate ML workflows through CI/CD, ensuring seamless model deployment and integration.
- Implement model versioning, experiment tracking, and metadata management with tools such as MLflow or DVC.
- Ensure ML pipeline reproducibility, scalability, and reliability across environments.
- Monitor live model performance and data drift, and establish ML observability mechanisms.
- Maintain and support infrastructure for model training, validation, and inference.
- Troubleshoot and maintain AWS and OCI environments, ensuring stability and operational efficiency.
- Use Ansible for configuration management and automation tasks.
- Ensure availability and system reliability through proactive monitoring and timely incident resolution.
- Implement cloud security best practices and compliance controls.
What skills are required
- Hands-on experience working with AWS services including EC2, S3, IAM, RDS, SageMaker, and Step Functions.
- Strong expertise in Kubernetes cluster management and deployments.
- Proficiency in CI/CD pipelines using Jenkins, Bitbucket, or similar tools.
- Solid knowledge of Terraform for Infrastructure-as-Code automation.
- Python scripting experience for automation and MLOps workflows.
- Familiarity with MLOps and workflow orchestration tools such as MLflow, Airflow, or Kubeflow.
- Good understanding of Linux administration and networking fundamentals.
- Ability to troubleshoot cloud infrastructure issues and resolve deployment challenges.
- Exposure to OCI, Ansible, and HashiCorp Nomad is an added advantage.