We are seeking a highly skilled Lead DevOps Engineer with strong on-premise infrastructure expertise to join our team and drive the end-to-end deployment, scalability, and operationalization of machine learning models in production. You will collaborate closely with data scientists, data engineers, and DevOps teams to ensure seamless CI/CD, reproducibility, monitoring, and governance of ML pipelines.
Key Responsibilities
- Design, implement, and maintain CI/CD pipelines for deploying and monitoring microservices efficiently in on-premise environments.
- Manage infrastructure as code using Terraform (or equivalent on-prem solutions) for repeatable and scalable provisioning.
- Deploy and optimize containerized applications using Docker across on-premise environments, integrating with systems such as Harbor (or other private registries), Vault, and on-prem messaging/file storage solutions.
- Apply best practices for securing Docker images, including vulnerability scanning, reducing image size, and optimizing build efficiency.
- Implement and maintain centralized logging, monitoring, and alerting systems (e.g., Prometheus, Grafana, ELK stack) to ensure system reliability and observability.
- Ensure security best practices across on-prem environments, including secrets management, access control, and compliance with organizational policies.
Nice to Have
- Design and manage multi-client architectures within shared pipelines and storage solutions (e.g., NFS, Object Storage).