We're hiring a Senior Platform Engineer (ML Infrastructure) in the AI/ML infrastructure and deep-tech industry to join our core infrastructure team. This role will be critical in designing and scaling the foundational systems that power AI products. If you're passionate about building robust, efficient, and innovative ML infrastructure, we'd love to hear from you.
Responsibilities
- Design, build, and operate scalable ML and data infrastructure across on-prem and cloud (AWS/Azure/GCP).
- Stand up and automate multi-node Kubernetes + GPU clusters; keep them healthy and cost-efficient.
- Create golden-path CI/CD and MLOps pipelines (Kubeflow/Flyte/Ray) for training, serving, RAG, and agentic workflows.
- Partner with ML engineers to debug thorny CUDA/K8s issues before they hit prod.
- Champion IaC (Terraform/Pulumi) and config-as-code (Ansible) standards.
- Mentor developers on platform best practices and drive a platform-first mindset.
Requirements
- 5+ years of DevOps/SRE/Platform engineering experience; 2+ years with ML infra at scale.
- Deep hands-on experience with Docker, Kubernetes, Helm, and Kubernetes-native tooling.
- Comfort with distributed GPU scheduling, CUDA drivers, and networking.
- Strong Terraform/Pulumi, Ansible, Bash/Python skills.
- Experience operating data lakes, high-availability databases, and object stores.
- Familiarity with ML orchestration (Kubeflow, Flyte, Prefect) and model registries.
- Working knowledge of RAG, LLM fine-tuning, or agentic frameworks is a big plus.
Nice to have
- Experience with Ray, Spark, or Dask.
- Security and RBAC design chops.
- OSS contributions in the cloud-native / MLOps space.
This job was posted by Harjas Singh from FullThrottle Labs.