Role & Responsibilities
- Design, deploy, and automate scalable GPU cluster infrastructure across bare-metal and hybrid-cloud environments (K8s, Slurm, etc.).
- Implement robust observability pipelines using Prometheus, Grafana, and custom exporters to monitor GPU utilization, memory pressure, and job failures.
- Build self-healing mechanisms and chaos engineering practices to proactively detect and recover from GPU node failures or driver instability.
- Collaborate with ML Platforms and Infrastructure teams to optimize containerized GPU workloads for throughput, cost-efficiency, and SLA compliance.
- Develop and enforce IaC standards using Terraform, Ansible, or Pulumi for reproducible, version-controlled GPU infrastructure provisioning.
- Lead incident response for GPU system outages; conduct post-mortems and drive reliability improvements across the stack.
Skills & Qualifications
Must-Have
- Kubernetes
- Prometheus
- Grafana
- Terraform
- Ansible
- Linux System Administration
- NVIDIA GPU Driver Management
- Slurm
Preferred
- Chaos Engineering (Gremlin, Litmus)
- GPU profiling tools (Nsight, DCGM)
- Experience with AI/ML workload orchestration (Kubeflow, Ray, MLflow)
Benefits & Culture Highlights
- Work with bleeding-edge GPU stacks powering next-gen AI models—directly shaping infrastructure that impacts global AI innovation.
- High-ownership culture with autonomy to architect and own critical reliability systems from design to deployment.
- Collaborative, engineering-first environment with strong mentorship, peer code reviews, and dedicated SRE guilds for cross-team learning.
Skills: nvidia,ml,platforms,automation,building,infrastructure,cloud,teams,reliability,gpu,code