
Nava

Site Reliability Engineer (SRE), GPU Infrastructure

  • Posted 8 hours ago

Job Description

Role & Responsibilities

  • Design, deploy, and automate scalable GPU cluster infrastructure (Kubernetes, Slurm, etc.) across bare-metal and hybrid-cloud environments.
  • Implement robust observability pipelines using Prometheus, Grafana, and custom exporters to monitor GPU utilization, memory pressure, and job failures.
  • Build self-healing mechanisms and chaos engineering practices to proactively detect and recover from GPU node failures or driver instability.
  • Collaborate with ML Platforms and Infrastructure teams to optimize containerized GPU workloads for throughput, cost-efficiency, and SLA compliance.
  • Develop and enforce IaC standards using Terraform, Ansible, or Pulumi for reproducible, version-controlled GPU infrastructure provisioning.
  • Lead incident response for GPU system outages; conduct post-mortems and drive reliability improvements across the stack.

Skills & Qualifications

Must-Have

  • Kubernetes
  • Prometheus
  • Grafana
  • Terraform
  • Ansible
  • Linux System Administration
  • NVIDIA GPU Driver Management
  • Slurm

Preferred

  • Chaos Engineering (Gremlin, Litmus)
  • GPU profiling tools (Nsight, DCGM)
  • Experience with AI/ML workload orchestration (Kubeflow, Ray, MLflow)

Benefits & Culture Highlights

  • Work with bleeding-edge GPU stacks powering next-gen AI models, directly shaping infrastructure that impacts global AI innovation.
  • High-ownership culture with autonomy to architect and own critical reliability systems from design to deployment.
  • Collaborative, engineering-first environment with strong mentorship, peer code reviews, and dedicated SRE guilds for cross-team learning.


Job ID: 149075255