Nava

Site Reliability Engineer (SRE) – GPU Infrastructure

This job is no longer accepting applications

  • Posted 25 days ago

Job Description

Role & Responsibilities

  • Design, deploy, and maintain GPU-accelerated infrastructure on Kubernetes (EKS/GKE/AKS) and bare-metal clusters using the NVIDIA GPU Operator.
  • Automate deployment, scaling, and failover of AI workloads using Terraform, Ansible, and CI/CD pipelines (GitLab CI, ArgoCD).
  • Implement observability with Prometheus, Grafana, and distributed tracing to monitor GPU utilization, memory, and job latency.
  • Troubleshoot GPU driver, CUDA runtime, and container orchestration issues across multi-cluster, multi-region environments.
  • Collaborate with ML engineers to optimize job scheduling, resource isolation, and node affinity for high-throughput GPU training/inference.
  • Define and enforce SLOs/SLIs for AI infrastructure, automate on-call playbooks, and drive incident post-mortems to eliminate recurring failures.

Skills & Qualifications

Must-Have

  • Kubernetes
  • Prometheus
  • Grafana
  • Terraform
  • Ansible
  • NVIDIA GPU Operator
  • CUDA
  • GitLab CI

Preferred

  • ArgoCD
  • Slack/Opsgenie alerting
  • GPU profiling tools (Nsight, DCGM)

Benefits & Culture Highlights

  • Work directly on bleeding-edge AI infrastructure powering global LLM and HPC workloads.
  • On-site collaboration with deep-tech AI/ML engineers in a high-velocity, outcome-driven culture.
  • Ownership of infrastructure architecture and scaling: no red tape, just impact.

Skills: nvidia, ml, platforms, automation, building, infrastructure, cloud, teams, reliability, gpu, code

Job ID: 148569473
