Site Reliability Engineer (SRE) – GPU Infrastructure

Nava

Bengaluru, India

Fresher

Posted 25 days ago

Job Description

Role & Responsibilities

Design, deploy, and maintain GPU-accelerated infrastructure on Kubernetes (EKS/GKE/AKS) and bare-metal clusters using the NVIDIA GPU Operator.
Automate deployment, scaling, and failover of AI workloads using Terraform, Ansible, and CI/CD pipelines (GitLab CI, ArgoCD).
Implement observability with Prometheus, Grafana, and distributed tracing to monitor GPU utilization, memory, and job latency.
Troubleshoot GPU driver, CUDA runtime, and container orchestration issues across multi-cluster, multi-region environments.
Collaborate with ML engineers to optimize job scheduling, resource isolation, and node affinity for high-throughput GPU training/inference.
Define and enforce SLOs/SLIs for AI infrastructure, automate on-call playbooks, and drive incident post-mortems to eliminate recurring failures.

Skills & Qualifications

Must-Have

Kubernetes
Prometheus
Grafana
Terraform
Ansible
NVIDIA GPU Operator
CUDA
GitLab CI

Preferred

ArgoCD
Slack/Opsgenie alerting
GPU profiling tools (Nsight, DCGM)

Benefits & Culture Highlights

Work directly on bleeding-edge AI infrastructure powering global LLM and HPC workloads.
On-site collaboration with deep-tech AI/ML engineers in a high-velocity, outcome-driven culture.
Ownership to architect and scale infrastructure: no red tape, just impact.

About Company

Nava

Job Source: www.linkedin.com

Job ID: 148569473

Last Updated: 09-06-2026 11:18:12 PM
