Nava

Site Reliability Engineer (SRE) GPU Infrastructure

Bengaluru, India

Fresher

Posted 8 hours ago

Job Description

Role & Responsibilities

Design, deploy, and automate scalable GPU cluster infrastructure across bare-metal and hybrid-cloud environments (K8s, Slurm, etc.).
Implement robust observability pipelines using Prometheus, Grafana, and custom exporters to monitor GPU utilization, memory pressure, and job failures.
Build self-healing mechanisms and chaos engineering practices to proactively detect and recover from GPU node failures or driver instability.
Collaborate with ML Platforms and Infrastructure teams to optimize containerized GPU workloads for throughput, cost-efficiency, and SLA compliance.
Develop and enforce IaC standards using Terraform, Ansible, or Pulumi for reproducible, version-controlled GPU infrastructure provisioning.
Lead incident response for GPU system outages; conduct post-mortems and drive reliability improvements across the stack.

Skills & Qualifications

Must-Have

Kubernetes
Prometheus
Grafana
Terraform
Ansible
Linux System Administration
NVIDIA GPU Driver Management
Slurm

Preferred

Chaos Engineering (Gremlin, Litmus)
GPU profiling tools (Nsight, DCGM)
Experience with AI/ML workload orchestration (Kubeflow, Ray, MLflow)

Benefits & Culture Highlights

Work with bleeding-edge GPU stacks that power next-gen AI models, directly shaping infrastructure that supports global AI innovation.
High-ownership culture with autonomy to architect and own critical reliability systems from design to deployment.
Collaborative, engineering-first environment with strong mentorship, peer code reviews, and dedicated SRE guilds for cross-team learning.

Skills: nvidia, ml, platforms, automation, building, infrastructure, cloud, teams, reliability, gpu, code

About Company

Nava

Job Source: www.linkedin.com

Job ID: 149075255

Last Updated: 11-06-2026 01:49:16 PM
