Job Description: On-Prem Infrastructure Engineer / SRE
Location: Pan India
Experience: 5-10 Years
Role: On-Prem Infrastructure Engineer / Site Reliability Engineer (SRE)
Job Summary
We are seeking a skilled On-Prem Infrastructure Engineer / SRE to manage and support NVIDIA's on-prem engineering cloud infrastructure across multiple data centers. The ideal candidate will have strong experience in bare-metal infrastructure management, observability tools, automation, and production support. This role is critical in ensuring uptime, reliability, and operational excellence for engineering services.
Key Responsibilities
- On-Prem Infrastructure Management
Manage and operate NVIDIA's on-prem infrastructure across distributed data centers.
Maintain high availability, reliability, and readiness of on-prem engineering cloud environments.
Perform lifecycle management of bare-metal servers and underlying hardware.
Uphold Service Level Agreements (SLAs) for mission-critical engineering services.
Implement and maintain monitoring, alerting, and incident response workflows.
Drive root cause analysis (RCA), conduct post-mortems, and ensure corrective and preventive actions are implemented.
- Observability & Monitoring
Deploy, configure, and manage observability tools such as Prometheus, Grafana, and the ELK Stack.
Maintain KPI monitoring pipelines using Jenkins, Python, and ELK.
Develop and enhance custom monitoring dashboards and business-specific alerting rules.
- Automation & Optimization
Contribute to capacity planning, resource optimization, and performance tuning initiatives.
Develop automation scripts/tools using Python, Go, Bash, or Jenkins pipelines.
Improve operational efficiency through continuous automation.
- Day-to-Day Operations & Support
Monitor system alerts, troubleshoot incidents, and resolve user-reported issues.
Participate in war rooms during major or high-impact incidents.
Ensure timely escalation and resolution of production issues.
- Collaboration & Documentation
Create and maintain technical documentation for operational procedures, architectures, and troubleshooting steps.
Work closely with engineering, DevOps, hardware, and data center teams to improve overall infrastructure reliability.
Required Skills & Experience
Strong hands-on experience in bare-metal server management using tools such as IPMI, Redfish, KVM, or similar technologies.
Experience with automation and scripting using Python, Go, Bash, and Jenkins (CI/CD pipelines).
Practical experience with infrastructure tools: Kubernetes, MySQL, Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).
Solid understanding of system performance, capacity planning, and datacenter operations.
Strong troubleshooting, incident-response, and operational debugging skills.
Ability to work in fast-paced environments and handle production-critical scenarios.
Nice-to-Have Skills
Familiarity with NVIDIA hardware: GPUs, Tegra systems, DGX platforms, etc.
Experience in large-scale distributed systems or high-performance computing environments.
Soft Skills
Strong communication and collaboration abilities.
Analytical mindset with a focus on problem-solving.
Ability to maintain composure under pressure in incident environments.
Detail-oriented with strong documentation habits.