On-Prem Infrastructure Engineer / SRE

Natobotics

Pimpri, India

5-10 Years

Posted 3 months ago

Job Description

On-Prem Infrastructure Engineer / SRE

Location: Pan India

Experience: 5-10 Years

Role: On-Prem Infrastructure Engineer / Site Reliability Engineer (SRE)

Job Summary

We are seeking a skilled On-Prem Infrastructure Engineer / SRE to manage and support NVIDIA's on-prem engineering cloud infrastructure across multiple data centers. The ideal candidate will have strong experience in bare-metal infrastructure management, observability tools, automation, and production support. This role is critical in ensuring uptime, reliability, and operational excellence for engineering services.

Key Responsibilities

On-Prem Infrastructure Management

Manage and operate NVIDIA's on-prem infrastructure across distributed data centers.

Maintain high availability, reliability, and readiness of on-prem engineering cloud environments.

Perform lifecycle management of bare-metal servers and underlying hardware.

Service Level Management

Guard and maintain Service Level Agreements (SLAs) for mission-critical engineering services.

Implement and maintain monitoring, alerting, and incident response workflows.

Drive root cause analysis (RCA), conduct post-mortems, and ensure corrective and preventive actions are implemented.

Observability & Monitoring

Deploy, configure, and manage observability tools such as Prometheus, Grafana, and the ELK Stack.

Maintain KPI monitoring pipelines using Jenkins, Python, and ELK.

Develop and enhance custom monitoring dashboards and business-specific alerting rules.

Automation & Optimization

Contribute to capacity planning, resource optimization, and performance tuning initiatives.

Develop automation scripts/tools using Python, Go, Bash, or Jenkins pipelines.

Improve operational efficiency through continuous automation.

Day-to-Day Operations & Support

Monitor system alerts, troubleshoot incidents, and resolve user-reported issues.

Participate in war rooms during major or high-impact incidents.

Ensure timely escalation and resolution of production issues.

Collaboration & Documentation

Create and maintain technical documentation for operational procedures, architectures, and troubleshooting steps.

Work closely with engineering, DevOps, hardware, and data center teams to improve overall infrastructure reliability.

Required Skills & Experience

Strong hands-on experience in bare-metal server management using tools such as IPMI, Redfish, KVM, or similar technologies.

Experience with automation and scripting using Python, Go, Bash, and Jenkins (CI/CD pipelines).

Practical experience with infrastructure tools: Kubernetes, MySQL, Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).

Solid understanding of system performance, capacity planning, and datacenter operations.

Strong troubleshooting, incident-response, and operational debugging skills.

Ability to work in fast-paced environments and handle production-critical scenarios.

Nice-to-Have Skills

Familiarity with NVIDIA hardware: GPUs, Tegra systems, DGX platforms, etc.

Experience in large-scale distributed systems or high-performance computing environments.

Soft Skills

Strong communication and collaboration abilities.

Analytical mindset with a focus on problem-solving.

Ability to maintain composure under pressure in incident environments.

Detail-oriented with strong documentation habits.