Natobotics

Site Reliability Engineer

  • Posted 13 days ago

Job Description

We are hiring Site Reliability Engineers for the Hyderabad, Pune, Chennai and Bangalore locations.

Immediate Joiners

Experience: 6+ years

Only candidates with more than 6 years of experience need apply.

  • On-prem infrastructure management

Manage Nvidia's on-prem infrastructure. Maintain the uptime, reliability and readiness of the on-prem engineering cloud spread across multiple data centers.

  • Guard SLAs

Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems of incidents for any threshold breaches.

  • Observability

Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python and ELK.

Improve monitoring systems by adding custom alerts based on business needs.

  • Automation & Optimization

Assist with capacity planning, optimization, and efforts to improve utilization.

  • Day-to-Day Support

Support user-reported issues. Monitor alerts and take necessary action.

Actively participate in war rooms for critical issues.

  • Collaboration & Documentation

Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.

  • Tech stack

Bare-metal data center machine management tools such as IPMI, Redfish, and KVM.

Automation using Jenkins, Python, Go, Bash.

Infrastructure tools like Kubernetes, MySQL, Prometheus, Grafana and ELK.

Familiarity with Nvidia hardware such as GPUs and Tegra is a plus.

Job ID: 134323085
