
Tesla

Site Reliability Engineer, HPC / AI Infrastructure

  • Posted 23 hours ago

Job Description

What To Expect
Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware and silicon design. With the rapidly growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development and automation of deployment, monitoring, self-healing and alerting processes is imperative to the success of our engineering groups. As the scope and impact of our Optimus, Full Self-Driving (FSD) and Robotaxi efforts continue to scale, so does the value of this team and its work.

As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our Full Self-Driving (FSD) and Optimus engineering teams have the tools and resources they need to be productive. This includes managing and operating our AI infrastructure, monitoring compute, GPU and network metrics, troubleshooting and tuning the performance of Linux systems, and securing the platform. Your work will directly facilitate neural network training at scale and streamline FSD development.

What You'll Do

  • Support the AI/ML cluster infrastructure on GPU platforms, focusing on systems automation, configuration management and deployment at scale.
  • Improve our monitoring & self-healing pipelines, as well as security posture.
  • Optimize our server, storage and network performance.
  • Develop new tools in Python, Golang or Bash/Shell.
  • Use Infrastructure as Code best practices.
  • Participate in a 24x7 on-call rotation.

What You'll Bring

  • Proficiency with Linux fundamentals and performance optimizations.
  • Experience with workload schedulers (Slurm, LSF) and storage management of parallel file systems.
  • Proficiency in Python, Golang and/or Bash.
  • Experience with configuration management software (Ansible, etc.) and with systems monitoring and alerting (Prometheus, Grafana, Telegraf, Splunk, etc.).
  • Experience with container orchestration technologies such as Kubernetes.
  • Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus.
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in a related field.
  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position.

    Job ID: 148680871
