
Advanced Micro Devices (AMD)

SMTS Systems Design Eng.

4-8 Years
  • Posted 17 hours ago

Job Description

We are seeking an experienced HPC Systems Engineer with 7+ years of expertise in high-performance computing (HPC) environments. This role requires hands-on experience with Python, Kubernetes (K8s), Slurm, OpenStack, and Ansible, along with the ability to support external clients in live troubleshooting sessions.

THE PERSON:

The ideal candidate will have deep technical knowledge of drivers, troubleshooting methods, and system-level debugging and will play a key role in managing, optimizing, and troubleshooting HPC clusters and cloud-based HPC environments.

KEY RESPONSIBILITIES:

HPC System Administration & Troubleshooting
  • Manage and optimize HPC clusters, ensuring high availability and performance.
  • Troubleshoot GPU, CPU, network drivers, firmware, and OS-level issues.
  • Debug storage, networking, and job scheduling bottlenecks in Slurm-based environments.

Kubernetes & Cloud HPC Environments
  • Deploy and manage HPC workloads in Kubernetes for AI/ML and parallel computing.
  • Optimize OpenStack-based HPC clusters with Ceph, Cinder, and Neutron for cloud scalability.
  • Implement containerized HPC workflows using Kubernetes and OpenShift.

Automation & Infrastructure as Code (IaC)
  • Develop Ansible and Terraform scripts for provisioning and managing HPC resources.
  • Automate job scheduling, cluster monitoring, and log analysis using Python.
  • Optimize CI/CD pipelines for HPC and AI/ML applications.

Performance Tuning & Benchmarking
  • Benchmark and optimize multi-node HPC workloads (MPI, NCCL, ROCm, CUDA).
  • Tune OS parameters, networking (InfiniBand, RoCE), and Slurm configurations for peak performance.
  • Enhance HPC storage performance (Ceph, Lustre, NFS) and distributed computing efficiency.

Client Support & Collaboration
  • Provide real-time technical support and troubleshooting for HPC users.
  • Engage with developers, DevOps, and system administrators to optimize cluster performance.
  • Document solutions and best practices, and contribute to internal knowledge bases.

PREFERRED QUALIFICATIONS

  • Experience with AMD MI300, MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM.
  • Exposure to containerized workloads using Singularity or Docker in HPC.
  • Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible).
  • Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues.

More Info

Open to candidates from: Indian

About Company

For nearly 50 years, AMD (NASDAQ: AMD) has driven innovation in high-performance computing, graphics, and visualization technologies, the building blocks for gaming, immersive platforms, and the datacenter. Hundreds of millions of consumers, leading Fortune 500 businesses, and cutting-edge scientific research facilities around the world rely on AMD technology daily to improve how they live, work, and play.

Job ID: 122723673
