Advanced Micro Devices (AMD)

SMTS Systems Design Eng.

4-8 years of experience
  • Posted 14 hours ago

Job Description

HPC System Administration & Troubleshooting:

  • Manage and optimize HPC clusters, ensuring high availability and performance.
  • Troubleshoot GPU, CPU, network-driver, firmware, and OS-level issues.
  • Debug storage, networking, and job scheduling bottlenecks in Slurm-based environments.

Kubernetes & Cloud HPC Environments:

  • Deploy and manage HPC workloads in Kubernetes for AI/ML and parallel computing.
  • Optimize OpenStack-based HPC clusters with Ceph, Cinder, and Neutron for cloud scalability.
  • Implement containerized HPC workflows using Kubernetes and OpenShift.

Automation & Infrastructure as Code (IaC):

  • Develop Ansible and Terraform scripts for provisioning and managing HPC resources.
  • Automate job scheduling, cluster monitoring, and log analysis using Python.
  • Optimize CI/CD pipelines for HPC and AI/ML applications.

Performance Tuning & Benchmarking:

  • Benchmark and optimize multi-node HPC workloads (MPI, NCCL, ROCm, CUDA).
  • Tune OS parameters, networking (InfiniBand, RoCE), and Slurm configurations for peak performance.
  • Enhance HPC storage performance (Ceph, Lustre, NFS) and distributed computing efficiency.

Client Support & Collaboration:

  • Provide real-time technical support and troubleshooting for HPC users.
  • Engage with developers, DevOps, and system administrators to optimize cluster performance.
  • Document solutions and best practices, and contribute to internal knowledge bases.

Preferred Qualifications:

  • Experience with AMD MI300, MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM.
  • Exposure to containerized workloads using Singularity or Docker in HPC.
  • Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible).
  • Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues.

About Company

For nearly 50 years, AMD (NASDAQ: AMD) has driven innovation in high-performance computing, graphics, and visualization technologies, the building blocks for gaming, immersive platforms, and the datacenter. Hundreds of millions of consumers, leading Fortune 500 businesses, and cutting-edge scientific research facilities around the world rely on AMD technology daily to improve how they live, work, and play.

Job ID: 114472603
