We are seeking an experienced HPC Systems Engineer with 7+ years of experience in high-performance computing (HPC) environments. This role requires hands-on experience with Python, Kubernetes (K8s), Slurm, OpenStack, and Ansible, along with the ability to support external clients in live troubleshooting sessions.
THE PERSON:
The ideal candidate will have deep technical knowledge of drivers, troubleshooting methods, and system-level debugging, and will play a key role in managing, optimizing, and troubleshooting HPC clusters and cloud-based HPC environments.
KEY RESPONSIBILITIES:
HPC System Administration & Troubleshooting:
- Manage and optimize HPC clusters, ensuring high availability and performance.
- Troubleshoot GPU, CPU, and network driver issues, as well as firmware and OS-level problems.
- Debug storage, networking, and job scheduling bottlenecks in Slurm-based environments.
Kubernetes & Cloud HPC Environments:
- Deploy and manage HPC workloads in Kubernetes for AI/ML and parallel computing.
- Optimize OpenStack-based HPC clusters with Ceph, Cinder, and Neutron for cloud scalability.
- Implement containerized HPC workflows using Kubernetes and OpenShift.
Automation & Infrastructure as Code (IaC):
- Develop Ansible playbooks and Terraform configurations for provisioning and managing HPC resources.
- Automate job scheduling, cluster monitoring, and log analysis using Python.
- Optimize CI/CD pipelines for HPC and AI/ML applications.
Performance Tuning & Benchmarking:
- Benchmark and optimize multi-node HPC workloads (MPI, NCCL, ROCm, CUDA).
- Tune OS parameters, networking (InfiniBand, RoCE), and Slurm configurations for peak performance.
- Enhance HPC storage performance (Ceph, Lustre, NFS) and distributed computing efficiency.
Client Support & Collaboration:
- Provide real-time technical support and troubleshooting for HPC users.
- Engage with developers, DevOps, and system administrators to optimize cluster performance.
- Document solutions and best practices, and contribute to internal knowledge bases.
PREFERRED QUALIFICATIONS:
- Experience with AMD MI300 or MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM.
- Exposure to containerized workloads using Singularity or Docker in HPC.
- Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible).
- Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues.