HPC System Administration & Troubleshooting:
- Manage and optimize HPC clusters, ensuring high availability and performance.
- Troubleshoot GPU, CPU, network drivers, firmware, and OS-level issues.
- Debug storage, networking, and job scheduling bottlenecks in Slurm-based environments.
Kubernetes & Cloud HPC Environments:
- Deploy and manage HPC workloads in Kubernetes for AI/ML and parallel computing.
- Optimize OpenStack-based HPC clusters with Ceph, Cinder, and Neutron for cloud scalability.
- Implement containerized HPC workflows using Kubernetes and OpenShift.
Automation & Infrastructure as Code (IaC):
- Develop Ansible and Terraform scripts for provisioning and managing HPC resources.
- Automate job scheduling, cluster monitoring, and log analysis using Python.
- Optimize CI/CD pipelines for HPC and AI/ML applications.
Performance Tuning & Benchmarking:
- Benchmark and optimize multi-node HPC workloads (MPI, NCCL, ROCm, CUDA).
- Tune OS parameters, networking (InfiniBand, RoCE), and Slurm configurations for peak performance.
- Enhance HPC storage performance (Ceph, Lustre, NFS) and distributed computing efficiency.
Client Support & Collaboration:
- Provide real-time technical support and troubleshooting for HPC users.
- Engage with developers, DevOps, and system administrators to optimize cluster performance.
- Document solutions, best practices, and contribute to internal knowledge bases.
PREFERRED QUALIFICATIONS:
- Experience with AMD MI300, MI2X0 GPUs, ROCm, MPI, UCX, or XPMEM.
- Exposure to containerized workloads using Singularity or Docker in HPC.
- Familiarity with OpenStack deployment automation (e.g., TripleO, Kolla, or OpenStack-Ansible).
- Experience in customer-facing technical roles, with a strong ability to troubleshoot live issues.