Job Title: HPC Engineer, AI Workloads & Infrastructure
Department: Operations, High Performance Computing (HPC)
About iVedha:
iVedha is a leading provider of cloud and managed services, helping enterprises modernize their IT infrastructure and accelerate digital transformation. Be part of Canada's next-generation sovereign AI infrastructure team, delivering high-performance computing solutions for AI, ML, and scientific workloads. Our mission is to empower innovators with secure, scalable, and sustainable compute platforms.
Role Overview:
We are seeking an HPC Engineer to join our operations team supporting AI workloads in a high-performance computing environment. This role focuses on building and managing HPC compute nodes, deploying Kubernetes clusters, and orchestrating bare-metal and virtualized environments. You will also work with advanced storage technologies such as VAST Data and MooseFS, ensuring seamless integration with GPU-accelerated infrastructure.
Key Responsibilities:
- Design, deploy, and maintain HPC clusters for AI/ML workloads, including GPU-accelerated compute nodes (NVIDIA DGX/HGX platforms).
- Implement and manage Kubernetes for containerized AI workloads, ensuring scalability and high availability.
- Configure and optimize bare-metal servers, VMs, and virtualized environments for HPC applications.
- Integrate and manage high-performance storage systems (VAST, MooseFS, Lustre, or similar parallel file systems).
- Implement job scheduling and orchestration using Slurm or equivalent tools for AI and HPC workloads.
- Monitor and tune system performance for GPU utilization, network throughput, and storage I/O.
- Automate deployment and configuration using Foreman, Ansible, Terraform, or similar tools.
- Collaborate with AI engineers, DevOps, and data teams to optimize infrastructure for LLM training, fine-tuning, and inference pipelines.
- Ensure security, compliance, and data integrity across HPC environments.
Required Skills & Experience:
- 3+ years in HPC engineering, systems administration, or AI infrastructure roles.
- Strong experience with Linux (RHEL/CentOS/Ubuntu) in HPC environments.
- Hands-on experience with Kubernetes, Docker, and container orchestration for AI workloads.
- Familiarity with GPU clusters, CUDA, NCCL, and NVIDIA ecosystem tools.
- Knowledge of high-speed interconnects (InfiniBand, RoCE) and networking for HPC.
- Experience with parallel/distributed file systems (VAST, MooseFS, Lustre, GPFS).
- Proficiency in automation and scripting (Python, Bash, Ansible).
- Understanding of job schedulers (Slurm, PBS, Torque) and workload optimization.
Nice-to-Have:
- Experience with cloud HPC platforms (Azure HPC, AWS ParallelCluster, or similar).
- Familiarity with AI/ML frameworks (PyTorch, TensorFlow) and MLOps pipelines.
- Exposure to observability tools (Prometheus, Grafana) for HPC environments.
Why Join iVedha:
- Work on cutting-edge AI infrastructure projects powering Canada's sovereign AI ecosystem.
- Collaborate with a world-class team of engineers and AI specialists.
- Competitive compensation, benefits, and opportunities for career growth in HPC and AI.