HPC Engineer - AI Workloads & Infrastructure

iVedha Inc.

India

3-5 Years

Posted 2 days ago

Job Description

Job Title: HPC Engineer - AI Workloads & Infrastructure

Department: Operations - High Performance Computing (HPC)

About iVedha:

iVedha is a leading provider of cloud and managed services, helping enterprises modernize their IT infrastructure and accelerate digital transformation. Be part of Canada's next-generation sovereign AI infrastructure team, delivering high-performance computing solutions for AI, ML, and scientific workloads. Our mission is to empower innovators with secure, scalable, and sustainable compute platforms.

Role Overview:

We are seeking an HPC Engineer to join our operational team supporting AI workloads in a high-performance computing environment. This role focuses on building and managing HPC compute nodes, deploying Kubernetes clusters, and orchestrating bare-metal and virtualized environments. You will also work with advanced storage technologies such as VAST Data and MooseFS, ensuring seamless integration with GPU-accelerated infrastructure.

Key Responsibilities:

Design, deploy, and maintain HPC clusters for AI/ML workloads, including GPU-accelerated compute nodes (NVIDIA DGX/HGX platforms).
Implement and manage Kubernetes for containerized AI workloads, ensuring scalability and high availability.
Configure and optimize bare-metal servers, VMs, and virtualized environments for HPC applications.
Integrate and manage high-performance storage systems (VAST, MooseFS, Lustre, or similar parallel file systems).
Implement job scheduling and orchestration using Slurm or equivalent tools for AI and HPC workloads.
Monitor and tune system performance for GPU utilization, network throughput, and storage I/O.
Automate deployment and configuration using Foreman, Ansible, Terraform, or similar tools.
Collaborate with AI engineers, DevOps, and data teams to optimize infrastructure for LLM training, fine-tuning, and inference pipelines.
Ensure security, compliance, and data integrity across HPC environments.

Required Skills & Experience:

3+ years in HPC engineering, systems administration, or AI infrastructure roles.
Strong experience with Linux (RHEL/CentOS/Ubuntu) in HPC environments.
Hands-on experience with Kubernetes, Docker, and container orchestration for AI workloads.
Familiarity with GPU clusters, CUDA, NCCL, and other NVIDIA ecosystem tools.
Knowledge of high-speed interconnects (InfiniBand, RoCE) and networking for HPC.
Experience with parallel/distributed file systems (VAST, MooseFS, Lustre, GPFS).
Proficiency in automation and scripting (Python, Bash, Ansible).
Understanding of job schedulers (Slurm, PBS, Torque) and workload optimization.

Nice-to-Have:

Experience with cloud HPC platforms (Azure HPC, AWS ParallelCluster, or similar).
Familiarity with AI/ML frameworks (PyTorch, TensorFlow) and MLOps pipelines.
Exposure to observability tools (Prometheus, Grafana) for HPC environments.

Why Join iVedha:

Work on cutting-edge AI infrastructure projects powering Canada's sovereign AI ecosystem.
Collaborate with a world-class team of engineers and AI specialists.
Competitive compensation, benefits, and opportunities for career growth in HPC and AI.