iVedha Inc.

HPC Engineer - AI Workloads & Infrastructure

3-5 Years
  • Posted a day ago

Job Description

Job Title: HPC Engineer - AI Workloads & Infrastructure

Department: Operations - High Performance Computing (HPC)

About iVedha:

iVedha is a leading provider of cloud and managed services, helping enterprises modernize their IT infrastructure and accelerate digital transformation. Be part of Canada's next-generation sovereign AI infrastructure team, delivering high-performance computing solutions for AI, ML, and scientific workloads. Our mission is to empower innovators with secure, scalable, and sustainable compute platforms.

Role Overview:

We are seeking an HPC Engineer to join our operational team supporting AI workloads in a high-performance computing environment. This role focuses on building and managing HPC compute nodes, deploying Kubernetes clusters, and orchestrating bare-metal and virtualized environments. You will also work with advanced storage technologies such as VAST Data and MooseFS, ensuring seamless integration with GPU-accelerated infrastructure.

Key Responsibilities:

  • Design, deploy, and maintain HPC clusters for AI/ML workloads, including GPU-accelerated compute nodes (NVIDIA DGX/HGX platforms).
  • Implement and manage Kubernetes for containerized AI workloads, ensuring scalability and high availability.
  • Configure and optimize bare-metal servers, VMs, and virtualized environments for HPC applications.
  • Integrate and manage high-performance storage systems (VAST, MooseFS, Lustre, or similar parallel file systems).
  • Implement job scheduling and orchestration using Slurm or equivalent tools for AI and HPC workloads.
  • Monitor and tune system performance for GPU utilization, network throughput, and storage I/O.
  • Automate deployment and configuration using Foreman, Ansible, Terraform, or similar tools.
  • Collaborate with AI engineers, DevOps, and data teams to optimize infrastructure for LLM training, fine-tuning, and inference pipelines.
  • Ensure security, compliance, and data integrity across HPC environments.

Required Skills & Experience:

  • 3+ years in HPC engineering, systems administration, or AI infrastructure roles.
  • Strong experience with Linux (RHEL/CentOS/Ubuntu) in HPC environments.
  • Hands-on experience with Kubernetes, Docker, and container orchestration for AI workloads.
  • Familiarity with GPU clusters, CUDA, NCCL, and NVIDIA ecosystem tools.
  • Knowledge of high-speed interconnects (InfiniBand, RoCE) and networking for HPC.
  • Experience with parallel/distributed file systems (VAST, MooseFS, Lustre, GPFS).
  • Proficiency in automation and scripting (Python, Bash, Ansible).
  • Understanding of job schedulers (Slurm, PBS, Torque) and workload optimization.

Nice-to-Have:

  • Experience with cloud HPC platforms (Azure HPC, AWS ParallelCluster, or similar).
  • Familiarity with AI/ML frameworks (PyTorch, TensorFlow) and MLOps pipelines.
  • Exposure to observability tools (Prometheus, Grafana) for HPC environments.

Why Join iVedha:

  • Work on cutting-edge AI infrastructure projects powering Canada's sovereign AI ecosystem.
  • Collaborate with a world-class team of engineers and AI specialists.
  • Competitive compensation, benefits, and opportunities for career growth in HPC and AI.

Job ID: 137803825