Senior HPC Engineer

Fresher
  • Posted 14 hours ago

Job Description

Role Title: Senior HPC Engineer

Reports To: Domain Architect - AI Compute

Location: Remote, India (Must align with Client Time Zone)

Employment Type: Full-Time

About The Role

The Senior HPC Engineer is the Foreman of the AI Factory. While the Domain Architect defines the architectural vision, you are responsible for the hands-on build and deployment. As Technical Squad Lead for our offshore engineering teams, you bridge the gap between the onshore architectural vision and hands-on execution.

We are a System Integrator that thrives on velocity and precision. You will not just maintain clusters; you will lead the automated deployment of NVIDIA SuperPOD and BasePOD infrastructure for global enterprise clients. You are the Lieutenant to the Domain Architect, translating High-Level Designs (HLDs) into executable Ansible playbooks and ensuring your squad of HPC Engineers delivers defect-free infrastructure.

In this role, you are 100% Delivery-Focused, split between Technical Leadership (40%) and Hands-on Engineering (60%). You are the escalation point for complex kernel panics, the guardian of our Infrastructure-as-Code (IaC) repository, and the mentor who unblocks junior engineers when a Slurm job fails to schedule.

CRITICAL REQUIREMENT: This role typically operates on Shift Hours to align with the onshore client's time zone (e.g., early shifts for Australian clients, or split shifts for European clients).

Key Responsibilities

  • Hands-on Engineering & Automation (60%)
      • Cluster Provisioning Factory:
          • Lead the deployment of NVIDIA Base Command Manager (BCM, formerly Bright Cluster Manager) to provision bare-metal DGX/HGX nodes at scale.
          • Develop and maintain the Ansible / Terraform library used to configure OS settings, user authentication (LDAP/AD), and storage mounts across hundreds of nodes.
          • Execute HPL (High-Performance Linpack) and nccl-tests to validate cluster performance, tuning BIOS and OS parameters to hit Gold Standard benchmarks.
      • Scheduler & Workload Orchestration:
          • Configure complex Slurm Workload Manager policies, including Fair Share, Preemption, and GPU Partitioning (MIG).
          • Integrate Kubernetes-based orchestrators (e.g., NVIDIA Base Command, Run:AI, or Red Hat OpenShift) with the underlying HPC hardware.
      • Deep-Dive Troubleshooting:
          • Debug Silent Data Corruption and Xid Errors on GPUs, analysing nvidia-smi logs and kernel message buffers (dmesg).
          • Diagnose fabric-related performance drops (e.g., identifying a specific flapping link causing global slowdowns) in collaboration with the Network Squad.
  • Squad Leadership & Quality Assurance (40%)
      • Technical Direction (The Foreman):
          • Translate the Low-Level Design (LLD) provided by the Domain Architect into granular Jira tasks for your squad of HPC Engineers.
          • Conduct daily stand-ups to unblock engineers, clarifying requirements and making technical decisions on the fly (e.g., "Use Ansible roles, not shell scripts, for this").
      • Code Quality & Governance:
          • Act as the Primary Gatekeeper for the code repository. Perform mandatory Code Reviews on all Pull Requests (PRs) to ensure idempotency and error handling.
          • Enforce Config-as-Code discipline, ensuring no manual changes are made to production clusters without a committed playbook.
      • Mentorship:
          • Guide mid-level and junior engineers on best practices for Linux Systems Administration and HPC environments.

Technical Competencies

Essential Skills

  • HPC & AI Infrastructure:
      • Expert-level knowledge of NVIDIA Base Command Manager (BCM) or Metal as a Service (MAAS) provisioning tools.
      • Deep understanding of Slurm configuration (cgroups, plugin development, accounting).
      • Proficiency with NVIDIA DGX/HGX hardware architecture and the associated software stack (Drivers, CUDA, DCGM).
  • Linux & Automation (DevOps for Hardware):
      • Mastery of Red Hat Enterprise Linux (RHEL) / Ubuntu internals (systemd, Kernel Tuning, Hugepages).
      • Advanced proficiency in Ansible (writing custom modules/roles) and Python (automating admin tasks).
      • Experience with Git workflows (Branching, PRs, CI/CD).
  • Containerisation:
      • Hands-on experience with Docker, Singularity/Apptainer, and Kubernetes (specifically NVIDIA GPU Operator and NVIDIA Network Operator).

Desirable Experience

  • Network Awareness: Ability to troubleshoot basic InfiniBand/RoCEv2 issues (ibstat, perfquery) to distinguish between a Node Issue and a Network Issue.
  • Storage Integration: Experience mounting high-performance parallel file systems (VAST/Lustre/WEKA/GPFS) and tuning client-side performance.
  • Certifications:
      • NVIDIA Certified Associate - AI in the Data Center.
      • Red Hat Certified Engineer (RHCE).
      • CKA (Certified Kubernetes Administrator).

Success Metrics (KPIs)

  • Deployment Velocity: Reduction in Time-to-Hello-World (time from power-on to running the first successful GPU job) for new clusters.
  • Code Quality: >95% of Pull Requests pass automated linting and require fewer than 2 review cycles before merge.
  • Stability: Zero Configuration Drift incidents in production (e.g., manual changes breaking the cluster), achieved through strict IaC enforcement.


Job ID: 144906537