World Wide Technology

HPC Engineer - Storage

Fresher
  • Posted 17 hours ago

Job Description

1. Storage Integration & Client Configuration

  • Client Provisioning: Execute the deployment of high-performance storage clients (VAST, Weka, GPFS/Spectrum Scale, Lustre) on bare-metal DGX/HGX nodes using Ansible.
  • Protocol Configuration: Configure and tune RDMA-based protocols (NVMe-oF, NFS over RDMA, GPUDirect Storage) to bypass the CPU and deliver data directly to GPU memory.
  • Kubernetes Integration: Install and troubleshoot CSI (Container Storage Interface) drivers to ensure dynamic provisioning of Persistent Volumes (PVs) for AI workloads running in K8s.
  • Mount Management: Manage complex mount maps and automounter configurations to ensure consistent namespace views across thousands of compute nodes.

2. Validation & Performance Benchmarking

  • Throughput Testing: Execute standard I/O benchmarks to validate that the storage subsystem meets the "Gold Standard" read/write targets (e.g., 400 GB/s read throughput).
  • Latency Tuning: Tune client-side kernel parameters (read-ahead buffers, queue depths, sysctl settings) to minimize latency for small-file random I/O patterns common in checkpointing.
  • Acceptance Reporting: Generate As-Built storage validation reports, documenting effective throughput and IOPS for client sign-off.

3. Operations & Support

  • Capacity & Quotas: Implement project-level quotas and monitor usage trends to prevent "disk full" outages on critical scratch filesystems.
  • Ticket Resolution: Handle L2 support tickets for storage issues such as stale file handles, slow dataset loading, or CSI driver crashes.
  • Lifecycle Management: Execute non-disruptive client-side driver upgrades and firmware patches during maintenance windows.

Technical Competencies

Essential Skills

High-Performance Storage:

  • Parallel Filesystems: Hands-on operational experience with at least one major AI storage platform: VAST Data, WEKA, DDN Lustre (EXAScaler), or IBM GPFS (Spectrum Scale).
  • Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System), block devices, and how to debug I/O performance using tools like iostat, iotop, and strace.
  • RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, understanding the dependency on the underlying InfiniBand/RoCE network.

Automation & Containerisation:

  • Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.
  • Kubernetes Storage: Understanding of StorageClasses, PVCs, and how to debug CSI Driver pods (checking logs for mount failures).
  • GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify whether GDS is active.

Desirable Experience

  • Vendor Specifics: Certification in, or in-depth experience with, Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.
  • Object Storage: Experience interacting with S3-compatible object stores via CLI for model weight retrieval.
  • Data Migration: Experience using tools like fpsync or rclone to move petabyte-scale datasets between tiers.

Certifications

Highly Desirable:

  • NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
  • Vendor Certifications:
    • VAST Certified Administrator (VCP-AD1)
    • WEKA Technical Xpert Certification
  • Red Hat Certified Specialist in Storage Administration

Success Metrics (KPIs)

  • I/O Performance: Achieving >95% of the theoretical line-rate throughput on IOR/FIO benchmarks for provisioned clients.
  • Mount Stability: Zero stale file handles or disconnected mounts across the cluster during the 72-hour burn-in period.
  • Ticket Velocity: Consistently meeting SLAs for storage-related support tickets.

Job ID: 146828787