Client Provisioning: Execute the deployment of high-performance storage clients (VAST, Weka, GPFS/Spectrum Scale, Lustre) on bare-metal DGX/HGX nodes using Ansible.
Protocol Configuration: Configure and tune RDMA-based protocols (NVMe-oF, NFS over RDMA, GPUDirect Storage) to bypass CPU bounce buffers and deliver data directly to GPU memory.
Kubernetes Integration: Install and troubleshoot CSI (Container Storage Interface) drivers to ensure dynamic provisioning of Persistent Volumes (PVs) for AI workloads running in K8s.
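Dynamic provisioning of the kind described above hinges on a correctly defined StorageClass. The following is a minimal sketch; the provisioner name (`csi.example-vendor.com`), parameters, and sizes are placeholders, since each vendor's CSI driver documents its own values.

```yaml
# Hypothetical StorageClass for a vendor CSI driver; provisioner name and
# parameters are illustration only -- consult the vendor's CSI documentation.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hpc-scratch
provisioner: csi.example-vendor.com   # placeholder CSI driver name
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: hpc-scratch
  resources:
    requests:
      storage: 10Ti
```

When a PVC sticks in Pending, `kubectl describe pvc training-data` plus the logs of the CSI controller and node pods are typically the first diagnostic stops.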
Mount Management: Manage complex mount maps and automounter configurations to ensure consistent namespace views across thousands of compute nodes.
2. Validation & Performance Benchmarking
Throughput Testing: Execute standard I/O benchmarks to validate that the storage subsystem meets the Gold Standard read/write targets (e.g., 400GB/s read throughput).
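A per-client run for such a validation can be expressed as an fio job file. This is a sketch only: block size, job count, runtime, and paths are placeholders to be adapted per platform, and cluster-level targets are validated by aggregating per-node results (or by using a coordinated tool such as IOR).

```ini
; Illustrative fio job for sequential-read throughput validation.
; All values below are starting points, not prescriptions.
[global]
rw=read
bs=1m
direct=1
ioengine=libaio
iodepth=32
time_based=1
runtime=120
group_reporting=1

[seqread]
numjobs=16
directory=/mnt/scratch/fio
size=64g
```

Running with `--output-format=json` makes the results machine-parseable for acceptance reporting.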
Latency Tuning: Tune client-side kernel parameters (read-ahead buffers, queue depths, sysctl settings) to minimize latency for small-file random I/O patterns common in checkpointing.
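Client-side tuning of this kind usually lands in a sysctl drop-in file. The values below are illustrative starting points, not prescriptions; read-ahead itself is set per device or mount (e.g., via `blockdev --setra` or filesystem-specific mount options), and NFS client parallelism via the `nconnect`, `rsize`, and `wsize` mount options.

```conf
# /etc/sysctl.d/90-storage.conf -- illustrative values only
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
```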
Acceptance Reporting: Generate As-Built storage validation reports, documenting effective throughput and IOPS for client sign-off.
3. Operations & Support
Capacity & Quotas: Implement project-level quotas and monitor usage trends to prevent Disk Full outages on critical scratch filesystems.
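Usage-trend monitoring of this sort can be sketched in a few lines of Python. The threshold value and paths here are hypothetical; real deployments would feed filesystem-specific quota data (e.g., from the vendor API or `lfs quota`) into the same check.

```python
import shutil

WARN_PCT = 85.0  # hypothetical alert threshold, tune per filesystem


def usage_pct(used_bytes: int, total_bytes: int) -> float:
    """Return filesystem usage as a percentage of capacity."""
    return 100.0 * used_bytes / total_bytes


def check_mount(path: str, warn_pct: float = WARN_PCT) -> bool:
    """True if the filesystem backing `path` has crossed the warning threshold."""
    du = shutil.disk_usage(path)
    return usage_pct(du.total - du.free, du.total) >= warn_pct
```

Wired into the monitoring stack, a check like this fires well before a scratch filesystem actually fills.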
Ticket Resolution: Handle L2 support tickets for storage issues, such as "stale file handle" errors, slow dataset loading, or CSI driver crashes.
Lifecycle Management: Execute non-disruptive client-side driver upgrades and firmware patches during maintenance windows.
Technical Competencies
Essential Skills
High-Performance Storage:
Parallel Filesystems: Hands-on operational experience with at least one major AI storage platform: VAST Data, Weka.io, DDN Lustre (EXAScaler), or IBM GPFS (Spectrum Scale).
Linux I/O Stack: Deep understanding of the Linux VFS (Virtual File System), block devices, and how to debug I/O performance using tools like iostat, iotop, and strace.
RDMA Storage: Experience configuring NVMe-over-Fabrics (NVMe-oF) or NFS-over-RDMA, understanding the dependency on the underlying InfiniBand/RoCE network.
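For NVMe-oF, persistent fabric attachment is commonly driven from a discovery configuration consumed by `nvme connect-all`. A sketch, assuming nvme-cli's `discovery.conf` format; the transport addresses are placeholders.

```conf
# /etc/nvme/discovery.conf -- one discovery controller per line,
# options are passed to `nvme connect-all`; addresses are placeholders.
-t rdma -a 10.0.0.10 -s 4420
-t rdma -a 10.0.0.11 -s 4420
```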
Automation & Containerisation:
Ansible Storage: Proficiency in writing Ansible playbooks to automate the installation of storage clients and configuration of mount points.
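A minimal sketch of such a playbook is shown below. The host group, package name, server address, and mount options are placeholders; vendor clients (VAST, Weka, Lustre) each have their own install steps, but the shape is the same.

```yaml
# Illustrative client-mount playbook; all names and options are placeholders.
- hosts: dgx_nodes
  become: true
  tasks:
    - name: Install NFS client utilities
      ansible.builtin.package:
        name: nfs-common
        state: present

    - name: Mount scratch filesystem over RDMA
      ansible.posix.mount:
        path: /mnt/scratch
        src: "storage-vip.example:/scratch"
        fstype: nfs
        opts: "proto=rdma,port=20049,nconnect=16"
        state: mounted
```

Using the `mount` module with `state: mounted` keeps `/etc/fstab` and the live mount table in sync, which is what makes the run idempotent across thousands of nodes.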
Kubernetes Storage: Understanding of StorageClasses, PVCs, and how to debug CSI Driver pods (checking logs for mount failures).
GPUDirect: Conceptual understanding of NVIDIA GPUDirect Storage (GDS) and the ability to verify if GDS is active.
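One basic GDS health signal is whether the `nvidia_fs` kernel module is loaded (NVIDIA also ships a `gdscheck` utility for a fuller report). A small sketch, with the parsing split out so it is testable; the module name check is the only assumption here.

```python
def module_loaded(proc_modules_text: str, name: str) -> bool:
    """Return True if `name` appears as a loaded module in /proc/modules content."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


# On a live node (GDS depends on the nvidia_fs kernel module, among other checks):
# with open("/proc/modules") as f:
#     print("nvidia_fs loaded:", module_loaded(f.read(), "nvidia_fs"))
```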
Desirable Experience
Vendor Specifics: Deep certification or experience with Pure Storage (FlashBlade) or NetApp ONTAP AI configurations.
Object Storage: Experience interacting with S3-compatible object stores via CLI for model weight retrieval.
Data Migration: Experience using tools like fpsync or rclone to move petabyte-scale datasets between tiers.
Certifications
Highly Desirable:
NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO)
Vendor Certifications:
VAST Certified Administrator (VCP-AD1)
WEKA Technical Xpert Certification
Red Hat Certified Specialist in Storage Administration
Success Metrics (KPIs)
I/O Performance: Achieving >95% of the theoretical line-rate throughput on IOR/FIO benchmarks for provisioned clients.
Mount Stability: Zero stale-file-handle errors or disconnected mounts across the cluster during the 72-hour burn-in period.
Ticket Velocity: Consistently meeting SLAs for storage-related support tickets.
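Verifying the I/O performance KPI above against benchmark output can be automated. A sketch that sums aggregate read bandwidth from `fio --output-format=json`, assuming fio's standard JSON schema in which `jobs[].read.bw` is reported in KiB/s; the sample string is trimmed down from real fio output for illustration.

```python
import json


def read_bw_gbps(fio_json: str) -> float:
    """Sum read bandwidth (GB/s) across jobs in `fio --output-format=json` output.

    fio reports the `bw` field in KiB/s.
    """
    data = json.loads(fio_json)
    total_kib_s = sum(job["read"]["bw"] for job in data["jobs"])
    return total_kib_s * 1024 / 1e9


# Trimmed-down sample (real fio output carries many more fields):
sample = '{"jobs": [{"read": {"bw": 1000000}}, {"read": {"bw": 1000000}}]}'
```

Summing the per-node figures this way gives the number to compare against the provisioned line-rate target for the >95% check.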