Aptly Technology Corporation

InfiniBand Engineer (High-Performance Networking)

5-7 Years

This job is no longer accepting applications

  • Posted a month ago

Job Description

Job Summary

We are seeking a highly skilled InfiniBand Engineer with strong expertise in advanced networking technologies to design, deploy, and support high-performance, low-latency network infrastructures. The ideal candidate will have hands-on experience with InfiniBand fabrics, data center networking, and large-scale distributed computing environments (HPC / AI / ML clusters).

Key Responsibilities

  • Design, implement, and manage large-scale InfiniBand (IB) fabrics in data center and HPC environments.
  • Configure and troubleshoot InfiniBand switches and adapters (e.g., Mellanox / NVIDIA IB platforms).
  • Perform fabric bring-up, subnet management (OpenSM), partitioning, and performance tuning.
  • Monitor and optimize network performance, including latency, throughput, and congestion control.
  • Integrate InfiniBand with Ethernet-based networking environments.
  • Support RDMA technologies (RoCE, iWARP) and GPUDirect environments.
  • Collaborate with system, storage, and compute teams to support AI/ML and distributed workloads.
  • Perform firmware upgrades, patching, and capacity planning.
  • Troubleshoot Layer 2 / Layer 3 networking issues (BGP, OSPF, VLAN, VXLAN, etc.).
  • Maintain documentation, network diagrams, and SOPs.

Required Skills & Qualifications

  • 5+ years of networking experience with strong fundamentals (TCP/IP, routing, switching).
  • Hands-on experience with InfiniBand technologies (HDR/NDR preferred).
  • Experience with NVIDIA / Mellanox Technologies switches and adapters.
  • Strong understanding of RDMA, congestion control, QoS, and low-latency tuning.
  • Experience with subnet managers (OpenSM) and fabric diagnostic tools.
  • Solid understanding of BGP, OSPF, and EVPN-VXLAN; MPLS is good to have.
  • Experience in HPC and AI/ML cluster networking environments is highly preferred.
  • Familiarity with Linux networking and troubleshooting tools.
  • Experience with automation (Python, Ansible) is a plus.

Preferred Qualifications

  • Experience supporting large GPU clusters.
  • Knowledge of storage networking (NVMe-oF, parallel file systems).
  • Experience with monitoring tools and telemetry systems.
  • Networking certifications (CCNP/CCIE or equivalent).

Key Competencies

  • Strong analytical and troubleshooting skills
  • Ability to work in high-performance, mission-critical environments
  • Excellent documentation and communication skills
  • Proactive problem-solving mindset

More Info

Job ID: 142817381