Job Summary
We are seeking a highly skilled InfiniBand Engineer with strong expertise in advanced networking technologies to design, deploy, and support high-performance, low-latency network infrastructures. The ideal candidate will have hands-on experience with InfiniBand fabrics, data center networking, and large-scale distributed computing environments (HPC / AI / ML clusters).
Key Responsibilities
- Design, implement, and manage large-scale InfiniBand (IB) fabrics in data center and HPC environments.
- Configure and troubleshoot InfiniBand switches and adapters (e.g., Mellanox / NVIDIA IB platforms).
- Perform fabric bring-up, subnet management (OpenSM), partitioning, and performance tuning.
- Monitor and optimize network performance, including latency, throughput, and congestion control.
- Integrate InfiniBand with Ethernet-based networking environments.
- Support RDMA technologies (RoCE, iWARP) and GPUDirect environments.
- Collaborate with system, storage, and compute teams to support AI/ML and distributed workloads.
- Perform firmware upgrades, patching, and capacity planning.
- Troubleshoot Layer 2 / Layer 3 networking issues (BGP, OSPF, VLAN, VXLAN, etc.).
- Maintain documentation, network diagrams, and SOPs.
Required Skills & Qualifications
- 5+ years of networking experience with strong fundamentals (TCP/IP, routing, switching).
- Hands-on experience with InfiniBand technologies (HDR/NDR preferred).
- Experience with NVIDIA (formerly Mellanox) InfiniBand switches and adapters.
- Strong understanding of RDMA, congestion control, QoS, and low-latency tuning.
- Experience with subnet managers (OpenSM) and fabric diagnostic tools.
- Solid understanding of BGP, OSPF, and EVPN-VXLAN; MPLS knowledge is good to have.
- Experience with HPC or AI/ML cluster networking is highly preferred.
- Familiarity with Linux networking and troubleshooting tools.
- Experience with automation (Python, Ansible) is a plus.
Preferred Qualifications
- Experience supporting large GPU clusters.
- Knowledge of storage networking (NVMe-oF, parallel file systems).
- Experience with monitoring tools and telemetry systems.
- Networking certifications (CCNP/CCIE or equivalent).
Key Competencies
- Strong analytical and troubleshooting skills.
- Ability to work in high-performance, mission-critical environments.
- Excellent documentation and communication skills.
- Proactive problem-solving mindset.