CirrusLabs

Network Engineer – AI Data Center Networking

  • Posted an hour ago

Job Description

We are hiring a talented Network Engineer – AI Data Center Networking to join our team. If you're excited to be part of a winning team, Cirrus Labs (http://www.cirruslabs.io) is a great place to grow your career.

Experience: 3+ years

Location: Bangalore or Hyderabad preferred

Shift Time: 2 PM to 11 PM IST

Position Summary

We are seeking a mid-level Network Engineer to support and operate high-performance data center networking environments used for AI and other performance-sensitive workloads. This role is suited to someone with strong hands-on experience in data center or enterprise network operations, solid L2/L3 and TCP/IP fundamentals, and proven ability to troubleshoot across physical, network, and transport layers.

The engineer will support modern data center networking environments built around high-speed switching and routing, leaf-spine architectures, ECMP, VLAN/VRF segmentation, VXLAN overlay/underlay models, and hybrid connectivity to cloud VPC environments. The role requires a strong operational mindset with a focus on stability, uptime, observability, and performance troubleshooting across on-prem infrastructure connecting compute, storage, and cloud environments.

This is not a DevOps or platform engineering role. However, the engineer should be comfortable using network-focused automation tools and basic scripting to improve repeatability and reduce manual effort, while working closely with DevOps teams to integrate network automation into broader workflows.

Key Responsibilities

  • Operate and support data center network infrastructure, including high-speed switching, routing, optics, cabling, and transceivers in on-prem environments.
  • Manage IP addressing, VLANs, VRFs, and segmentation constructs to support secure and scalable network operations.
  • Support modern data center architectures including leaf-spine fabrics, ECMP-based forwarding, and VXLAN overlay/underlay environments, including day-2 operations.
  • Execute network changes, maintenance activities, upgrades, and validation tasks with a strong focus on availability, reliability, and operational discipline.
  • Troubleshoot L1–L3 issues impacting network performance, including optics/cabling faults, interface errors, MTU mismatches, routing behavior, packet loss, congestion, retransmissions, and ECMP pathing.
  • Diagnose latency, throughput, drop, and east-west traffic performance issues in high-performance environments.
  • Use Linux CLI, logs, and common network troubleshooting tools to investigate incidents and validate infrastructure behavior.
  • Partner with compute, storage, application, and platform teams to isolate whether bottlenecks originate from the network, host, or application layer.
  • Develop and maintain network automation artifacts such as templates, playbooks, and scripts to improve consistency and reduce manual tasks.
  • Work with DevOps teams to integrate network automation into existing deployment and operational workflows without owning the DevOps platform.
  • Support network observability and telemetry, including streaming telemetry, flow data, and performance-focused monitoring. Candidates should be very familiar with end-to-end monitoring and alerting of production networks. This environment uses NetQ, but experience with an equivalent tool is acceptable.
  • Contribute to continuous improvement of network operations through standardization, documentation, and operational best practices.

Must-have (Required Qualifications)

  • 3–5+ years in data center or enterprise network operations/support, with real production support experience.
  • CCNA-level knowledge or equivalent hands-on networking foundation: enough to be productive quickly without being taught basic networking.
  • Strong L2/L3 and TCP/IP fundamentals: proven troubleshooting across L1–L3 (physical, switching, routing, forwarding).
  • Experience supporting on-prem physical infrastructure: switches, optics, cabling, transceivers, port health, physical fault isolation.
  • Strong Linux operational comfort: CLI, logs, interfaces, routing tables, packet tools, basic host/network troubleshooting.
  • Exposure to leaf-spine architecture.
  • General network monitoring and observability experience: must be strong in end-to-end production monitoring, alerting, and fault isolation; NetQ-equivalent experience is acceptable.
  • EVPN-VXLAN experience.
  • Ability to troubleshoot performance issues across the traffic path: latency, packet loss, congestion, MTU, retransmissions, ECMP pathing.
  • Understanding of modern data center architecture: leaf-spine, ECMP, VLAN/VRF segmentation, east-west traffic patterns for AI/HPC-style environments.
  • Familiarity with underlay/overlay concepts in VXLAN environments: enough to support day-2 operations and trace traffic logically and physically from point A to point B.
  • Foundational knowledge of RDMA/RoCE concepts: not deep design expertise, but enough to understand AI/HPC traffic sensitivity and why congestion behavior matters.
  • Working exposure to automation/config management: Ansible-like repeatable playbooks, structured execution, and basic scripting for network operations. This should be operational exposure, not deep Python engineering.
  • Ability to collaborate with DevOps teams: can work within existing tooling, workflows, and change practices.
  • Experience working in an Agile/Kanban fashion.
  • Basic VPC-to-cloud awareness: the lowest-priority must-have, since it is not central to the role.

Nice to have:

  • Experience with NVIDIA NetQ or equivalent tools such as NMX: helpful because NetQ aligns closely with the telemetry-heavy operating model used in Spectrum environments.
  • Exposure to NVIDIA Spectrum switching platforms/ecosystem: valuable, but teachable within a month for someone strong in core networking. In modern NVIDIA AI fabrics, telemetry and operations visibility are a core part of day-2 support, not an afterthought.
  • Hybrid networking experience: cloud interconnectivity, routing, and policy across on-prem and cloud.
  • Production EVPN-VXLAN experience: MP-BGP EVPN, VNIs, anycast gateway, multi-tenancy/VRFs, day-2 troubleshooting.
  • Good structured automation practices: templates, version control, peer review, rollback discipline.
  • Basic Python depth: useful, but lower priority than network troubleshooting, Linux, and monitoring.
  • Kubernetes networking familiarity: CNI, overlays, service networking, and interaction with the underlay.
  • Host-side NIC tuning experience.
  • Experience with high-performance or parallel storage networking.
  • AI/HPC networking experience: rail-optimized, dual-plane, CLOS, GPU clusters, east-west optimization.
  • Exposure to GPU-cluster communication patterns such as NCCL: a strong differentiator, not a gate.

Job ID: 147319259