Network Engineer – AI Data Center Networking

CirrusLabs

Hyderabad, India

3-5 Years

Job Description

We are hiring a talented Network Engineer – AI Data Center Networking to join our team. If you're excited to be part of a winning team, CirrusLabs (http://www.cirruslabs.io) is a great place to grow your career.

Experience: 3+ years

Location: Bangalore or Hyderabad preferred

Shift Time: 2 PM to 11 PM IST

Position Summary

We are seeking a mid-level Network Engineer to support and operate high-performance data center networking environments used for AI and other performance-sensitive workloads. This role is suited to someone with strong hands-on experience in data center or enterprise network operations, solid L2/L3 and TCP/IP fundamentals, and proven ability to troubleshoot across physical, network, and transport layers.

The engineer will be responsible for supporting modern data center networking environments built around high-speed switching and routing, leaf-spine architectures, ECMP, VLAN/VRF segmentation, VXLAN overlay/underlay models, and hybrid connectivity to cloud VPC environments. The role requires a strong operational mindset with a focus on stability, uptime, observability, and performance troubleshooting across on-prem infrastructure connecting compute, storage, and cloud environments.

This is not a DevOps or platform engineering role. However, the engineer should be comfortable using network-focused automation tools and basic scripting to improve repeatability and reduce manual effort, while working closely with DevOps teams to integrate network automation into broader workflows.

Key Responsibilities

Operate and support data center network infrastructure, including high-speed switching, routing, optics, cabling, and transceivers in on-prem environments.
Manage IP addressing, VLANs, VRFs, and segmentation constructs to support secure and scalable network operations.
Support modern data center architectures including leaf-spine fabrics, ECMP-based forwarding, and VXLAN overlay/underlay environments, including day-2 operations.
Execute network changes, maintenance activities, upgrades, and validation tasks with a strong focus on availability, reliability, and operational discipline.
Troubleshoot L1–L3 issues impacting network performance, including optics/cabling faults, interface errors, MTU mismatches, routing behavior, packet loss, congestion, retransmissions, and ECMP pathing.
Diagnose latency, throughput, drop, and east-west traffic performance issues in high-performance environments.
Use Linux CLI, logs, and common network troubleshooting tools to investigate incidents and validate infrastructure behavior.
Partner with compute, storage, application, and platform teams to isolate whether bottlenecks originate from the network, host, or application layer.
Develop and maintain network automation artifacts such as templates, playbooks, and scripts to improve consistency and reduce manual tasks.
Work with DevOps teams to integrate network automation into existing deployment and operational workflows without owning the DevOps platform.
Support network observability and telemetry, including streaming telemetry, flow data, and performance-focused monitoring capabilities. The engineer should be very familiar with end-to-end monitoring and alerting for production networks. The tool used in this environment is NetQ, but experience with an equivalent tool is acceptable.
Contribute to continuous improvement of network operations through standardization, documentation, and operational best practices.

Must-have (Required Qualifications)

3–5+ years in data center or enterprise network operations/support
Real production support experience is required.
CCNA-level knowledge or equivalent hands-on networking foundation
Enough to be productive quickly without needing to be taught basic networking.
Strong L2/L3 and TCP/IP fundamentals
Proven troubleshooting across L1–L3: physical, switching, routing, forwarding.
Experience supporting on-prem physical infrastructure
Switches, optics, cabling, transceivers, port health, physical fault isolation.
Strong Linux operational comfort
CLI, logs, interfaces, routing tables, packet tools, basic host/network troubleshooting.
Exposure to leaf-spine architecture
General network monitoring and observability experience
Must be strong in end-to-end production monitoring, alerting, and fault isolation; NetQ-equivalent experience is acceptable.
EVPN-VXLAN experience
Ability to troubleshoot performance issues across the traffic path
Latency, packet loss, congestion, MTU, retransmissions, ECMP pathing.
Understanding of modern data center architecture
Leaf-spine, ECMP, VLAN/VRF segmentation, east-west traffic patterns for AI/HPC-style environments
Familiarity with underlay/overlay concepts in VXLAN environments
Enough to support day-2 operations and trace traffic logically and physically from point A to point B
Foundational knowledge of RDMA/RoCE concepts
Not deep design expertise, but enough to understand AI/HPC traffic sensitivity and why congestion behavior matters
Working exposure to automation/config management
Ansible-like repeatable playbooks, structured execution, and basic scripting for network operations. This should be operational exposure, not deep Python engineering.
Ability to collaborate with DevOps teams
Can work within existing tooling, workflows, and change practices.
Experience working in an Agile/Kanban environment
Basic VPC-to-cloud awareness
Lowest-priority must-have, since it is not central to the role.

Nice to have:

Experience with NVIDIA NetQ or equivalent tools such as NMX
Helpful because NetQ aligns closely with the telemetry-heavy operating model used in Spectrum environments
Exposure to NVIDIA Spectrum switching platforms/ecosystem
Valuable, but teachable within a month for someone strong in core networking
In modern NVIDIA AI fabrics, telemetry and operations visibility are a core part of day-2 support, not an afterthought
Hybrid networking experience
Cloud interconnectivity, routing, and policy across on-prem and cloud.
Production EVPN-VXLAN experience
MP-BGP EVPN, VNIs, anycast gateway, multi-tenancy/VRFs, day-2 troubleshooting
Good structured automation practices
Templates, version control, peer review, rollback discipline.
Basic Python depth
Useful, but lower priority than network troubleshooting, Linux, and monitoring.
Kubernetes networking familiarity
CNI, overlays, service networking, and interaction with the underlay.
Host-side NIC tuning experience
Experience with high-performance or parallel storage networking
AI/HPC networking experience
Rail-optimized and dual-plane designs, Clos topologies, GPU clusters, east-west optimization.
Exposure to GPU-cluster communication patterns such as NCCL
Strong differentiator, not a gate.