We are hiring a talented Network Engineer – AI Data Center Networking to join our team. If you're excited to be part of a winning team, Cirrus Labs (http://www.cirruslabs.io) is a great place to grow your career.
Experience: 3+ years
Location: Bangalore or Hyderabad preferred
Shift Time: 2 PM to 11 PM IST
Position Summary
We are seeking a mid-level Network Engineer to support and operate high-performance data center networking environments used for AI and other performance-sensitive workloads. This role is suited to someone with strong hands-on experience in data center or enterprise network operations, solid L2/L3 and TCP/IP fundamentals, and proven ability to troubleshoot across physical, network, and transport layers.
The engineer will be responsible for supporting modern data center networking environments built around high-speed switching and routing, leaf-spine architectures, ECMP, VLAN/VRF segmentation, VXLAN overlay/underlay models, and hybrid connectivity to cloud VPC environments. The role requires a strong operational mindset with a focus on stability, uptime, observability, and performance troubleshooting across on-prem infrastructure connecting compute, storage, and cloud environments.
This is not a DevOps or platform engineering role. However, the engineer should be comfortable using network-focused automation tools and basic scripting to improve repeatability and reduce manual effort, while working closely with DevOps teams to integrate network automation into broader workflows.
Key Responsibilities
- Operate and support data center network infrastructure, including high-speed switching, routing, optics, cabling, and transceivers in on-prem environments.
- Manage IP addressing, VLANs, VRFs, and segmentation constructs to support secure and scalable network operations.
- Support modern data center architectures including leaf-spine fabrics, ECMP-based forwarding, and VXLAN overlay/underlay environments, including day-2 operations.
- Execute network changes, maintenance activities, upgrades, and validation tasks with a strong focus on availability, reliability, and operational discipline.
- Troubleshoot L1–L3 issues impacting network performance, including optics/cabling faults, interface errors, MTU mismatches, routing behavior, packet loss, congestion, retransmissions, and ECMP pathing.
- Diagnose latency, throughput, drop, and east-west traffic performance issues in high-performance environments.
- Use Linux CLI, logs, and common network troubleshooting tools to investigate incidents and validate infrastructure behavior.
- Partner with compute, storage, application, and platform teams to isolate whether bottlenecks originate from the network, host, or application layer.
- Develop and maintain network automation artifacts such as templates, playbooks, and scripts to improve consistency and reduce manual tasks.
- Work with DevOps teams to integrate network automation into existing deployment and operational workflows without owning the DevOps platform.
- Support network observability and telemetry, including streaming telemetry, flow data, and performance-focused monitoring. The engineer should be very familiar with end-to-end monitoring and alerting for production networks. This environment uses NetQ, but experience with an equivalent tool is acceptable.
- Contribute to continuous improvement of network operations through standardization, documentation, and operational best practices.
Must-have (Required Qualifications)
- 3–5+ years in data center or enterprise network operations/support
- Real production support experience is required.
- CCNA-level knowledge or equivalent hands-on networking foundation
- Enough to be productive quickly without needing to be taught basic networking.
- Strong L2/L3 and TCP/IP fundamentals
- Proven troubleshooting across L1–L3: physical, switching, routing, forwarding.
- Experience supporting on-prem physical infrastructure
- Switches, optics, cabling, transceivers, port health, physical fault isolation.
- Strong Linux operational comfort
- CLI, logs, interfaces, routing tables, packet tools, basic host/network troubleshooting.
- Exposure to leaf-spine architecture
- General network monitoring and observability experience
- Must be strong in end-to-end production monitoring, alerting, and fault isolation; NetQ-equivalent experience is acceptable.
- EVPN-VXLAN experience
- Ability to troubleshoot performance issues across the traffic path
- Latency, packet loss, congestion, MTU, retransmissions, ECMP pathing.
- Understanding of modern data center architecture
- Leaf-spine, ECMP, VLAN/VRF segmentation, east-west traffic patterns for AI/HPC-style environments
- Familiarity with underlay/overlay concepts in VXLAN environments
- Enough to support day-2 operations and trace traffic logically and physically from point A to point B
- Foundational knowledge of RDMA/RoCE concepts
- Not deep design expertise, but enough to understand AI/HPC traffic sensitivity and why congestion behavior matters
- Working exposure to automation/config management
- Ansible-like repeatable playbooks, structured execution, and basic scripting for network operations. This should be operational exposure, not deep Python engineering.
- Ability to collaborate with DevOps teams
- Can work within existing tooling, workflows, and change practices.
- Experience working in an Agile/Kanban environment
- Basic VPC-to-cloud awareness
- Lowest-priority must-have, since it is not central to the role.
Nice to have:
- Experience with NVIDIA NetQ or equivalent tools such as NMX
- Helpful because NetQ aligns closely to the telemetry-heavy operating model used in Spectrum environments
- Exposure to NVIDIA Spectrum switching platforms/ecosystem
- Valuable, but teachable within a month for someone strong in core networking
- In modern NVIDIA AI fabrics, telemetry and operations visibility are a core part of day-2 support, not an afterthought
- Hybrid networking experience
- Cloud interconnectivity, routing, and policy across on-prem and cloud.
- Production EVPN-VXLAN experience
- MP-BGP EVPN, VNIs, anycast gateway, multi-tenancy/VRFs, day-2 troubleshooting
- Good structured automation practices
- Templates, version control, peer review, rollback discipline.
- Basic Python proficiency
- Useful, but lower priority than network troubleshooting, Linux, and monitoring.
- Kubernetes networking familiarity
- CNI, overlays, service networking, and interaction with the underlay.
- Host-side NIC tuning experience
- Experience with high-performance or parallel storage networking
- AI/HPC networking experience
- Rail-optimized and dual-plane designs, Clos topologies, GPU clusters, east-west optimization.
- Exposure to GPU-cluster communication patterns such as NCCL
- Strong differentiator, not a gate.