We are seeking a highly accomplished Principal Incident Commander / Director – Incident Management to lead enterprise-wide response to critical incidents across complex, large-scale, and globally distributed infrastructure environments.
This role operates at the intersection of technology leadership, crisis management, and business continuity, requiring the ability to make high-stakes decisions, influence senior stakeholders, and drive rapid resolution during mission-critical outages. The individual will serve as the ultimate authority during major incidents, ensuring minimal business disruption and long-term resilience.
Requirements
Strategic Responsibilities
- Own and lead enterprise-level incident management strategy across global operations
- Act as the executive Incident Commander for P0/P1 incidents impacting business-critical systems
- Establish and drive incident governance frameworks, SLAs, and response protocols
- Lead cross-functional crisis response involving Network, Cloud, Infrastructure, Security, and Field Operations
- Influence and align with C-suite and senior leadership during high-impact incidents
- Drive business continuity and service resilience initiatives
Operational Leadership
- Command and orchestrate war rooms and global bridge calls with multiple stakeholders
- Serve as the highest escalation point for critical outages and service disruptions
- Ensure rapid triage, containment, and resolution of incidents with minimal downtime
- Drive real-time decision-making under ambiguity and pressure
- Oversee post-incident reviews and enforce accountability across teams
Technical Expertise
- Deep expertise in enterprise networking and distributed systems:
  - BGP, OSPF, EIGRP, TCP/IP, QoS
  - WAN, SD-WAN, Data Center architectures (Spine-Leaf)
- Strong understanding of:
  - Load balancing, DNS, DHCP, Network Security
  - Latency, packet loss, and performance optimization
- Familiarity with cloud platforms and hybrid infrastructure environments
- Ability to engage in hands-on technical triage when required
- Lead Root Cause Analysis (RCA) at an organizational level
- Drive preventive engineering, automation, and process maturity
- Establish a culture of proactive monitoring and early detection
- Enhance incident response playbooks, runbooks, and training programs
Preferred Qualifications
- ITIL Expert / Advanced Incident Management certifications
- Exposure to Disaster Recovery (DR) & Business Continuity Planning (BCP)
- Experience with automation, observability platforms, and AI-driven monitoring
- Track record of driving transformation in incident management practices
- 5-7 years of experience in Network Engineering, SRE, NOC, or Cloud Operations
- Proven experience handling enterprise-scale, high-impact incidents globally
- Prior experience in large enterprises / telecom / hyperscalers / global tech organizations
- Strong leadership presence with the ability to influence without authority
- Experience working in 24x7, mission-critical environments