We are seeking a highly accomplished Principal Incident Commander / Director – Incident Management to lead enterprise-wide response to critical incidents across complex, large-scale, and globally distributed infrastructure environments.
This role operates at the intersection of technology leadership, crisis management, and business continuity, requiring the ability to make high-stakes decisions, influence senior stakeholders, and drive rapid resolution during mission-critical outages. The individual will serve as the ultimate authority during major incidents, ensuring minimal business disruption and long-term resilience.
Requirements
Strategic Responsibilities
- Own and lead enterprise-level incident management strategy across global operations.
- Act as the executive Incident Commander for P0/P1 incidents impacting business-critical systems.
- Establish and drive incident governance frameworks, SLAs, and response protocols.
- Lead cross-functional crisis response involving Network, Cloud, Infrastructure, Security, and Field Operations.
- Influence and align with C-suite and senior leadership during high-impact incidents.
- Drive business continuity and service resilience initiatives.
Operational Leadership
- Command and orchestrate war rooms and global bridge calls with multiple stakeholders.
- Serve as the highest escalation point for critical outages and service disruptions.
- Ensure rapid triage, containment, and resolution of incidents with minimal downtime.
- Drive real-time decision-making under ambiguity and pressure.
- Oversee post-incident reviews and enforce accountability across teams.
Technical Expertise
- Deep expertise in enterprise networking and distributed systems:
  - BGP, OSPF, EIGRP, TCP/IP, QoS
  - WAN, SD-WAN, and data center architectures (spine-leaf)
- Strong understanding of:
  - Load balancing, DNS, DHCP, and network security
  - Latency, packet loss, and performance optimization
- Familiarity with cloud platforms and hybrid infrastructure environments
- Ability to engage in hands-on technical triage when required
- Lead Root Cause Analysis (RCA) at an organizational level
- Drive preventive engineering, automation, and process maturity
- Establish a culture of proactive monitoring and early detection
- Enhance incident response playbooks, runbooks, and training programs
Preferred Qualifications
- ITIL Expert / Advanced Incident Management certifications
- Exposure to Disaster Recovery (DR) & Business Continuity Planning (BCP)
- Experience with automation, observability platforms, and AI-driven monitoring
- Track record of driving transformation in incident management practices
- 12 to 18 or more years of experience in Network Engineering, SRE, NOC, or Cloud Operations
- Proven experience handling enterprise-scale, high-impact incidents globally
- Prior experience in large enterprises / telecom / hyperscalers / global tech organizations
- Strong leadership presence with the ability to influence without authority
- Experience working in 24x7, mission-critical environments
Benefits
- Health insurance coverage for employees and their families
- Retirement savings plan with employer matching contributions
- Opportunities for professional development and advancement within the organization