Job description
Incident & Problem Manager
KEY EXPECTED ACHIEVEMENTS:
Incident Management:
- Track and manage the status of major incidents, ensuring timely updates and communication to stakeholders.
- Minimize business impact by ensuring efficient incident resolution through coordination with the appropriate support teams.
- Monitor adherence to SLAs, ensuring incidents are resolved within agreed timelines.
- Provide clear and concise updates to senior leadership on the status and progress of major incidents.
Problem Management:
- Drive high-quality root cause analysis (RCA) to prevent recurrence of incidents.
- Ensure thorough documentation of problem records and RCAs, following industry best practices.
- Monitor and validate the implementation of corrective and preventive actions.
Process Improvement:
- Continuously assess and improve incident and problem management processes to enhance efficiency and effectiveness.
- Develop and implement best practices, leveraging the ITIL framework where applicable.
- Identify trends and patterns in incidents and problems and recommend proactive solutions.
Collaboration:
- Act as the primary point of contact for major incidents, coordinating with cross-functional teams and external partners.
- Collaborate with teams across different time zones to ensure seamless resolution of incidents.
- Foster strong relationships with internal and external stakeholders, including vendors and third-party support teams.
24x7 Incident Support:
- Ensure 24x7 coverage for critical incidents by coordinating with dedicated support teams.
- Establish and maintain an on-call schedule to address major incident escalations promptly.
Reporting and Metrics:
- Develop and present incident and problem management performance reports, highlighting trends and areas for improvement.
- Track and report on KPIs, including mean time to resolution (MTTR) and first-time fix rates.
Required Technical Skills:
- Strong knowledge of the ITIL framework (certification preferred).
- Proficiency in incident and problem management tools such as ServiceNow, Remedy, or similar platforms.
- Experience with root cause analysis techniques and tools.
- Familiarity with infrastructure technologies, including networking, servers, databases, and cloud environments.
- Knowledge of monitoring and alerting tools like Splunk, Dynatrace, or SolarWinds.
- Understanding of cybersecurity principles and their impact on incident resolution.
- Ability to analyze and interpret technical data to identify trends and patterns.
Availability:
- Flexibility to work 3-4 days from the office while managing cross-country collaboration remotely.
- Availability to oversee and coordinate 24x7 support for major incidents.