Own end-to-end incident management: Lead detection, triage, impact assessment, prioritization, and resolution of production incidents within agreed SLAs and OLAs.
Coordinate major incident handling: Act as the primary point of contact during high-severity incidents, driving technical bridges/war rooms and ensuring timely stakeholder communication.
ITIL/ITSM process execution: Apply ITIL-aligned practices for incident, problem, and change management, ensuring adherence to organizational standards and governance.
Root cause analysis: Perform thorough post-incident reviews, document root causes, and define corrective and preventive actions to avoid recurrence.
Service stability and continuous improvement: Identify recurring issues and operational gaps, propose process and tooling improvements, and contribute to reliability and performance enhancements.
Collaboration with engineering and operations: Work closely with development, infrastructure, and QA teams to understand system behavior, dependencies, and release impacts on production.
Monitoring and alert optimization: Review alerts, refine thresholds, and help optimize monitoring dashboards to reduce noise and improve early detection of issues.
Knowledge management: Create and maintain runbooks, standard operating procedures, and knowledge base articles to improve first-time-right resolutions and reduce MTTR.
Risk and change assessment: Participate in change advisory processes, assess production risks, and ensure appropriate validations and rollback plans are in place.
Mentoring and guidance: Support junior team members in incident handling best practices, communication, and adherence to ITSM processes.
Minimum Qualifications:
Education: Bachelor's degree in Engineering (B.Tech or equivalent preferred) in Computer Science, Information Technology, or a related field.
Experience: 8-15 years of hands-on experience in production support and incident management in enterprise or large-scale environments.
Good to have skills:
ServiceNow, BMC Remedy, Problem Management, Change Management, Monitoring and Alerting Tools
Project Management fundamentals
Knowledge of project lifecycles for development and maintenance projects, estimation methodologies, and quality processes.
Knowledge of one or more programming languages, architecture frameworks, and design principles; ability to comprehend and manage technology and performance engineering.
Domain: Basic domain knowledge to understand the business requirements and functionality.
Ability to perform project planning and scheduling, manage tasks and coordinate project resources to meet objectives and timelines
Ability to work with business and technology subject matter experts to assess requirements, define scope, create estimates, and produce project charters
Good understanding of SDLC and Agile methodologies is a prerequisite
Awareness of the latest technologies and trends
Logical thinking and problem-solving skills, along with an ability to collaborate