Job Description: L2 Enterprise Monitoring Engineer
Role Overview
The L2 Enterprise Monitoring Engineer is responsible for advanced monitoring, incident analysis, and troubleshooting across infrastructure, applications, and network layers. This role acts as the primary resolver group for monitoring-triggered incidents and plays a key role in reducing alert noise, improving monitoring effectiveness, and driving faster resolution.
L2 engineers are expected to go beyond SOPs: analyze, fix, and improve.
Key Responsibilities
Advanced Monitoring & Event Analysis
- Perform deep analysis of alerts generated from enterprise monitoring tools (SolarWinds, SCOM, Dynatrace, etc.)
- Correlate multiple alerts/events to identify underlying issues (avoid symptom-based handling)
- Fine-tune alert thresholds and suppress false positives
- Identify gaps in monitoring coverage and recommend improvements
Incident Troubleshooting & Resolution
- Take ownership of P2/P3 incidents and support P1 incidents (Major Incidents)
- Perform detailed troubleshooting across:
  - Servers (Windows/Linux)
  - Network (connectivity, latency, packet loss)
  - Applications (availability, performance)
- Execute standard fixes, workarounds, and recovery actions
- Engage L3/OEM vendors when required with proper diagnostics
Major Incident Support (MIM)
- Support Major Incident calls by providing technical insights and updates
- Perform real-time troubleshooting and log analysis during outages
- Ensure quick identification of the root cause or a workaround
- Provide inputs for incident timelines and updates
Automation & Monitoring Optimization
- Create and enhance monitoring scripts, thresholds, and alert logic
- Automate repetitive tasks using scripting (PowerShell, Shell, or Python at a basic level)
- Drive reduction in alert noise and manual effort
- Contribute to continuous improvement initiatives
Knowledge Management & Documentation
- Create and update Knowledge Base (KB) articles and runbooks
- Document known errors and workarounds
- Ensure troubleshooting steps are reusable by the L1 team
Collaboration & Escalation
- Act as the technical escalation point for the L1 team
- Guide L1 analysts on triage and handling improvements
- Coordinate with cross-functional teams (Infra, App, Network, Cloud)
- Ensure proper escalation to L3 with complete diagnostics
Shift & Operations
- Participate in 24x7 rotational shifts (including weekends/on-call if applicable)
- Ensure high-quality shift handovers with actionable insights
Required Skills & Qualifications
Technical Skills (Core Expectation)
- Strong hands-on experience in:
  - Windows & Linux server administration
  - Network fundamentals (DNS, TCP/IP, routing basics)
  - Application monitoring concepts (APM tools like Dynatrace/AppDynamics preferred)
- Strong working knowledge of monitoring tools:
  - SCOM / SolarWinds / Dynatrace / Nagios / Zabbix
- Log analysis skills (Event Viewer, syslogs, basic Splunk/Kibana exposure preferred)
- Basic scripting skills:
  - PowerShell / Bash / Python (any one)
Process & Frameworks
- Strong understanding of ITIL:
  - Incident Management
  - Event Management
  - Problem Management (basic involvement)
Soft Skills (Non-Negotiable)
- Strong communication: clear, structured, and confident (especially with US stakeholders)
- Analytical thinking (must move beyond checklist-based work)
- Ownership mindset: drives issues to closure
- Ability to work under pressure during incidents
Experience & Education
- 3 to 5 years of experience in monitoring / infrastructure support / NOC
- Bachelor's degree in IT / Computer Science or related field
- ITIL Foundation (preferred)
- Relevant certifications (Azure/AWS/monitoring tools) are good to have
Qualifications: Graduation
Minimum experience (years): 3
Maximum experience (years): 5