Key Responsibilities
Incident Management & Troubleshooting
- Perform advanced fault isolation for escalated server and network incidents using iDRAC, Redfish, and Cisco CLI tools.
- Execute firmware, BIOS, and driver updates on Dell PowerEdge servers with minimal disruption.
- Conduct IOS/NX-OS firmware and software updates on Cisco routers and switches, adhering to change management processes.
- Handle hardware break/fix procedures, coordinating Dell support, parts, and on-site technicians.
- Conduct network health audits and performance analysis, and recommend optimization measures.
Monitoring & Reporting
- Collaborate with SRE teams to enhance monitoring dashboards and refine alert thresholds.
- Analyze performance metrics to proactively detect infrastructure instability or security events.
Mentorship & Collaboration
- Mentor L1 engineers through knowledge-transfer sessions and guidance on resolving complex tickets.
- Participate in blameless post-mortems to identify root causes and implement preventative actions.
- Support capacity planning with the FTE IT Team Lead, providing data-driven insights on infrastructure trends.
Documentation & Lifecycle Management
- Maintain and update operational runbooks, network diagrams, and technical documentation.
- Support hardware lifecycle management, including provisioning, asset tracking, and vendor coordination.
- Provide 24x7 on-call support for critical escalations.