Why Join Us
The NOC Team Leader will oversee 24/7 infrastructure, application services, and production systems, driving high availability, alert response, batch job monitoring, and cross-team collaboration for uninterrupted SaaS/web services. This senior role demands strategic leadership in a fast-paced, multi-shift environment, with an emphasis on troubleshooting, metrics reporting, and process optimization.
Key Responsibilities
- Team Leadership & Management
- Lead, mentor, and support a team of NOC engineers across all shifts, guiding them in monitoring production systems, applications, batch jobs, and diagnostic tools.
- Set priorities, distribute tasks, ensure proper workload balance, and track issues through first-level analysis to closure with IT teams.
- Drive professional development through training, coaching, ongoing feedback, and contributions to knowledge base articles, process documents, and playbooks.
- Conduct periodic 1:1 meetings, performance evaluations, and goal-setting.
- Recruit, onboard, and integrate new NOC engineers into the team.
- Build and maintain a culture of accountability, high performance, service quality, and proactive collaboration with Applications, Systems, Database, and Network teams.
- Operational Oversight
- Own the day-to-day operations of the entire NOC function, ensuring consistent monitoring, alert handling, batch job troubleshooting, operational routine execution, and impact assessment on production schedules.
- Ensure all teams consistently follow predefined procedures, escalation paths, runbooks, and change management for production environments, equipment, OS, applications, and databases.
- Validate and improve health checks, monitoring dashboards (e.g., LogicMonitor), operational KPIs, and performance metrics reports (daily/weekly/monthly).
- Oversee shift handovers, ensuring accuracy, clarity, and continuity of operations.
- Incident Management
- Serve as the primary incident coordinator for major incidents (P1/P2): oversee response efforts across shifts; perform triage, prioritization, and mitigation; and collaborate with or escalate to Support groups, Service Owners, Vendors, and Third Party Providers.
- Ensure the team carries out correct triage, prioritization, and mitigation actions, using Incident and Problem Management tools.
- Coordinate escalation to Tier 2/3, Infrastructure, Security, and relevant stakeholders.
- Lead post-incident reviews, ensuring documentation, root cause analysis, follow-up action items, and optimization of application performance and batch streams.
- Service Quality & Continuous Improvement
- Monitor team performance, SLAs, KPIs, and production metrics; ensure targets are met or exceeded through proactive work with teams to optimize application/batch performance.
- Identify recurring issues, monitoring gaps, and operational inefficiencies, and drive improvement initiatives, including updates to NOC processes, SOPs, runbooks, and documentation.
- Collaborate with cross-functional teams (Infrastructure, Networking, Security, DevOps, Applications, Database, Systems) to enhance system reliability and monitoring coverage, manage production changes, and improve job streams.
- Proactively recommend improvements to monitoring, alerting, automation, NOC workflows, and application/web server technologies.
- Communication & Reporting
- Provide clear and consistent communication to management regarding incidents, trends, risks, and operational status, using excellent oral and written English skills.
- Deliver daily/weekly/monthly operational reports, including incident summaries, performance metrics, and team insights.
- Represent the NOC function in internal meetings, service reviews, and cross-team coordination sessions, addressing conflicts constructively.
Qualifications
- Bachelor's Degree in Computer Science, Information Systems, IT, Electrical Engineering, or related field; Master's preferred; or equivalent work experience.
- Certifications: ITIL (v3/4), CCNA, CISSP, PMP, or Agile.
- Proven experience leading or managing technical teams in a NOC, Operations, or Monitoring environment (minimum 3 years leading teams in 24x7 SaaS/web production settings).
- 10+ years of extensive NOC experience with various systems.
- Strong troubleshooting expertise across network, system, cloud, and application stacks, as well as database management, batch jobs, and production schedules.
- Experience with Linux system administration (logs, services, resource usage, shell scripting/command line) and Windows Server fundamentals.
- Familiarity with cloud platforms (AWS, Azure, GCP) and cloud monitoring concepts.
- Hands-on experience with monitoring/alerting platforms (Icinga, Prometheus, Grafana, PagerDuty, LogicMonitor, or equivalent) and application/web servers (Apache Tomcat, IIS).
- Ability to interpret logs, alerts, metrics, and telemetry data, and to guide team troubleshooting.
- Experience with ticketing/incident/problem management tools (Jira, ServiceNow).
- Excellent communication skills (high proficiency in written and verbal English), high situational awareness, calm decision-making under pressure, time management, organizational skills, and the ability to handle multiple tasks with minimal supervision.
- Ability to work flexible schedules across shifts.
Good To Have
- Knowledge of key network protocols (TCP/IP, UDP, DNS, HTTP/S, SSH, BGP fundamentals, FTP) and utilities (Telnet, curl).
- Understanding of VPNs, firewalls, load balancers, proxies, and general IT infrastructure.