Lead NOC Engineer / Manager (Cloud & Infrastructure)

OpsXpress

India

6-8 Years

Posted 3 months ago

Job Description

Location: Preferred - Ahmedabad

Work Mode: Remote

Work Schedule: Flexible (focus on US business hours), leading a 24x7 team

About OpsXpress

OpsXpress is a dynamic operations support startup focused on delivering seamless reliability for our clients. We don't just watch screens; we solve problems. We bridge the gap between traditional IT operations and modern DevOps practices, ensuring our clients' businesses run without interruption.

Why Join OpsXpress

Build, Don't Just Maintain: You won't step into a stagnant role. You will build the processes, choose the tools, and shape the culture of our NOC from the ground up.
True Hybrid Exposure: You won't be limited to just one cloud. You will manage a complex, real-world landscape spanning AWS, Azure, Google Cloud, and On-Premise Data Centers.
Career Velocity: As a lead in a growing startup, your pathway to an Operations Manager or SRE role is clear and direct. We invest in upskilling our team to be engineers, not just ticket loggers.

Role Overview

We are looking for a hands-on NOC Lead to captain our 24x7 Network Operations Center. This is not a passive monitoring role. You will be the technical escalation point for critical incidents across our clients' AWS, Azure, GCP, and On-Premise data center environments.

Your primary mission is two-fold:

Ensure Stability: Guarantee 99.9% uptime by managing alerts and escalations and by taking immediate corrective action on infrastructure, network, and database issues.
Transform the Team: Upskill a team of L1/L2 monitoring engineers, moving them from Ticket Watchers to DevOps-aware operators who can use scripts and automation tools to solve problems.
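
To put the 99.9% uptime target in concrete terms, the short Python sketch below converts an availability percentage into a downtime allowance. The function name and the 30-day and 365-day measurement windows are illustrative assumptions; the posting does not say over which period uptime is measured.

```python
# Illustrative only: the function name and the measurement windows are
# assumptions, not something defined by the role description.
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = allowed_downtime_minutes(99.9, days)
        print(f"{label}: {minutes:.1f} minutes of downtime allowed")
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime in a 30-day month, which is why fast escalation and immediate corrective action matter for this role.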

Key Responsibilities

1. Operational Leadership (The Captain Role)

Shift Management - Manage the 24x7 shift roster, ensuring seamless handovers and zero gaps in monitoring coverage.
Incident Commander - Own the Incident Management process. When a P1/P2 alert fires, you are the commander — coordinating with the team, communicating with stakeholders, and driving the resolution.
Escalation Matrix - Refine the escalation matrix so that alerts are routed to the right teams (Engineering, DBA, Security) only when necessary, minimizing false alarms.
Root Cause Analysis (RCA) - Conduct RCAs for major outages and implement permanent fixes.

2. Hands-on Technical Execution (The Engineer Role)

Multi-Cloud Troubleshooting: Deep-dive into issues across AWS (EC2, RDS, VPC), Azure (VMs, VNet), and Google Cloud (Compute Engine, GKE).
On-Premise & Legacy: Support physical data center infrastructure. Troubleshoot bare-metal servers, hypervisors (VMware/Hyper-V), and physical storage appliances.
Database Health: Monitor database performance across the board—whether it's RDS in the cloud or a standalone MySQL/PostgreSQL server in the DC. Analyze locks, replication lag, and resource contention.
Networking: Debug connectivity issues involving VPNs, Firewalls, DNS, and Private Subnets. Validate IP whitelisting and security group rules during outages.
Tooling: Manage and tune monitoring tools (e.g., CloudWatch, Azure Monitor, Datadog, Prometheus, Grafana, New Relic) to reduce alert fatigue.

3. Team Transformation (The DevOps Shift)

Upskill the Team: Mentor junior NOC engineers on Linux basics, cloud fundamentals, and debugging logic.
Automation: Move the team away from manual fixes. Create and maintain Runbooks (SOPs) and scripts (Bash/Python) for common tasks (e.g., disk cleanup, service restarts).
DevOps Integration: Bridge the gap between NOC and DevOps. Teach the team how to read CI/CD pipeline failures and understand basic Infrastructure as Code (Terraform/CloudFormation) errors.
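
As an illustration of the kind of routine check the Automation point has in mind, here is a minimal Python sketch that reports disk usage against a threshold. The function names, the 85% threshold, and the message format are hypothetical choices for this example, not an OpsXpress standard; a real runbook script would send the result to whichever monitoring tool is in use.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def check_disk(path: str = "/", threshold_pct: float = 85.0, usage_pct=None):
    """Return (status, message) for a path.

    `usage_pct` can be passed in directly, which makes the check easy to
    test without filling a real disk. The 85% default is an assumption.
    """
    pct = disk_usage_percent(path) if usage_pct is None else usage_pct
    status = "ALERT" if pct >= threshold_pct else "OK"
    message = f"{status}: {path} is {pct:.1f}% full (threshold {threshold_pct:.0f}%)"
    return status, message


if __name__ == "__main__":
    _, message = check_disk("/")
    print(message)
```

A script like this only detects the condition; the matching cleanup or service-restart action would be documented in the runbook so that L1/L2 engineers can act on it without escalating.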

Required Skills & Qualifications

Experience:

6+ years in IT Infrastructure/NOC, with at least 2 years in a Lead or Senior role.
Proven experience managing or mentoring a team in a 24x7 environment.
Must be willing to work flexible hours with a focus on US Business Hours (EST/PST/CST as required).
Healthcare domain experience is a plus.

Technical Stack:

Cloud: Strong hands-on experience with AWS and Azure (must know: VPC/VNet, EC2/VM, S3/Blob, IAM/RBAC). GCP skills are desirable.
On-Premise: Experience with Data Center operations, including Linux/Windows Server administration and Virtualization (VMware).
OS:

- Linux - Strong Linux administration skills (cron, systemd, disk management, log analysis).

- Windows - Strong Windows administration skills (AD, GPO, IIS, etc.) and the ability to use Event Viewer and Resource Monitor to diagnose crashes.

Database: Experience monitoring PostgreSQL, MySQL, or MSSQL. Ability to write basic SQL queries to check health is a plus.
Networking: Solid understanding of TCP/IP, HTTP/S, DNS, SSL certificates, and VPNs. Strong grasp of hybrid networking, including subnetting and firewalls.
Scripting: Proficiency in Bash or Python to automate routine checks.
DevOps Tools: Good knowledge of CI/CD pipelines, Version Control (Git), and Infrastructure as Code (Terraform).

Soft Skills:

Crisis Management: Ability to remain calm and decisive during high-pressure outages.
Client-Facing: Ability to communicate complex technical incidents to non-technical client stakeholders clearly and calmly.
Ownership: Lone Wolf capability. You can troubleshoot an issue from the application layer down to the network cable if needed, without waiting for instructions.
Communication: Excellent written and verbal communication to draft incident reports for senior management.