Job Title: Site Reliability Engineer (SRE) – Azure Cloud
Experience: 8+ Years
Work Model: 24×7 Shift-based Operations (Rotational shifts, weekends & on-call support)
Location: Remote (first week at the Kochi/Tvm office for training)
Salary: As per industry standards
Key Responsibilities
Azure Infrastructure Management (Mandatory)
- Manage and support Microsoft Azure infrastructure, ensuring high availability, scalability, and security.
- Administer Azure services including:
  - Virtual Machines (Windows & Linux)
  - Virtual Networks (VNets), Subnets, NSGs, UDRs
  - Load Balancers, Application Gateways
  - Azure Firewall, VPN Gateways, ExpressRoute
- Support Microsoft Entra ID (formerly Azure Active Directory), including RBAC, identity, and access management.
- Manage Azure Storage services (Blob, File, Disk, Queue, Table).
- Provide L1/L2 support for Azure PaaS services such as App Services, Azure SQL, Managed Instances, and AKS.
- Perform capacity planning, performance tuning, and cost optimization.
Monitoring & Observability (Mandatory – Datadog)
- Perform real-time monitoring using:
  - Datadog (Mandatory)
  - Azure Monitor, Log Analytics, Application Insights
- Configure alerts, dashboards, and proactive monitoring strategies.
- Identify system anomalies and ensure rapid incident response.
Networking & Firewall Management (Mandatory)
- Troubleshoot and manage:
  - TCP/IP, DNS, routing, VPNs
  - WAN/LAN connectivity issues
- Administer enterprise firewalls such as:
  - Fortinet / FortiGate (preferred)
- Configure:
  - Site-to-site and client VPNs
  - Firewall policies and routing rules
- Collaborate with network teams to ensure secure and stable connectivity.
System Administration & Support
- Manage Windows Server environments, including Active Directory, RDS, and file and print services.
- Perform OS patching, system maintenance, backups, and recovery.
- Provide remote support for customer environments.
Incident Management & Customer Support
- Take end-to-end ownership of incidents from detection to resolution.
- Perform root cause analysis (RCA) and implement preventive measures.
- Handle escalations and provide timely updates to stakeholders.
- Support 24×7 operations, including on-call responsibilities.
- Follow ITIL processes for incident, problem, and change management.
Documentation & Collaboration
- Maintain accurate documentation in ticketing systems.
- Create and update runbooks, SOPs, and knowledge articles.
- Collaborate with cross-functional teams to improve reliability and efficiency.
Mandatory Skills
- Strong hands-on experience in Microsoft Azure Infrastructure
- Proven experience with Datadog monitoring and observability
- Strong networking fundamentals (TCP/IP, DNS, VPN, Firewalls)
- Experience with enterprise firewall technologies (Fortinet/Cisco/Palo Alto)
- Excellent communication and customer handling skills
- Strong troubleshooting and analytical abilities
Required Qualifications
- 8+ years of experience in Cloud / Infrastructure / SRE / System Administration
- Hands-on experience in Azure cloud support and operations
- Experience supporting Windows Server environments
- Strong exposure to incident management and production support