We are seeking a highly motivated Site Reliability Engineer (SRE) with 7 to 12 years of experience to manage, scale, and ensure the high availability of our core infrastructure. The role requires deep expertise in cloud services, automation, monitoring, and complex networking to support a high-volume, mission-critical environment.
Key Responsibilities
- Cloud & Infrastructure: Configure, maintain, and manage services and packages on Ubuntu Virtual Machines in Azure. Design and manage Azure components for log storage, management, alerting, and monitoring.
- Networking & Connectivity: Configure and maintain complex network components including Azure Firewall, Route Tables, Virtual Network Gateways, and ExpressRoute. Establish and manage IPsec and ExpressRoute connectivity with external environments. Manage routing, troubleshoot connectivity issues, and support network component migrations with minimal downtime.
- Automation & IaC: Drive automation for all business-as-usual (BAU) tasks using Terraform, SaltStack, Ansible, and scripting languages. Write new Terraform code for infrastructure components.
- Database & Data Management: Set up and manage high-availability services like MySQL and Aerospike. Implement database replication across regions, manage migrations, and ensure data sync. Handle backups of databases, logs, and configurations.
- Monitoring & Observability: Implement and manage monitoring (e.g., Prometheus, VictoriaMetrics, Riemann) and centralized logging (Loki) solutions, with visualization in Grafana. Troubleshoot performance and system issues at the OS, platform, or application level.
- Security & Compliance: Manage firewalls and integrate platform and VM-level services with the Security Operations Center (SOC). Collaborate with Infosec teams to evaluate and fix security vulnerabilities.
- Capacity & Performance: Conduct proactive capacity planning. Manage critical infrastructure components like Nginx, HAProxy, Docker, and RabbitMQ (RMQ).
- Incident Management & DR: Participate in an on-call rotation. Structure and lead incident response, Root Cause Analysis (RCA), and post-mortem creation. Set up disaster recovery (DR) sites and support the planning and execution of failovers.
Required Technical Expertise
- Core Services: Deep, hands-on experience with Microsoft Azure components, including Virtual Machines (Ubuntu/Linux), Azure Storage Accounts, Azure Cosmos DB, and Azure Data Explorer (ADX).
- Networking: Expert-level knowledge in configuring and managing complex Azure networking components: Azure Firewall, Azure Route Tables, Virtual Network Gateways, Azure ExpressRoute, and Azure Private DNS. Must be proficient in setting up and troubleshooting routing with on-premises data centers using protocols like BGP, and in managing network component migrations with minimal downtime.
- Security/Compliance: Experience integrating platform and VM-level services with the Security Operations Center (SOC) and collaborating with Infosec teams on vulnerability evaluation and remediation.
- Operating Systems & Scripting:
- OS: Expert proficiency in Linux environments, specifically Ubuntu, for system administration, service configuration, and performance troubleshooting at the OS level.
- High-Level Language: Deep expertise in at least one high-level language (Python, Go, or Java) for writing automation, services, and tooling.
- Shell Scripting: Mastery of shell scripting (Bash) is essential for day-to-day operational tasks and automation.
- Monitoring, Observability & Logging:
- Monitoring: Extensive experience implementing and maintaining modern monitoring systems such as Prometheus, VictoriaMetrics, and Riemann.
- Logging: Proficiency with centralized log management using Loki for log ingestion, enrichment, lifecycle management, and providing a search/view platform.
- Visualization: Expertise in creating and managing dashboards for visualization and alerting using Grafana.
- Configuration Management & IaC (Infrastructure as Code):
- IaC: Mastery of Terraform for writing new component configurations and building automation for BAU (Business As Usual) tasks.
- Configuration Management: Strong experience with configuration management tools like SaltStack (or Ansible) for automated deployment and configuration of services on VMs.
- Databases & Data Stores:
- High-Availability Data Stores: Hands-on experience setting up, managing, and scaling high-availability databases like MySQL and Aerospike.
- Time-Series/Search: Familiarity with Elasticsearch and time-series databases like InfluxDB.
- Replication/DR: Expertise in database replication between different regions, managing database migrations, setting up circular replication, and ensuring data sync during system and network issues.
- Core Infrastructure Services:
- Web/Proxy: Expert management of critical infrastructure components like Nginx and HAProxy, including proxy management, endpoint addition, header configuration, and writing rewrite rules.
- Messaging/Container: Experience with messaging queues like RMQ (RabbitMQ) and containerization technology like Docker.
- Networking Services: Deep knowledge of DNS and other core network protocols.
Essential Soft Skills & Qualifications
- Ownership and Accountability: A proactive approach to identifying and solving infrastructure challenges before they impact service availability.
- Communication: Excellent written and verbal skills for documenting procedures, creating runbooks, and communicating with technical and non-technical stakeholders.
- Mentorship: (For senior roles) Ability to mentor junior engineers and promote SRE best practices across the organization.
- SLO/SLA Management: Experience defining and tracking Service Level Indicators (SLIs), and defining, monitoring, and meeting Service Level Objectives (SLOs) for critical services.
- Toil Reduction: A commitment to measuring and actively reducing operational toil through automation (e.g., by applying SRE toil-reduction practices).
- Cost Optimization: Experience identifying and implementing cloud resource optimization and cost-saving measures within the Azure environment.