Primary Skills: Azure Cloud, Kubernetes, Docker, CI/CD, Troubleshooting
Role Overview
We are looking for a highly motivated Site Reliability Engineer (SRE) with 5+ years of experience to ensure the reliability, scalability, and performance of cloud-based applications. The ideal candidate will have strong expertise in Microsoft Azure, containerization, and production support, along with a proactive ownership mindset.
Key Responsibilities
- Ensure high availability, reliability, and performance of production systems hosted on Microsoft Azure
- Manage and maintain containerized workloads using Docker and Kubernetes
- Design, implement, and optimize CI/CD pipelines using Azure DevOps or a similar platform
- Troubleshoot production issues, perform root cause analysis (RCA), and implement preventive measures
- Manage and support applications hosted on or served through:
  - Azure App Services
  - Virtual Machines (VM-hosted applications)
  - Azure Front Door (traffic routing, CDN, WAF configurations)
- Monitor system health using observability tools and proactively resolve incidents
- Implement automation to reduce manual intervention and improve system efficiency
- Ensure security, compliance, and best practices across cloud environments
- Collaborate with development and infrastructure teams for seamless deployments
- Take full ownership of services, ensuring SLA/SLO adherence
Required Skills & Qualifications
- 5+ years of experience in SRE / DevOps / Production Engineering roles
- Strong hands-on experience with Microsoft Azure services (App Services, VMs, Networking, Azure Front Door)
- Solid experience in containerization using Docker and orchestration using Kubernetes (AKS preferred)
- Experience in building and managing CI/CD pipelines (Azure DevOps preferred)
- Strong troubleshooting and debugging skills in distributed systems
- Experience with monitoring/logging tools (Azure Monitor, Log Analytics, Prometheus, Grafana, etc.)
- Good understanding of networking concepts (DNS, Load Balancers, CDN, WAF)
- Scripting knowledge (Bash, PowerShell, or Python)
Preferred Qualifications
- Experience with Infrastructure as Code (Terraform, ARM Templates, Bicep)
- Knowledge of Azure Workload Identity, RBAC, and security best practices
- Experience handling production incidents and participating in on-call rotations
- Exposure to performance tuning and cost optimization in cloud environments
Key Traits
- Strong ownership mindset and accountability
- Excellent problem-solving and analytical skills
- Ability to work in high-pressure production environments
- Effective communication and collaboration skills
Nice to Have
- Experience with multi-region deployments and disaster recovery
- Familiarity with microservices architecture
- Knowledge of SRE principles (SLI/SLO/SLA, error budgets, etc.)