Job Summary:
We are looking for a skilled Site Reliability Engineer (SRE) with strong experience in Microsoft Azure, Terraform (Infrastructure as Code), and Kubernetes (AKS) to ensure the reliability, scalability, and security of cloud-based platforms. The role focuses on building and operating resilient infrastructure, automating cloud environments, and maintaining high availability for production systems.
Key Responsibilities:
- Develop and maintain infrastructure as code using Terraform, enabling repeatable, secure, and auditable infrastructure deployments.
- Provision and manage Azure resources, including networking, identity, Kubernetes (AKS), storage, databases, and event streaming services.
- Deploy, configure, and support Kubernetes clusters, including RBAC, ingress, autoscaling, and secret management.
- Support and manage PostgreSQL and Azure Event Hubs as part of application infrastructure.
- Collaborate with developers, architects, and DevOps engineers to ensure infrastructure aligns with application and security requirements.
- Troubleshoot infrastructure and deployment issues in staging and production environments.
- Contribute to cloud governance, security best practices, and compliance automation.
Required Qualifications:
- Hands-on experience with Terraform and infrastructure-as-code methodologies.
- Strong expertise in Microsoft Azure, including:
  - Azure Kubernetes Service (AKS)
  - Networking (VNets, Load Balancers, Private Endpoints)
  - Azure IAM, identity federation, and managed identities
- Solid understanding of Kubernetes operations, including workload management and security controls.
- Experience operating cloud-native PostgreSQL and event-driven services such as Azure Event Hubs or Kafka.
- Proficiency in automation and scripting (e.g., Bash, PowerShell, Python).
- Familiarity with CI/CD pipelines and Git-based deployment workflows.