JD – Platform Engineer (SRE)
Experience: 10+ years in SRE / DevOps / Platform Engineering
Work Mode: Remote
Position Type: Contract
Job Summary
We are seeking a highly experienced Platform Engineer – Site Reliability Engineer (SRE) to design, build, and operate secure, scalable, and highly available cloud-native platforms on Microsoft Azure. The ideal candidate will have deep hands-on expertise in Azure Kubernetes Service (AKS), Infrastructure as Code, automation, and observability, with a strong focus on reliability and operational excellence. This role combines software engineering and infrastructure skills to ensure resilient platforms, frictionless deployments, and a strong developer experience.
Must-Have Skills
- 10+ years of experience in SRE / DevOps / Platform Engineering roles with production ownership.
- Strong hands-on experience with Microsoft Azure services (networking, compute, storage, security, identity).
- Expertise in Azure Kubernetes Service (AKS) – cluster design, administration, troubleshooting, and scaling.
- Strong experience with Docker and Kubernetes ecosystem tools (ingress controllers, Helm, etc.).
- Proficiency in Infrastructure as Code using Terraform (modules, state management, reusable patterns).
- Experience building and maintaining CI/CD pipelines (Azure DevOps preferred; GitHub Actions/Jenkins acceptable).
- Strong scripting/programming skills in Python, Go, and/or Bash for automation and tooling.
- Solid understanding of Linux fundamentals, networking concepts, and distributed systems.
- Proven experience in incident management, root cause analysis, and leading production support.
- Hands-on experience implementing observability (logs, metrics, traces, alerts) for platforms and services.
Good to Have Skills
- Experience designing and operating multi-region, highly available architectures on Azure.
- Knowledge of service mesh, ingress, and traffic management patterns in Kubernetes (e.g., Istio, NGINX, API gateways).
- Experience with GitOps practices and tools (e.g., Argo CD, Flux).
- Familiarity with security and compliance frameworks in enterprise environments.
- Azure certifications (e.g., AZ-104, AZ-400, AZ-305, or Kubernetes certifications like CKA/CKAD).
- Experience improving developer experience through self-service platforms, templates, and golden paths.
Roles & Responsibilities
Platform Engineering & Azure Infrastructure
- Design, build, and maintain secure, scalable, cloud-native platforms on Microsoft Azure for production workloads.
- Architect, provision, and manage Azure Kubernetes Service (AKS) clusters, including cluster upgrades, scaling, and capacity planning.
- Implement Infrastructure as Code (IaC) using Terraform to provision and manage Azure resources consistently and repeatably.
- Design and implement multi-region, highly available architectures leveraging native Azure services and best practices.
Kubernetes & Container Orchestration (AKS Focus)
- Deploy, manage, and optimize AKS clusters, ensuring reliability, performance, and cost efficiency.
- Enforce Kubernetes best practices including RBAC, network policies, pod security, resource limits/requests, and node management.
- Manage containerized workloads using Docker, including image build, optimization, and registry management.
- Troubleshoot cluster performance, networking, and workload issues across pods, services, and ingress/egress paths.
Reliability & Operations
- Lead incident response for platform-related issues, including on-call participation, triage, mitigation, and communication.
- Conduct post-incident reviews and drive corrective and preventive actions to improve platform reliability.
- Implement and continuously improve observability for platform components (metrics, logs, traces, dashboards, alerts).
- Ensure high availability, minimal downtime, and scalability for platform services through proactive capacity and reliability planning.
Automation & DevOps
- Build and maintain CI/CD pipelines for infrastructure and application deployments (preferably using Azure DevOps).
- Automate infrastructure provisioning, configuration, deployments, and operational runbooks to reduce manual effort and errors.
- Implement and evolve self-service platform capabilities for development teams (templates, pipelines, standardized environments).
- Work closely with development and security teams to integrate DevOps and SRE best practices across the lifecycle.
Security & Governance
- Implement Azure security best practices, including Managed Identities, Key Vault, RBAC, and network security controls.
- Secure AKS clusters using network policies, pod identity, secrets management, and least-privilege access controls.
- Ensure compliance with enterprise security, governance, and audit requirements across the platform.
- Collaborate with security and compliance teams to address vulnerabilities, harden configurations, and maintain secure baselines.