

What's the opportunity
We are looking for a hands-on Principal Azure Platform & Cloud Operations Architect to assess and improve how we run critical production systems on Azure. You will evaluate our current Cloud Operations processes and platform architecture, identify automation and improvement opportunities, implement stronger operational patterns, and act as the escalation SME when the team hits technical roadblocks, especially across AKS, networking, and deployments.
What will you do
Assess and Improve Cloud Operations Processes: Review current operational workflows (provisioning, deployments, incident response, change management) and implement process corrections, automation, and operational guardrails to reduce manual effort and improve reliability.
Own AKS Platform Architecture and Operations: Lead the design and operational maturity of Azure Kubernetes Service (AKS) environments, including cluster topology, node pools, upgrades, scaling, resiliency patterns, ingress/egress, workload identity, secrets, and runtime security.
Lead Azure Networking Architecture and Troubleshooting: Provide deep expertise in Azure networking and connectivity patterns (VNet design, routing/UDRs, NSGs, DNS, private endpoints, firewalls, load balancers, gateways, and secure egress/ingress) and troubleshoot complex network and performance issues impacting production systems.
Deliver Hands-on Infrastructure as Code (IaC): Design and implement IaC using Terraform with reusable modules, clear lifecycle management, environment consistency, and safe change practices.
Advance GitOps and Deployment Standardization: Strengthen deployment maturity using Argo CD and Helm, improving repeatability, release confidence, environment promotion, and rollback strategies.
Improve CI/CD and Release Automation: Enhance CI/CD pipelines (Azure DevOps / Jenkins / GitHub Actions) to implement quality gates, validation, security scanning, and automated delivery patterns to production.
Implement Observability and Operational Readiness: Improve monitoring, logging, alerting, and dashboards using Azure Monitor, Log Analytics, and Application Insights to create actionable signals and reduce noise; promote production readiness practices (runbooks, readiness reviews, operational checklists).
Provide L3/L4 Escalation and Incident Leadership: Act as the technical escalation point for high-severity incidents, guiding triage and recovery, leading root cause analysis, and ensuring corrective/preventive actions are implemented through automation and platform improvements.
Coach and Unblock the Cloud Operations Team: Mentor engineers and provide hands-on guidance during complex technical challenges, raising overall capability and establishing consistent engineering standards.
Collaborate Across Teams: Work closely with Engineering, SRE, Security, and Delivery teams to align operational patterns, platform guardrails, and production readiness across services and environments.
What do you need to succeed
Nice to have:
Job ID: 144634439