Zafin

Principal Architect - Azure Platform & Cloud Operations


Job Description

What's the opportunity

We are looking for a hands-on Principal Azure Platform & Cloud Operations Architect to assess and improve how we run critical production systems on Azure. You will evaluate our current Cloud Operations processes and platform architecture, identify automation and improvement opportunities, implement stronger operational patterns, and act as the escalation SME when the team hits technical roadblocks, especially across AKS, networking, and deployments.

What will you do

Assess and Improve Cloud Operations Processes: Review current operational workflows (provisioning, deployments, incident response, change management) and implement process corrections, automation opportunities, and operational guardrails to reduce manual effort and improve reliability.

Own AKS Platform Architecture and Operations: Lead the design and operational maturity of Azure Kubernetes Service (AKS) environments, including cluster topology, node pools, upgrades, scaling, resiliency patterns, ingress/egress, workload identity, secrets, and runtime security.

Lead Azure Networking Architecture and Troubleshooting: Provide deep expertise in Azure networking and connectivity patterns (VNet design, routing/UDRs, NSGs, DNS, private endpoints, firewalls, load balancers, gateways, and secure egress/ingress) and troubleshoot complex network and performance issues impacting production systems.

Deliver Hands-on Infrastructure as Code (IaC): Design and implement IaC using Terraform with reusable modules, clear lifecycle management, environment consistency, and safe change practices.

Advance GitOps and Deployment Standardization: Strengthen deployment maturity using Argo CD and Helm, improving repeatability, release confidence, environment promotion, and rollback strategies.

Improve CI/CD and Release Automation: Enhance CI/CD pipelines (Azure DevOps / Jenkins / GitHub Actions) to implement quality gates, validation, security scanning, and automated delivery patterns to production.

Implement Observability and Operational Readiness: Improve monitoring, logging, alerting, and dashboards using Azure Monitor, Log Analytics, and Application Insights to create actionable signals and reduce noise; promote production readiness practices (runbooks, readiness reviews, operational checklists).

Provide L3/L4 Escalation and Incident Leadership: Act as the technical escalation point for high-severity incidents, guiding triage and recovery, leading root cause analysis, and ensuring corrective/preventive actions are implemented through automation and platform improvements.

Coach and Unblock the Cloud Operations Team: Mentor engineers and provide hands-on guidance during complex technical challenges, raising overall capability and establishing consistent engineering standards.

Collaborate Across Teams: Work closely with Engineering, SRE, Security, and Delivery teams to align operational patterns, platform guardrails, and production readiness across services and environments.

What do you need to succeed

  • Bachelor's or Master's degree in Computer Science, Engineering, or related field.
  • 8+ years of hands-on experience in cloud platform engineering, DevOps/SRE, or cloud operations, with ownership of production-grade systems.
  • Strong hands-on experience with Azure, particularly Azure Kubernetes Service (AKS), and deep experience running Kubernetes in production (upgrades, scaling, failure modes, troubleshooting).
  • Deep expertise in Azure networking and secure connectivity patterns, with the ability to diagnose complex multi-layer issues across AKS + network + application boundaries.
  • Proven hands-on experience implementing IaC with Terraform (modules, state strategy, environment consistency, safe rollout practices).
  • Strong experience with GitOps and deployment tooling, including Argo CD and Helm, and strong understanding of release strategies and operational controls.
  • Proficiency in managing CI/CD pipelines and automation (Azure Pipelines, Jenkins, GitHub Actions) and improving deployment reliability through automated checks and gates.
  • Hands-on experience with Azure observability tooling (Azure Monitor, Log Analytics, Application Insights) to improve service health visibility and incident response effectiveness.
  • Proficiency in scripting/automation with Python and/or Bash/PowerShell to build operational tooling and reduce repetitive manual work.
  • Strong problem-solving and communication skills, with the ability to operate calmly under pressure and guide teams through critical production incidents.

Nice to have:

  • Microsoft certifications (e.g., Azure Solutions Architect Expert) and/or Kubernetes certifications (CKA/CKS).
  • Experience building golden path templates/self-service workflows; exposure to SRE practices; exposure to Azure cost optimization/FinOps.

Job ID: 144634439