Description
You must have a strong grasp of how key services interact across domains, from payment reconciliation to identity verification to customer communications. You anticipate and mitigate systemic risks, lead readiness reviews, and help shape platform roadmaps that balance scale, reliability, and speed. You also collaborate closely with security, IT, sysadmins, and network teams on strategic initiatives such as multi-region availability, compliance automation, VPN routing, SOC 2 readiness, data loss prevention, and endpoint security.
What You'll Do
- Lead the design and implementation of scalable, secure, and cost-efficient cloud infrastructure.
- Own the architecture and execution of high-impact platform initiatives (e.g., CD migration, zero-downtime deployments, logging architecture).
- Collaborate with Security, IT, and Infrastructure teams to define and implement organization-wide access, audit, and reliability standards.
- Proactively identify technical debt, scalability concerns, and risk across multiple systems and services.
- Guide platform investment decisions and architecture strategy in partnership with engineering and product leadership.
- Mentor engineers and establish best practices in observability, incident response, and infrastructure evolution.
Requirements
- Deep expertise in a major cloud provider such as AWS and in Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, or CDK.
- Extensive hands-on experience with containerization and orchestration technologies, including Docker, Kubernetes (and managed distributions such as EKS), and Helm.
- Strong proficiency in a programming language such as Python or Go for building robust automation and tooling.
- Deep expertise in designing and managing CI/CD pipelines, including modern practices like GitOps using tools such as ArgoCD or Flux.
- Expert-level knowledge of Linux/UNIX systems administration, version control (Git), and system internals.
- Advanced understanding of networking (VPC design, DNS, routing, service mesh) and cloud security (IAM, threat modeling, compliance automation).
- Deep understanding of observability, monitoring, and alerting using tools such as the Prometheus stack (Prometheus, Grafana, Loki) and/or commercial platforms like Datadog.
- Proven experience leading infrastructure strategy and operating high-availability production systems.
- Strong leadership, cross-functional influence, and a business-first mindset when making technical trade-offs.
(ref:hirist.tech)