The ideal candidate will build, operate, and scale reliable infrastructure platforms. They will work closely with engineering teams to design and implement secure, automated, and highly available systems that let those teams ship high-quality software efficiently.
Responsibilities:
- Design, build, and operate scalable, secure, and highly available infrastructure and internal platforms.
- Own the platform end-to-end, including architecture, automation, reliability, performance, and cost optimization.
- Collaborate with software engineering teams to enable smooth CI/CD workflows and developer self-service.
- Implement Infrastructure as Code (IaC) to provision and manage cloud and on-prem infrastructure.
- Build and maintain CI/CD pipelines to support fast and reliable deployments.
- Manage containerized workloads and orchestration platforms (Docker, Kubernetes).
- Monitor systems, troubleshoot complex production issues, and lead root cause analysis for incidents.
- Implement observability, logging, alerting, and incident response best practices.
- Drive security, compliance, and best practices across infrastructure and deployment workflows.
- Continuously improve the platform by identifying bottlenecks, reducing toil, and improving developer productivity.
Requirements:
- Strong ownership mindset with the ability to work in a fast-paced, high-availability environment.
- 3-5+ years of experience in Systems Engineering, DevOps, or Platform Engineering roles.
- Deep expertise in Linux systems, networking, and troubleshooting at scale.
- Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, or equivalent).
- Hands-on experience with CI/CD tools (GitHub Actions, GitLab CI, Jenkins, etc.).
- Solid understanding of containerization and orchestration technologies (Docker, Kubernetes).
- Experience working with at least one major cloud provider (AWS / GCP / Azure).
- Proficiency in scripting or programming (Bash, Python, or Go preferred).
- Strong communication skills and ability to collaborate with cross-functional teams.
Qualifications:
- Bachelor's degree in Computer Science or equivalent practical experience.
- Experience supporting production systems with high availability and reliability requirements.
- Familiarity with monitoring and observability tools (Prometheus, Grafana, ELK, OpenTelemetry).
- Exposure to security best practices, IAM, secrets management, and incident management.
- Experience working in environments following SRE or DevOps best practices is a plus.