We are looking for a Senior Site Reliability Engineer (SRE) with expertise in Kubernetes, AWS, Terraform, and cloud infrastructure automation. In this role, you will work closely with development teams to improve the reliability, scalability, and operational efficiency of cloud-based systems. The ideal candidate has strong experience in cloud infrastructure management, incident resolution, and performance optimization.
Responsibilities:
- Manage and optimize Kubernetes clusters hosted in AWS.
- Use Infrastructure as Code (IaC) tools such as Terraform and Ansible to provision and manage Kubernetes resources on AWS and Azure.
- Collaborate with development teams throughout the software lifecycle, promoting SRE best practices for reliability and scalability.
- Troubleshoot priority incidents, facilitate blameless post-mortems, and drive permanent resolutions.
- Analyze incident trends and usage patterns to implement proactive solutions.
- Design and implement self-healing and resiliency patterns to improve system uptime.
- Automate software upgrades, change management, and release management processes.
- Develop custom automation scripts and tooling to replace manual operational work.
- Optimize infrastructure performance and costs, ensuring efficient cloud resource utilization.
- Participate in on-call rotations for tooling support and, when required, 24x7 incident response.
Required Skills & Experience:
- 7+ years of professional experience in cloud infrastructure, SRE, or DevOps roles.
- Bachelor's degree in Information Systems, Information Technology, Computer Science, or a related field.
- Hands-on experience administering Kubernetes clusters in production environments.
- Expertise in Infrastructure as Code (IaC) tools, specifically Terraform.
- Proven production operations experience in a cloud environment (AWS or Azure).
- Experience contributing to technology and product strategy for cloud infrastructure.
- Strong background in automation, observability, incident management, and high-availability (HA) architecture.
- Ability to drive operational efficiency, transparency, and optimization in a growing engineering environment.
- Experience with cost and performance optimization for cloud infrastructure.