About tsworks:
tsworks is a leading technology innovator, providing transformative products and services designed for the digital-first world. Our mission is to provide domain expertise, innovative solutions and thought leadership to drive exceptional user and customer experiences. Demonstrating this commitment, we have a proven track record of championing digital transformation for industries such as Banking, Travel and Hospitality, and Retail (including e-commerce and omnichannel), as well as Distribution and Supply Chain, delivering impactful solutions that drive efficiency and growth. We take pride in fostering a workplace where your skills, ideas, and attitude shape meaningful customer engagements.
About the Team:
We are looking for an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our team and play a key role in ensuring the high availability, scalability, and reliability of our infrastructure. The ideal candidate will have 7+ years of experience in site reliability engineering, cloud computing, infrastructure automation, and monitoring, with a deep understanding of modern DevOps and SRE practices.
Responsibilities:
- Architect, design, and maintain highly available, scalable, and resilient infrastructure to support business-critical applications.
- Lead the implementation and management of Infrastructure as Code (IaC) using AWS CDK, ensuring infrastructure is automated, repeatable, and secure.
- Develop and optimize automation for deployments, configuration management, and infrastructure provisioning across cloud (AWS) and container orchestration platforms (Kubernetes, EKS, ECS).
- Enhance and maintain CI/CD pipelines, ensuring smooth and automated application and infrastructure deployments.
- Design and implement monitoring and observability solutions using tools such as Datadog, Prometheus, and Grafana, ensuring proactive identification and resolution of performance bottlenecks and failures.
- Lead incident response and root cause analysis efforts, ensuring high levels of service availability and quick resolution of infrastructure issues.
- Continuously improve infrastructure performance, scalability, and reliability through best practices, automation, and innovation.
- Mentor and coach junior engineers, sharing knowledge, best practices, and expertise in site reliability engineering.
Requirements
Key Attributes and Qualifications:
- 7-10+ years of experience in Site Reliability Engineering, DevOps, or a related field.
- Expertise in cloud computing, particularly AWS, with deep knowledge of infrastructure design and best practices.
- Experience with multi-cloud environments, including Azure and GCP, is highly desirable.
- Proficiency with AWS CDK is essential, with additional experience in Terraform and Ansible considered a strong advantage.
- Strong experience with Kubernetes and container orchestration platforms (EKS, ECS), including deploying, scaling, and managing workloads.
- Advanced scripting and programming skills (Python, Bash, or similar) for automation and infrastructure management.
- In-depth knowledge of monitoring, logging, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
- Preferred knowledge of Content Delivery Networks (CDNs) for optimizing application performance and scalability.
- Excellent communication and leadership skills, with experience mentoring junior engineers and driving technical excellence.
Mandatory Work Experience in Project
- Kubernetes-Docker
- CI/CD Pipeline
- Scripting - Terraform, Helm
- Monitoring
Good to Have
- Application Knowledge (Java/Maven/Angular)