
tsworks

Senior Site Reliability Engineer (SRE)


Job Description

About tsworks:

tsworks is a leading technology innovator, providing transformative products and services designed for the digital-first world. Our mission is to provide domain expertise, innovative solutions and thought leadership to drive exceptional user and customer experiences. Demonstrating this commitment, we have a proven track record of championing digital transformation for industries such as Banking, Travel and Hospitality, and Retail (including e-commerce and omnichannel), as well as Distribution and Supply Chain, delivering impactful solutions that drive efficiency and growth. We take pride in fostering a workplace where your skills, ideas, and attitude shape meaningful customer engagements.

About Team:

We are looking for an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our team and play a key role in ensuring the high availability, scalability, and reliability of our infrastructure. The ideal candidate will have 7+ years of experience in site reliability engineering, cloud computing, infrastructure automation, and monitoring, with a deep understanding of modern DevOps and SRE practices.
Responsibilities:
  • Architect, design, and maintain highly available, scalable, and resilient infrastructure to support business-critical applications.
  • Lead the implementation and management of Infrastructure as Code (IaC) using AWS CDK, ensuring infrastructure is automated, repeatable, and secure.
  • Develop and optimize automation for deployments, configuration management, and infrastructure provisioning across cloud (AWS) and container orchestration platforms (Kubernetes, EKS, ECS).
  • Enhance and maintain CI/CD pipelines, ensuring smooth and automated application and infrastructure deployments.
  • Design and implement monitoring and observability solutions using tools such as Datadog, Prometheus, and Grafana, ensuring proactive identification and resolution of performance bottlenecks and failures.
  • Lead incident response and root cause analysis efforts, ensuring high levels of service availability and quick resolution of infrastructure issues.
  • Continuously improve infrastructure performance, scalability, and reliability through best practices, automation, and innovation.
  • Mentor and coach junior engineers, sharing knowledge, best practices, and expertise in site reliability engineering.

Requirements

Key Attributes and Qualifications:
  • 7-10+ years of experience in Site Reliability Engineering, DevOps, or a related field.
  • Expertise in cloud computing, particularly AWS, with deep knowledge of infrastructure design and best practices.
  • Experience with multi-cloud environments, including Azure and GCP, is highly desirable.
  • Proficiency with AWS CDK is essential, with additional experience in Terraform and Ansible considered a strong advantage.
  • Strong experience with Kubernetes and container orchestration platforms (EKS, ECS), including deploying, scaling, and managing workloads.
  • Advanced scripting and programming skills (Python, Bash, or similar) for automation and infrastructure management.
  • In-depth knowledge of monitoring, logging, and observability tools (Datadog, Prometheus, Grafana, ELK, etc.).
  • Knowledge of Content Delivery Networks (CDNs) for optimizing application performance and scalability is preferred.
  • Excellent communication and leadership skills, with experience mentoring junior engineers and driving technical excellence.

Mandatory Project Work Experience

  • Kubernetes and Docker
  • CI/CD pipelines
  • Scripting: Terraform, Helm
  • Monitoring

Good to Have
  • Application knowledge (Java, Maven, Angular)



Job ID: 145417371
