simfluent

SRE Senior Engineer

  • Posted 8 hours ago

Job Description

Role Overview
We are seeking a Senior Site Reliability Engineer (SRE) to play a key role in ensuring the scalability, performance, and reliability of mission-critical systems in a fast-paced, cloud-native environment. This role combines software engineering expertise with a deep understanding of distributed systems, automation, and modern infrastructure to design and operate resilient production environments. The Senior SRE will drive improvements to system reliability, streamline operations through automation, and enhance observability practices across teams. As a technical leader and collaborator, this engineer will partner closely with development, operations, and business teams to embed reliability into every stage of the product lifecycle, fostering a culture of continuous improvement and operational excellence.

ShyftLabs is a growing data product company that was founded in early 2020 and works primarily with Fortune 500 companies. We deliver digital solutions that help accelerate business growth across industries by focusing on creating value through innovation.

Job Responsibilities

  • Define, implement, and evolve SLOs, SLIs, and error budgets in collaboration with product and engineering teams
  • Oversee the reliability, performance, and capacity of production systems, proactively identifying and implementing improvements to enhance system reliability
  • Establish and uphold strong software development practices and standards to enhance long-term maintainability and operational excellence
  • Drive automation for operational tasks, deployments, and recovery playbooks to reduce toil and improve consistency
  • Design and maintain infrastructure and platform reliability using infrastructure as code tools such as Terraform, Ansible, or similar
  • Guide the implementation and management of containerized and cloud-native platforms (for example, Kubernetes) with a focus on resilience, scalability, and safe rollouts
  • Own observability practices and tooling (logging, metrics, tracing, alerting) to ensure proactive detection and fast diagnosis of issues
  • Champion best practices for security, compliance, and governance in production environments
  • Collaborate with cross-functional teams to ensure reliability is considered in architecture, design, and release planning
  • Foster a culture of blameless incident reviews, learning, and continuous improvement within the Reliability Engineering organization

Requirements


  • Site Reliability Engineering
  • Python
  • Java
  • Distributed Systems
  • AWS
  • Azure
  • Google Cloud Platform
  • Docker
  • Kubernetes
  • Terraform
  • Ansible
  • Infrastructure as Code
  • Prometheus
  • Grafana
  • Datadog
  • PagerDuty
  • Observability
  • SLO/SLI Management
  • System Architecture
  • Incident Management
  • Leadership
  • Mentoring

Qualifications


  • Bachelor's degree in Computer Science, Engineering, or a related field
  • 7+ years of experience in Site Reliability Engineering, Production Engineering, or related fields
  • Strong proficiency in programming languages such as Python or Java
  • Deep understanding of distributed systems design, system architecture, and operational excellence
  • Experience operating large-scale systems on cloud platforms such as AWS, Azure, or Google Cloud Platform
  • In-depth knowledge of containerization and orchestration technologies such as Docker and Kubernetes
  • Experience with infrastructure as code and configuration management tools (for example, Terraform, Ansible, or similar)
  • Hands-on experience with observability and incident management tools (for example, Prometheus, Grafana, Dynatrace, Datadog, PagerDuty, or equivalents)
  • Solid understanding of SRE principles
  • Proven ability to implement and manage observability frameworks for metrics, logging, tracing, alerting, and SLO tracking
  • Excellent problem-solving, troubleshooting, and communication skills, with the ability to influence and collaborate across teams
  • Strong leadership skills, with the ability to guide cross-functional teams and mentor less-experienced engineers

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.

Job ID: 148361833