SRE Senior Engineer

simfluent

Noida, India

7-9 Years

Posted 8 hours ago

Job Description

Role Overview
We are seeking a Senior Site Reliability Engineer (SRE) to play a key role in ensuring the scalability, performance, and reliability of mission-critical systems in a fast-paced, cloud-native environment. This role combines software engineering expertise with a deep understanding of distributed systems, automation, and modern infrastructure to design and operate resilient production environments. The Senior SRE will drive improvements to system reliability, streamline operations through automation, and enhance observability practices across teams. Acting as a technical leader and collaborator, this engineer will partner closely with development, operations, and business teams to embed reliability into every stage of the product lifecycle, fostering a culture of continuous improvement and operational excellence.

ShyftLabs is a growing data product company that was founded in early 2020 and works primarily with Fortune 500 companies. We deliver digital solutions that help accelerate business growth across industries, with a focus on creating value through innovation.

Job Responsibilities

Define, implement, and evolve SLOs, SLIs, and error budgets in collaboration with product and engineering teams
Oversee the reliability, performance, and capacity of production systems, proactively identifying and implementing improvements
Establish and uphold strong software development practices and standards to enhance long-term maintainability and operational excellence
Drive automation for operational tasks, deployments, and recovery playbooks to reduce toil and improve consistency
Design and maintain infrastructure and platform reliability using infrastructure as code tools such as Terraform, Ansible, or similar
Guide the implementation and management of containerized and cloud-native platforms (for example, Kubernetes) with a focus on resilience, scalability, and safe rollouts
Own observability practices and tooling (logging, metrics, tracing, alerting) to ensure proactive detection and fast diagnosis of issues
Champion best practices for security, compliance, and governance in production environments
Collaborate with cross-functional teams to ensure reliability is considered in architecture, design, and release planning
Foster a culture of blameless incident reviews, learning, and continuous improvement within the Reliability Engineering organization

Requirements

Site Reliability Engineering
Python
Java
Distributed Systems
AWS
Azure
Google Cloud Platform
Docker
Kubernetes
Terraform
Ansible
Infrastructure as Code
Prometheus
Grafana
Datadog
PagerDuty
Observability
SLO/SLI Management
System Architecture
Incident Management
Leadership
Mentoring

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related field
7+ years of experience in Site Reliability Engineering, Production Engineering, or related fields
Strong proficiency in programming languages such as Python or Java
Deep understanding of distributed systems design, system architecture, and operational excellence
Experience operating large-scale systems on cloud platforms such as AWS, Azure, or Google Cloud Platform
In-depth knowledge of containerization and orchestration technologies such as Docker and Kubernetes
Experience with infrastructure as code and configuration management tools (for example, Terraform, Ansible, or similar)
Hands-on experience with observability and incident management tools (for example, Prometheus, Grafana, Dynatrace, Datadog, PagerDuty, or equivalents)
Solid understanding of SRE principles
Proven ability to implement and manage observability frameworks for metrics, logging, tracing, alerting, and SLO tracking
Excellent problem-solving, troubleshooting, and communication skills, with the ability to influence and collaborate across teams
Strong leadership skills, with the ability to guide cross-functional teams and mentor less-experienced engineers

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.