simfluent

Site Reliability Engineer (SRE)

  • Posted an hour ago

Job Description

Role Overview
The Site Reliability Engineer (SRE) ensures our systems are reliable, efficient, and scalable across dynamic production environments. This role bridges development and operations by applying software engineering principles to infrastructure and operational challenges. The SRE will automate processes, enhance observability, and contribute to maintaining high availability, while collaborating closely with engineering and product teams to build resilient cloud-native systems and continuously improve service reliability.

ShyftLabs is a growing data product company, founded in early 2020, that works primarily with Fortune 500 companies. We deliver digital solutions that help accelerate business growth across industries, with a focus on creating value through innovation.

Job Responsibilities

  • Define, monitor, and improve SLOs, SLIs, and error budgets in partnership with product and engineering teams
  • Manage the reliability, performance, and capacity of production systems by implementing changes that enhance system stability and scalability
  • Follow and promote strong software practices and standards that improve maintainability and reliability
  • Automate recurring operational tasks, deployments, and recovery processes to reduce manual toil
  • Support and evolve infrastructure reliability through Infrastructure as Code (IaC) tools such as Terraform, Ansible, or similar
  • Deploy and manage containerized and cloud-native workloads using orchestration tools like Kubernetes
  • Implement and maintain observability tools for metrics, logging, and tracing (e.g., Datadog, Dynatrace, Prometheus, Grafana) to ensure proactive problem detection and fast resolution
  • Contribute to improving security, compliance, and governance practices for production systems
  • Collaborate effectively across development, operations, and business teams to identify and address reliability risks early in the lifecycle
  • Participate in incident response, postmortems, and continuous improvement initiatives within the SRE organization

Requirements


  • Python
  • Java
  • Distributed Systems
  • AWS
  • Azure
  • Google Cloud Platform
  • Docker
  • Kubernetes
  • Terraform
  • Ansible
  • CI/CD
  • Unix/Linux
  • Shell Scripting
  • Datadog
  • Grafana
  • Prometheus
  • Infrastructure as Code (IaC)
  • SLO/SLI Management
  • Incident Response

Preferred Skills


  • Dynatrace
  • Security and Compliance

Qualifications


  • Bachelor's degree in Computer Science, Engineering, or a related field
  • 3-5 years of experience in Site Reliability Engineering, Production Engineering, or related fields
  • Proficiency in programming languages such as Python or Java
  • Solid understanding of distributed systems design and system architecture
  • Experience in cloud platforms such as AWS, Azure, or Google Cloud Platform
  • Strong familiarity with containerization and orchestration (Docker, Kubernetes) and IaC frameworks (Terraform, Ansible)
  • Experience with CI/CD pipelines, Unix systems, and shell scripting
  • Knowledge of observability and alerting tools (Datadog, Grafana, Prometheus, or equivalents)
  • Excellent problem-solving and troubleshooting abilities
  • Strong communication and teamwork skills with a collaborative mindset

Benefits


  • Competitive salary
  • Strong insurance package
  • Extensive learning and development resources

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.


Job ID: 148382319
