GoKwik

Site Reliability Engineer - Onsite


Job Description

We are looking for an SRE-focused Engineer to join our DevOps team. This role is 80% Site Reliability Engineering and 20% DevOps enablement, with observability, resilience, and incident management at its core. You will lead on-call operations, build world-class observability systems, and drive reliability engineering practices across the organisation. Alongside this, you will collaborate on automation and CI/CD improvements to ensure services are built and operated for scale. We are an engineering-first team that continuously invests in tools, tests, processes, and technology. We consider our people our biggest asset and strive to build a culture of continuous learning and growth.

Responsibilities

  • Lead SRE practices for reliability, scaling, and performance of production systems.
  • Lead on-call operations and incident response, ensuring fast resolution and minimal customer impact.
  • Perform deep debugging of production issues across infrastructure, services, and databases.
  • Design and automate self-healing, scalable infrastructure.
  • Architect, implement, and mature observability practices (metrics, logs, distributed tracing, SLIs/SLOs, APM) to detect, debug, and prevent outages.
  • Support CI/CD and infrastructure automation (Terraform, Kubernetes, pipelines) as part of DevOps responsibilities (20%).
  • Mentor junior engineers in incident management and DevOps best practices.
  • Partner with engineering teams on resilient architecture reviews and reliability improvements.
  • Drive adoption of new tools and best practices to enhance infrastructure reliability.
  • Conduct blameless postmortems, improve incident playbooks, and build a strong prevention culture.

Requirements

  • 5-8 years of experience in SRE / Production Engineering, with some DevOps exposure.
  • Strong expertise in incident management, debugging distributed systems, and on-call operations.
  • Strong background in observability platforms such as Prometheus, Grafana, Datadog, OpenTelemetry, or similar.
  • Deep knowledge of cloud infrastructure (AWS/GCP), including networking, scaling, and HA/DR setups.
  • Hands-on experience with Kubernetes, Terraform, and CI/CD pipelines.
  • Experience with incident frameworks, blameless postmortems, chaos engineering, and resiliency testing.
  • Ability to balance short-term firefighting with long-term reliability engineering.
  • Strong scripting skills (Shell, Python, or Go preferred).

This job was posted by Mahima Saraswat from GoKwik.

Job ID: 143920873