GoKwik

Senior DevOps Engineer

  • Posted 12 hours ago

Job Description

We are looking for an SRE-focused engineer to join our DevOps team. The role is 80% Site Reliability Engineering and 20% DevOps enablement, with observability, resilience, and incident management at its core. You will lead on-call operations, build world-class observability systems, and drive reliability engineering practices across the organization. You will also collaborate on automation and CI/CD improvements so that services are built and operated for scale. We are an engineering-focused team that continuously invests in tools, tests, processes, and technology. We consider our people our biggest asset and strive to build a culture of continuous learning and growth.

Responsibilities

  • Lead SRE practices for reliability, scaling, and performance of production systems.
  • Lead on-call operations and incident response, ensuring fast resolution and minimising customer impact.
  • Perform deep debugging of production issues across infra, services, and databases.
  • Design and automate self-healing, scalable infrastructure.
  • Architect and implement advanced observability (metrics, logs, traces, SLIs/SLOs, APM) to detect, debug, and prevent outages.
  • Support CI/CD and infra automation (Terraform, Kubernetes, pipelines) as part of DevOps responsibilities (20%).
  • Implement and mature observability practices (SLIs/SLOs, distributed tracing, APM).
  • Mentor junior engineers in incident management and DevOps best practices.
  • Partner with engineering teams on resilient architecture reviews.
  • Research and propose the adoption of new tools and industry best practices to improve infrastructure reliability.
  • Conduct blameless postmortems, improve incident playbooks, and drive prevention culture.

Requirements

  • 5-8 years of experience in SRE / Production Engineering (with some DevOps exposure).
  • Proven expertise in incident management, debugging distributed systems, and on-call operations.
  • Strong background in observability platforms (Prometheus, Grafana, Datadog, OpenTelemetry, or similar).
  • Deep knowledge of cloud infra (AWS/GCP), including networking, scaling, HA/DR.
  • Hands-on with Kubernetes, Terraform, and CI/CD pipelines.
  • Experience with incident frameworks, blameless postmortems, and chaos/resiliency testing.
  • Ability to balance short-term firefighting with long-term reliability engineering.
  • Strong scripting skills (Shell, Python, or Go preferred).

Must-Have Cultural Traits

  • Commitment to fostering a culture of reliability through teamwork, blameless postmortems, continuous learning, and proactive risk management.
  • Relentless focus on delivering 99.99999% uptime without compromising merchant trust or production stability.
  • Passion for building and scaling high-impact infrastructure that supports GoKwik's global marketplace at unprecedented scale.
  • Proactive risk identification and mitigation rather than reactive firefighting.
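
For context on the uptime figure above, an availability target can be converted into an allowed-downtime budget with simple arithmetic. The sketch below is illustrative only: the function name `downtime_budget_seconds`, the 365-day year, and the comparison targets other than 99.99999% are assumptions made for this example, not details from the posting.

```python
# Illustrative sketch: convert an availability target into a downtime budget.
# The helper name and the 365-day year are assumptions for this example.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def downtime_budget_seconds(availability_percent: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the maximum downtime, in seconds, that the target allows
    over the given period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return period_seconds * unavailable_fraction


if __name__ == "__main__":
    # 99.99999% is the figure quoted in the posting; the others are
    # common targets included only for comparison.
    for target in (99.9, 99.99, 99.999, 99.99999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% availability -> {budget:.2f} seconds of downtime per year")
```

Under these assumptions, 99.99999% availability leaves roughly 3.15 seconds of downtime per year, which is why alerting, automated recovery, and error-budget tracking matter at that target.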

This job was posted by Nirvesh Mehrotra from GoKwik.

Job ID: 143848897