We are looking for an SRE-focused Engineer to join our DevOps team. This role is 80% Site Reliability Engineering and 20% DevOps enablement, with observability, resilience, and incident management at its core. You will lead on-call operations, build world-class observability systems, and drive reliability engineering practices across the organisation. Alongside this, you will collaborate on automation and CI/CD improvements to ensure services are built and operated for scale.
We are an engineering-first team that continuously invests in tools, tests, processes, and technology. We consider our people our biggest asset and strive to build a culture of continuous learning and growth.
Responsibilities
- Lead SRE practices for reliability, scaling, and performance of production systems.
- Lead on-call operations and incident response, ensuring fast resolution and minimal customer impact.
- Perform deep debugging of production issues across infrastructure, services, and databases.
- Design and automate self-healing, scalable infrastructure.
- Architect, implement, and mature advanced observability (metrics, logs, distributed traces, SLIs/SLOs, APM) to detect, debug, and prevent outages.
- Support CI/CD and infrastructure automation (Terraform, Kubernetes, pipelines) as part of DevOps responsibilities (20%).
- Mentor junior engineers in incident management and DevOps best practices.
- Partner with engineering teams on resilient architecture reviews and reliability improvements.
- Drive adoption of new tools and best practices to enhance infrastructure reliability.
- Conduct blameless postmortems, improve incident playbooks, and build a strong prevention culture.
Requirements
- 5-8 years of experience in SRE / Production Engineering, with some DevOps exposure.
- Strong expertise in incident management, debugging distributed systems, and on-call operations.
- Strong background in observability platforms such as Prometheus, Grafana, Datadog, OpenTelemetry, or similar.
- Deep knowledge of cloud infrastructure (AWS/GCP), including networking, scaling, and HA/DR setups.
- Hands-on experience with Kubernetes, Terraform, and CI/CD pipelines.
- Experience with incident frameworks, blameless postmortems, chaos engineering, and resiliency testing.
- Ability to balance short-term firefighting with long-term reliability engineering.
- Strong scripting skills (Shell, Python, or Go preferred).
This job was posted by Mahima Saraswat from GoKwik.