talentiser

Platform Reliability Engineer

  • Posted 22 hours ago

Job Description

About the Role

We're looking for a Backend Platform Reliability Engineer to own the reliability, operability, and evolution of our internal engineering platform. This is a hands-on role at the intersection of platform engineering, reliability, and intelligent automation, with a clear mandate: reduce toil, improve observability, and enable automated systems, including AI agents, to operate safely at scale.

You'll work directly with engineering teams to harden services, respond to incidents, and build automation that makes the platform increasingly self-managing over time. A key aspect of this role is designing and operating AI-driven and agent-based workflows, including the guardrails, validation systems, and observability needed to allow automated systems to safely generate and act on changes in production environments.

What You'll Do

Own reliability, availability, and performance of the internal platform and critical services

Participate in on-call rotations; lead incident triage, debugging, root cause analysis, and post-mortems

Build and operate platform automation and AI-powered workflows (including agent-based systems) to reduce manual operational effort

Design and implement guardrails, validation pipelines, and safety mechanisms for automated and AI-generated changes to code and infrastructure

Enable closed-loop automation systems (detect → diagnose → remediate → validate) to improve system resilience

Define and track SLIs and SLOs; use reliability data to guide engineering decisions

Standardize build, deployment, and release workflows for safe, predictable delivery, including automation-friendly and AI-integrated pipelines

Identify and remediate security vulnerabilities across systems and services, including risks introduced by automated changes

Partner with development teams on service design, resilience, and operability, with an emphasis on automation-first and AI-compatible system design

Required Qualifications

6+ years of experience operating production platforms or large-scale distributed systems

Proven track record in incident management, on-call operations, and production debugging

Strong programming skills in Python, Java, Go

Hands-on experience with observability tooling (monitoring, alerting, logging, tracing)

Experience building or maintaining CI/CD pipelines and release processes

Familiarity with platform upgrades, dependency management, and system lifecycle operations

Experience building or integrating AI-driven (agent-based) automation frameworks, or strong interest in this space

Working knowledge of Linux-based production environments

Strong communication and cross-team collaboration skills

Nice to Have

Experience with SRE frameworks: SLOs, error budgets, reliability reviews

Experience with chaos engineering or resilience testing

Background in building self-healing systems

History of driving platform standardization across large engineering organizations

What Success Looks Like at 6 Months

Platform reliability metrics are tracked, visible, and trending in the right direction

On-call burden is measurably reduced through automation and better runbooks

At least one significant automation or autonomous remediation initiative shipped and adopted by engineering teams

Platform upgrades and rollouts are executed safely with documented processes

AI-driven or automated changes are safely deployed with clear guardrails, observability, and rollback mechanisms

Key Traits

Strong ownership mentality. Calm under pressure. Bias toward automation. Systems thinker who doesn't just fix problems but builds systems that prevent, detect, and autonomously remediate issues over time.

Job ID: 149385615
