Site Reliability Engineer

6-8 Years

Job Description

Key Responsibilities

Platform Design & Architecture

Define and evolve the architecture of the observability platform, integrating logs, metrics, traces, events, and alerts
Establish reference implementations and patterns for integrating observability into cloud-native and monolithic applications
Evaluate and integrate best-in-class telemetry tools (e.g., OpenTelemetry, Prometheus, New Relic, Grafana, Elastic, Splunk)

Governance & Standards

Define enterprise-wide observability standards and maturity models (instrumentation guidelines, SLOs/SLIs, retention policies)
Drive instrumentation consistency across services through libraries, SDKs, and developer onboarding assets
Embed observability standards into CI/CD pipelines, golden paths, and developer enablement frameworks

Platform Engineering & Operations

Build and maintain core observability infrastructure as internal platform services
Ensure the observability platform is highly available, scalable, cost-optimized, and compliant with governance controls
Automate provisioning, onboarding, alerting configuration, and tenant lifecycle management for internal teams

Developer Enablement & Integration

Create self-service capabilities for developers and SREs: instrumentation kits, dashboards and alert templates, troubleshooting guides, and observability sandboxes
Collaborate with Developer Experience and Platform teams to embed observability into the developer workflow and developer portal (Velocity)

Adoption & Support

Lead and support migration and onboarding efforts for application teams
Partner with GPS, ISS, and platform teams to define key use cases and integration paths
Define telemetry baselines and observability KPIs for portfolio-level measurement

Required:

6+ years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles
Deep understanding of observability concepts (logs, metrics, traces, events, SLOs, SLIs, RED/USE models)
Hands-on experience with one or more tools in the observability stack (Grafana, Elastic, Prometheus, Splunk, Datadog, OpenTelemetry)
Strong scripting or automation skills (e.g., Python, Go, Bash, Terraform)
Familiarity with Kubernetes, container orchestration, and cloud-native environments (AWS/Azure)

Preferred: