Key Responsibilities
Platform Design & Architecture
- Define and evolve the architecture of the observability platform, integrating logs, metrics, traces, events, and alerts
- Establish reference implementations and patterns for integrating observability into cloud-native and monolithic applications
- Evaluate and integrate best-in-class tools for telemetry (e.g., OpenTelemetry, Prometheus, New Relic, Grafana, Elastic, Splunk)
Governance & Standards
- Define enterprise-wide observability standards and maturity models (instrumentation guidelines, SLOs/SLIs, retention policies)
- Drive instrumentation consistency across services through libraries, SDKs, and developer onboarding assets
- Embed observability standards into CI/CD pipelines, golden paths, and developer enablement frameworks
Platform Engineering & Operations
- Build and maintain core observability infrastructure as internal platform services
- Ensure the observability platform is highly available, scalable, cost-optimized, and compliant with governance controls
- Automate provisioning, onboarding, alerting configuration, and tenant lifecycle management for internal teams
Developer Enablement & Integration
- Create self-service capabilities for developers and SREs:
  - Instrumentation kits
  - Dashboards and alert templates
  - Troubleshooting guides and observability sandboxes
- Collaborate with Developer Experience and Platform teams to embed observability into the developer workflow and developer portal (Velocity)
Adoption & Support
- Lead and support migration and onboarding efforts for application teams
- Partner with GPS, ISS, and platform teams to define key use cases and integration paths
- Define telemetry baselines and observability KPIs for portfolio-level measurement
Required:
- 6+ years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles
- Deep understanding of observability concepts (logs, metrics, traces, events, SLOs, SLIs, RED/USE models)
- Hands-on experience with one or more tools in the observability stack (Grafana, Elastic, Prometheus, Splunk, Datadog, OpenTelemetry)
- Strong scripting or automation skills (Python, Go, Bash, Terraform, etc.)
- Familiarity with Kubernetes, container orchestration, and cloud-native environments (AWS/Azure)
Preferred:
- Experience designing or operating an enterprise-wide observability platform
- Exposure to multi-tenant observability systems, billing, or usage metering
- Knowledge of developer experience workflows and developer portals
- Previous work with standards enforcement and governance-as-code