ITC Infotech

Site Reliability Engineer

  • Posted a day ago

Job Description

Key Responsibilities

Platform Design & Architecture

  • Define and evolve the architecture of the observability platform, integrating logs, metrics, traces, events, and alerts
  • Establish reference implementations and patterns for integrating observability into cloud-native and monolithic applications
  • Evaluate and integrate best-in-class tools for telemetry (e.g., OpenTelemetry, Prometheus, New Relic, Grafana, Elastic, Splunk)

Governance & Standards

  • Define enterprise-wide observability standards and maturity models (instrumentation guidelines, SLOs/SLIs, retention policies)
  • Drive instrumentation consistency across services through libraries, SDKs, and developer onboarding assets
  • Embed observability standards into CI/CD pipelines, golden paths, and developer enablement frameworks

Platform Engineering & Operations

  • Build and maintain core observability infrastructure as internal platform services
  • Ensure the observability platform is highly available, scalable, cost-optimized, and compliant with governance controls
  • Automate provisioning, onboarding, alerting configuration, and tenant lifecycle management for internal teams

Developer Enablement & Integration

  • Create self-service capabilities for developers and SREs:
      • Instrumentation kits
      • Dashboards and alert templates
      • Troubleshooting guides and observability sandboxes
  • Collaborate with Developer Experience and Platform teams to embed observability into the developer workflow and developer portal (Velocity)

Adoption & Support

  • Lead and support migration and onboarding efforts for application teams
  • Partner with GPS, ISS, and platform teams to define key use cases and integration paths
  • Define telemetry baselines and observability KPIs for portfolio-level measurement

Required:

  • 6+ years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles
  • Deep understanding of observability concepts (logs, metrics, traces, events, SLOs, SLIs, RED/USE models)
  • Hands-on experience with one or more tools in the observability stack (Grafana, Elastic, Prometheus, Splunk, Datadog, OpenTelemetry)
  • Strong scripting or automation skills (Python, Go, Bash, Terraform, etc.)
  • Familiarity with Kubernetes, container orchestration, and cloud-native environments (AWS/Azure)

Preferred:

  • Experience designing or operating an enterprise-wide observability platform
  • Exposure to multi-tenant observability systems, billing or usage metering
  • Knowledge of developer experience workflows and developer portals
  • Previous work with standards enforcement and governance-as-code

Job ID: 142222295