
NetConnect AS

Observability Engineer - Site Reliability

  • Posted a day ago

Job Description


Location : Bangalore, Gurgaon, Pune, Mumbai, Delhi, Chennai, Hyderabad, Noida

Experience : 6 - 9 Years

CTC : 12 - 19 LPA

Notice Period : Immediate to 15 Days

Role Overview

We are seeking an experienced Observability Engineer to design, build, and operate the observability foundation for complex, distributed systems. This role focuses on enabling engineering teams to understand, troubleshoot, and optimize systems using high-quality metrics, logs, traces, and insights.

As an Observability Engineer, you will build the nervous system of the platform: developing scalable telemetry pipelines, defining standards, and empowering teams with actionable visibility. You will work across application, platform, SRE, and infrastructure teams to ensure systems are reliable, performant, cost-efficient, and debuggable at scale.

Key Roles & Responsibilities

Observability Strategy & Architecture:

  • Define and drive the organization's observability strategy, standards, and roadmap.
  • Design comprehensive telemetry architectures for distributed and microservices-based systems.
  • Establish best practices and guidelines for metrics, logging, and tracing.
  • Evaluate, select, and standardize observability tools and platforms.
  • Create reference architectures for instrumentation across multiple technology stacks.
  • Partner with engineering teams to define SLIs, SLOs, and error budgets.
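To make the error-budget arithmetic behind SLO partnerships concrete, here is a minimal sketch in Python. All function names, the SLO target, and the window are illustrative, not taken from this posting.

```python
# Minimal error-budget arithmetic for an availability SLO.
# Names and numbers are illustrative examples only.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an SLO over a rolling window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, window_days: int,
                     bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget

budget = error_budget_minutes(0.999, 30)       # ≈ 43.2 minutes over 30 days
remaining = budget_remaining(0.999, 30, 10.0)  # ≈ 0.77 of the budget left
```

A 99.9% target over 30 days allows roughly 43 minutes of unavailability; once `bad_minutes` exceeds that, the remaining fraction goes negative and the error-budget policy (e.g. a feature freeze) kicks in.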

Instrumentation & Telemetry Engineering

  • Instrument applications and services with metrics, logs, and distributed traces.
  • Implement end-to-end distributed tracing across microservices architectures.
  • Deploy and configure telemetry agents, sidecars, and collectors.
  • Implement OpenTelemetry standards, SDKs, and Collector pipelines.
  • Build custom instrumentation libraries and SDKs across multiple languages.
  • Create auto-instrumentation frameworks to reduce developer effort.
  • Ensure semantic consistency and data quality across all telemetry signals.
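Context propagation, mentioned above, is the mechanism that stitches spans from different services into one trace. The sketch below builds and parses the W3C `traceparent` header (the format OpenTelemetry's propagators use) with only the standard library; a real service would rely on the OpenTelemetry SDK rather than hand-rolling this.

```python
# Sketch of W3C Trace Context propagation: the `traceparent` header carries
# trace identity across service boundaries. Pure stdlib; real services
# should use the OpenTelemetry SDK's propagators instead.
import re
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a version-00 traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Extract (trace_id, parent_span_id, sampled) or None if malformed."""
    m = _TRACEPARENT.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)
```

The downstream service parses the header, reuses the trace ID, and records the incoming span ID as its parent, which is what lets a tracing backend reassemble the end-to-end request flow.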

Observability Platforms & Tooling

Deploy, manage, and optimize metrics platforms such as:

  • Prometheus, Grafana, Datadog, New Relic, Dynatrace, AppDynamics
  • Cloud-native platforms (AWS CloudWatch, Azure Monitor, GCP Monitoring)
  • Long-term storage solutions (Thanos, Mimir, VictoriaMetrics)

Deploy and manage logging platforms, including:

  • ELK Stack, Splunk, Loki, Fluentd, Sumo Logic
  • Cloud-native logging (CloudWatch Logs, Azure Log Analytics, GCP Logging)

Deploy and manage distributed tracing tools such as:

  • Jaeger, Zipkin, Datadog APM, New Relic APM, Dynatrace, Lightstep
  • Optimize observability platforms for performance, scalability, and cost.

Dashboards, Alerting & Incident Enablement

  • Design and build comprehensive dashboards:
      • Service-level dashboards with Golden Signals (latency, traffic, errors, saturation)
      • Executive dashboards for SLO compliance and business KPIs
      • Real-time operational and on-call dashboards
  • Design intelligent alerting strategies to reduce alert fatigue.
  • Implement multi-signal alert correlation, anomaly detection, and adaptive thresholds.
  • Integrate with incident management tools (PagerDuty, Opsgenie, VictorOps).
  • Configure alert routing, escalation policies, suppression, and maintenance windows.
  • Enable self-healing automation triggered by alerts.
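One small building block of alert-fatigue reduction is deduplication with a suppression window: an alert identity that has already notified recently stays quiet. A minimal sketch, with an illustrative fingerprint and window:

```python
# Sketch of alert deduplication with a suppression window, one piece of an
# alert-fatigue-reduction strategy. Fingerprints and windows are illustrative.
from dataclasses import dataclass, field

@dataclass
class AlertDeduper:
    suppression_seconds: float = 300.0            # re-notify at most every 5 min
    _last_sent: dict = field(default_factory=dict)

    def should_notify(self, fingerprint: str, now: float) -> bool:
        """True only if this alert identity hasn't fired within the window."""
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.suppression_seconds:
            return False
        self._last_sent[fingerprint] = now
        return True

dedup = AlertDeduper()
dedup.should_notify("api:high_latency", now=0.0)    # True  (first firing)
dedup.should_notify("api:high_latency", now=60.0)   # False (suppressed)
dedup.should_notify("api:high_latency", now=400.0)  # True  (window elapsed)
```

Production incident-management tools (PagerDuty, Opsgenie) implement this and much more, such as grouping and escalation; the sketch only illustrates the core idea.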

Logging & Trace Engineering

  • Design and implement centralized logging architectures.
  • Build log ingestion, parsing, enrichment, and normalization pipelines.
  • Define structured logging standards (JSON, key-value).
  • Implement log sampling and retention strategies for high-volume systems.
  • Create log-based metrics and alerts.
  • Ensure data privacy, compliance, and retention policies are enforced.
  • Implement trace sampling strategies to balance cost and visibility.
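As a concrete example of a structured (JSON) logging standard, the following stdlib-only formatter emits one JSON object per log line. The field names are an assumed convention for illustration, not an organizational standard.

```python
# Minimal structured (JSON) logging formatter using only the stdlib.
# Field names ("ts", "level", "ctx", ...) are an assumed convention.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured context passed via `extra={"ctx": {...}}`.
        payload.update(getattr(record, "ctx", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("payment accepted", extra={"ctx": {"order_id": "o-123", "ms": 42}})
```

Because every line is machine-parseable, log-based metrics and alerts (e.g. counting `level == "ERROR"` per service) become simple pipeline queries instead of fragile regex extraction.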

Performance Analysis & Optimization

  • Conduct deep-dive performance investigations using telemetry data.
  • Identify bottlenecks, latency contributors, and error propagation paths.
  • Build capacity planning models using observability data.
  • Analyze resource utilization (CPU, memory, disk, network).
  • Create cost attribution and optimization insights from telemetry.
  • Map service dependencies and request flows across distributed systems.
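Latency investigations usually start from percentiles rather than averages, since a handful of slow requests can hide behind a healthy mean. A sketch of nearest-rank percentile estimation over raw durations (the sample data is illustrative):

```python
# Sketch of a latency deep-dive primitive: nearest-rank percentile
# estimation over raw request durations. Sample data is illustrative.
import math

def percentile(samples, p: float) -> float:
    """Nearest-rank percentile: value at rank ceil(p% of n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 240, 14, 13, 16, 980, 15, 14]
p50 = percentile(latencies_ms, 50)   # 14 ms: the typical request is fine
p99 = percentile(latencies_ms, 99)   # 980 ms: the tail tells a different story
```

The mean of this sample is ~132 ms, which describes no actual request; the p50/p99 split is what points the investigation at the slow tail.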

Telemetry Pipelines & Cost Optimization

  • Build and optimize telemetry data pipelines (filtering, routing, transformation).
  • Manage cardinality, storage costs, and data volumes effectively.
  • Implement sampling, aggregation, and retention strategies.
  • Ensure high data quality and completeness.
  • Build export pipelines for analytics, compliance, and archival use cases.
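One common cardinality-control tactic in such pipelines is stripping unbounded labels (user IDs, request IDs) before metrics reach storage, so the number of distinct series stays bounded. A sketch, with illustrative label names:

```python
# Sketch of cardinality control in a telemetry pipeline: drop labels with
# unbounded value sets before aggregation. Label names are illustrative.
from collections import Counter

DROP_LABELS = {"user_id", "request_id", "session_id"}  # unbounded-value labels

def reduce_cardinality(sample_labels: dict) -> tuple:
    """Strip unbounded labels, yielding a bounded, hashable series identity."""
    return tuple(sorted((k, v) for k, v in sample_labels.items()
                        if k not in DROP_LABELS))

raw = [
    {"service": "api", "route": "/pay", "user_id": "u1"},
    {"service": "api", "route": "/pay", "user_id": "u2"},
    {"service": "api", "route": "/pay", "user_id": "u3"},
]
series = Counter(reduce_cardinality(s) for s in raw)
# Three raw samples collapse into a single (service, route) series.
```

Without the drop step, each user would mint a new time series; with it, series count grows with routes and services, which is what keeps TSDB memory and storage costs predictable.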

Enablement, Automation & DevEx

  • Build self-service observability frameworks and tooling.
  • Integrate observability into CI/CD pipelines (Observability-as-Code).
  • Automate dashboard and alert provisioning.
  • Develop APIs, plugins, and extensions for observability platforms.
  • Create documentation, tutorials, templates, and best-practice guides.
  • Conduct training sessions and provide observability consulting to teams.
  • Participate in code reviews to validate instrumentation quality.
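Dashboard provisioning as code, in spirit, means generating dashboard definitions from a source of truth instead of hand-editing them in a UI. The sketch below uses a deliberately simplified schema that is not the real Grafana dashboard model; service names are illustrative.

```python
# Sketch of dashboards-as-code: generate per-service golden-signal dashboard
# definitions from a service list. The schema is simplified and NOT the
# actual Grafana JSON model.
import json

def make_dashboard(service: str) -> dict:
    return {
        "title": f"{service} - golden signals",
        "panels": [
            {"title": f"{service} {signal}", "type": "timeseries"}
            for signal in ("latency", "traffic", "errors", "saturation")
        ],
    }

dashboards = {svc: make_dashboard(svc) for svc in ("checkout", "search")}
serialized = json.dumps(dashboards["checkout"], indent=2)  # ready to provision
```

In a CI/CD pipeline, the generated JSON would be validated and pushed through the platform's provisioning API on every merge, so dashboards stay consistent across hundreds of services.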

Required Skills & Experience

Core Observability Expertise:

  • Strong understanding of metrics types (counters, gauges, histograms, summaries).
  • Deep expertise in PromQL and time-series data modeling.
  • Strong knowledge of logging pipelines, parsing (Grok/Regex/JSON), and SPL.
  • Deep understanding of distributed tracing concepts, context propagation, and sampling.
  • Hands-on experience with OpenTelemetry specifications and implementations.
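To make the metric-type distinction concrete, here are toy implementations of a counter, a gauge, and a bucketed histogram. Real systems use a client library (e.g. a Prometheus client); this is a conceptual sketch only.

```python
# Toy versions of the three most common metric types. Conceptual sketch;
# real instrumentation should use a client library, not hand-rolled classes.
import bisect

class Counter:
    """Monotonically increasing total (e.g. requests served)."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount: float = 1.0):
        assert amount >= 0, "counters never decrease"
        self.value += amount

class Gauge:
    """Point-in-time value that can go up or down (e.g. queue depth)."""
    def __init__(self):
        self.value = 0.0
    def set(self, v: float):
        self.value = v

class Histogram:
    """Counts of observations per upper-bound bucket (e.g. latency)."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.buckets = [0] * (len(self.bounds) + 1)  # last bucket = +Inf
        self.total, self.count = 0.0, 0
    def observe(self, v: float):
        # bisect_left gives "value <= bound" semantics, like Prometheus `le`.
        self.buckets[bisect.bisect_left(self.bounds, v)] += 1
        self.total += v
        self.count += 1
```

Counters only go up (rates are derived by the query engine), gauges snapshot a current state, and histograms trade raw samples for fixed bucket counts, which is what makes server-side percentile estimation cheap.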

Programming & Platforms

  • Strong proficiency in Python, Go, Java, or Node.js.
  • Ability to instrument and read code across multiple languages.
  • Experience building custom instrumentation libraries and APIs.
  • Familiarity with Kafka, Fluentd, Logstash, or similar data pipelines.
  • Experience with AWS, Azure, or GCP environments.
  • Strong understanding of Kubernetes and container observability.

Professional Experience

  • 6-9 years of experience in observability, SRE, platform engineering, or performance engineering.
  • Proven experience building observability platforms at scale.
  • Experience managing high-cardinality data and observability cost optimization.
  • Strong troubleshooting background in complex distributed systems.

Soft Skills & Mindset

  • Strong analytical and problem-solving skills.
  • Ability to explain complex observability concepts to engineers and leadership.
  • Empathy for developer experience and operational pain points.
  • Strong documentation, training, and enablement capabilities.
  • High attention to detail and data quality.
  • Curiosity-driven mindset with passion for system internals and reliability.

Certifications (Preferred)

  • Prometheus Certified Associate (PCA)
  • Datadog / New Relic Observability Certifications
  • AWS / Azure / GCP Observability Certifications
  • Certified Kubernetes Administrator (CKA)
  • OpenTelemetry certifications (when available)

Nice-to-Have Experience

  • Real User Monitoring (RUM) and frontend observability.
  • Continuous profiling (Pyroscope, Google Cloud Profiler).
  • Chaos engineering and observability correlation.
  • ML-driven anomaly detection and predictive analytics.
  • FinOps and observability cost optimization.
  • eBPF-based observability tools (Pixie, Cilium).
  • Contributions to open-source observability projects.

Education

Bachelor's degree in Computer Science, Engineering, or a related field.

(ref:hirist.tech)

Job ID: 138534435