nexionpro services

Site Reliability Engineer

  • Posted 21 hours ago

Job Description

SRE Observability Developer

Location: Hyderabad | Exp: 5–10 Years | Focus: Observability-as-Code & Automation

Role Overview

We are hiring a Site Reliability Engineer (SRE) to mature the observability and root cause analysis (RCA) capabilities of our high-scale UPI payment platforms. This is a hands-on, code-driven role focused on building reliable telemetry pipelines, transaction correlation, and automated alerting frameworks. You will treat monitoring configurations as code to ensure consistent, scalable operational intelligence.

Key Responsibilities

  • Telemetry Standardization: Build and standardize metrics, logs, and traces across app, middleware, and infra layers. Implement custom tags/attributes for unified drill-down dashboards.
  • Transaction Correlation: Enable correlation for asynchronous UPI flows to provide end-to-end visibility across distributed services.
  • SLO & Alert Engineering: Define Golden Signals and SLIs for critical journeys (P2P, P2M). Implement Alert-as-Code using config-based anomaly detection and noise-reduction logic.
  • Observability-as-Code: Automate the provisioning of Grafana dashboards, alert rules, and collector configurations (OTel/Fluentd) using version-controlled scripts.
  • RCA & Intelligence: Build RCA-focused views for Redis, Kafka, YugabyteDB, and Nginx. Use synthetic monitoring and black-box exporters to gain visibility into partially controlled systems.
  • Operational Integration: Convert incident learnings into automated telemetry patterns. Embed observability validation into deployment and release workflows.

Must-Have Skills

1. Observability Stack

  • Expertise: Prometheus/VictoriaMetrics, VictoriaLogs/VictoriaTraces, OpenTelemetry (OTel), and Fluentd.
  • Tooling: Advanced Grafana, Alertmanager, and various infrastructure exporters.
  • Development: Ability to develop custom exporters using OpenTelemetry SDKs for unique business/transaction metrics.

2. Systems & Middleware

  • Knowledge: Deep understanding of Redis, Kafka, Nginx, and YugabyteDB (or similar distributed DBs).
  • App Tier: Proficiency with JVM/Spring Boot Actuator metrics and asynchronous request/response patterns.
  • Environment: Experience with high-scale, low-latency platforms; UPI/Payments domain is highly preferred.

3. Scripting & Automation

  • Core Skills: Strong Python and Shell/Bash for automating telemetry validation and collector lifecycle management.
  • Mindset: Ability to treat all monitoring assets (dashboards, rules, configs) as code artifacts.

What We're Looking For

  • An engineer who sees a dashboard as a product of code, not just a UI task.
  • Strong debugging skills across complex, on-prem distributed systems.
  • The ability to bridge the gap between what happened and where the code failed through advanced correlation.

Job ID: 148892661
