Monitoring Specialist

Trianz

Bengaluru, India

7-9 Years

Posted 21 hours ago

Job Description

Company Overview

Trianz is an applied AI solutions company that accelerates customer business transformation through AI-powered Transformation Services as a Software Model. With 25+ years of transforming enterprises, we've evolved to a product-led, platform-driven organization serving global enterprises across Financial Services, Insurance, Healthcare, Hi-Tech, Manufacturing, and other industries.

With a global presence across 4 continents, our platform portfolio under the unified Concierto brand delivers end-to-end transformations including solutions for Migrate, Manage, Maximize, Modernize, Insights & Agentic AI, and SecOps - delivered through strategic partnerships with leading hyperscalers.

We're building the premier innovation-led organization in the digital transformation space through AI-first methodologies and data-driven excellence - RevolutionAIzing Transformations.

Role Overview

We're building the observability backbone that supports a multi-tenant AIOps platform transforming how enterprises monitor, alert on, and respond to incidents.

You'll own the alert ingestion pipeline, metrics architecture, log management strategy, and distributed tracing framework that powers production-grade observability for Fortune 500 customers and MSPs.

This isn't configuring tools. This is engineering observability features - from design to deployment.

Key Responsibilities

Own the observability architecture for Concierto: define standards for metrics collection, structured logging, distributed tracing, and alerting across cloud and on-premises workloads
Engineer microservice features in Node.js (and/or Python) that power alert ingestion, deduplication, correlation, and root-cause analysis within the platform's alert-processor service
Integrate and extend open-source observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo, Alertmanager) within Kubernetes (EKS) environments
Design and implement alert correlation rules, noise-reduction algorithms, and intelligent routing logic to reduce MTTR for platform customers
Define SLIs, SLOs, and error budgets for the platform itself, and build dashboards that surface reliability signals to engineering and customer stakeholders
Collaborate with AI/Bedrock engineering to feed observability telemetry into AIOps models for anomaly detection and predictive alerting
Evaluate and recommend commercial or open-source tooling additions (e.g., Thanos, VictoriaMetrics, Jaeger, Pixie) and lead POC implementations
Act as the go-to SME for monitoring gaps, alerting fidelity, or observability integration questions
Conduct knowledge-transfer sessions, write technical RFCs, and author platform observability documentation

Ideal Candidate Profile

Experience

7+ years of software engineering experience
2+ years of hands-on experience in observability and monitoring
Expert-level knowledge of the three pillars of observability: metrics, logs, and traces
Deep hands-on experience with Prometheus (PromQL, recording rules, alert rules) and Grafana (dashboards, provisioning, data sources)
Proficiency with OpenTelemetry (OTel) SDK instrumentation in Node.js and/or Python services
Experience with log aggregation: Loki, Elasticsearch / OpenSearch, Fluentd/Fluent Bit, or similar
Distributed tracing with Tempo, Jaeger, Zipkin, or AWS X-Ray
Alert management: Alertmanager, PagerDuty, or ServiceNow ITSM event routing integration

Required Skills & Qualifications

Product Engineering

Hands-on backend engineering in Node.js (Express / Fastify) or Python (FastAPI / Flask) for building production microservices
Experience designing event-driven pipelines using Apache Kafka, AWS SQS/SNS, or similar message brokers
Proficiency with relational databases (PostgreSQL): query optimisation, schema design, and index tuning
RESTful and/or gRPC API design and integration skills
Familiarity with containerisation (Docker) and Kubernetes workload management

Cloud & Platform

Hands-on experience deploying and operating observability stacks on AWS (CloudWatch, managed Prometheus, managed Grafana) and/or Azure Monitor
EKS / AKS cluster-level observability: node exporter, kube-state-metrics, cAdvisor, cost metrics
Understanding of multi-tenant data isolation requirements in observability platforms

Nice to Have

Experience with AI/ML-enhanced observability: anomaly detection, predictive alerting, or log intelligence (e.g., AWS Bedrock, Azure AI, or OSS MLflow integration)
Familiarity with eBPF-based observability tools (Cilium, Pixie, Hubble)
Exposure to commercial APM tools: Dynatrace, Datadog, New Relic, AppDynamics
Knowledge of SRE practices: chaos engineering, game days, incident post-mortem facilitation
Experience in MSP or multi-tenant SaaS product environments

Why choose Trianz

Personal Growth: Startup agility with enterprise impact. Experience rapid innovation cycles while working on Fortune 500 transformations.
AI-First Future: Where humans and AI revolutionize business. Lead the charge in implementing AI-driven transformation at scale.
Global Impact: Shape transformation across continents. Work with diverse teams and clients spanning Americas, Europe, Asia, and beyond.
Executive Access: Direct impact on Fortune 500 strategies. Work alongside C-suite leaders and influence major business decisions.
Ownership & Entrepreneurial Spirit: Your ideas, our platform, global impact. Zero bureaucracy culture with decision-making autonomy and rapid execution.