Trianz

Monitoring Specialist

Job Description

Company Overview

Trianz is an applied AI solutions company that accelerates customer business transformation through AI-powered Transformation Services as a Software Model. With 25+ years of transforming enterprises, we've evolved into a product-led, platform-driven organization serving global enterprises across Financial Services, Insurance, Healthcare, Hi-Tech, Manufacturing, and other industries.

With a global presence across 4 continents, our platform portfolio under the unified Concierto brand delivers end-to-end transformations, including solutions for Migrate, Manage, Maximize, Modernize, Insights & Agentic AI, and SecOps - delivered through strategic partnerships with leading hyperscalers.

We're building the premier innovation-led organization in the digital transformation space through AI-first methodologies and data-driven excellence - RevolutionAIzing Transformations.

Role Overview

We're building the observability backbone that supports a multi-tenant AIOps platform transforming how enterprises monitor, alert on, and respond to incidents.

You'll own the alert ingestion pipeline, metrics architecture, log management strategy, and distributed tracing framework that power production-grade observability for Fortune 500 customers and MSPs.

This isn't configuring tools. This is engineering observability features - from design to deployment.

KEY RESPONSIBILITIES

  • Own the observability architecture for Concierto: define standards for metrics collection, structured logging, distributed tracing, and alerting across cloud and on-premises workloads
  • Engineer microservice features in Node.js (and/or Python) that power alert ingestion, deduplication, correlation, and root-cause analysis within the platform's alert-processor service
  • Integrate and extend open-source observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo, Alertmanager) within Kubernetes (EKS) environments
  • Design and implement alert correlation rules, noise-reduction algorithms, and intelligent routing logic to reduce MTTR for platform customers
  • Define SLIs, SLOs, and error budgets for the platform itself, and build dashboards that surface reliability signals to engineering and customer stakeholders
  • Collaborate with AI/Bedrock engineering to feed observability telemetry into AIOps models for anomaly detection and predictive alerting
  • Evaluate and recommend commercial or open-source tooling additions (e.g., Thanos, VictoriaMetrics, Jaeger, Pixie) and lead POC implementations
  • Act as the go-to SME for monitoring gaps, alerting fidelity, or observability integration questions
  • Conduct knowledge-transfer sessions, write technical RFCs, and author platform observability documentation

Ideal Candidate Profile

Experience

  • 7+ years of software engineering experience
  • 2+ years of hands-on experience in observability and monitoring
  • Expert-level knowledge of the three pillars of observability: metrics, logs, and traces
  • Deep hands-on experience with Prometheus (PromQL, recording rules, alert rules) and Grafana (dashboards, provisioning, data sources)
  • Proficiency with OpenTelemetry (OTel) SDK instrumentation in Node.js and/or Python services
  • Experience with log aggregation: Loki, Elasticsearch / OpenSearch, Fluentd/Fluent Bit, or similar
  • Distributed tracing with Tempo, Jaeger, Zipkin, or AWS X-Ray
  • Alert management: Alertmanager, PagerDuty, or ServiceNow ITSM event routing integration

REQUIRED SKILLS & QUALIFICATIONS

Product Engineering

  • Hands-on backend engineering in Node.js (Express / Fastify) or Python (FastAPI / Flask) for building production microservices
  • Experience designing event-driven pipelines using Apache Kafka, AWS SQS/SNS, or similar message brokers
  • Proficiency with relational databases (PostgreSQL): query optimisation, schema design, and index tuning
  • RESTful and/or gRPC API design and integration skills
  • Familiarity with containerisation (Docker) and Kubernetes workload management

Cloud & Platform

  • Hands-on experience deploying and operating observability stacks on AWS (CloudWatch, managed Prometheus, managed Grafana) and/or Azure Monitor
  • EKS / AKS cluster-level observability: node exporter, kube-state-metrics, cAdvisor, cost metrics
  • Understanding of multi-tenant data isolation requirements in observability platforms

NICE TO HAVE

  • Experience with AI/ML-enhanced observability: anomaly detection, predictive alerting, or log intelligence (e.g., AWS Bedrock, Azure AI, or OSS MLflow integration)
  • Familiarity with eBPF-based observability tools (Cilium, Pixie, Hubble)
  • Exposure to commercial APM tools: Dynatrace, Datadog, New Relic, AppDynamics
  • Knowledge of SRE practices: chaos engineering, game days, incident post-mortem facilitation
  • Experience in MSP or multi-tenant SaaS product environments

Why choose Trianz

  • Personal Growth: Startup agility with enterprise impact. Experience rapid innovation cycles while working on Fortune 500 transformations.
  • AI-First Future: Where humans and AI revolutionize business. Lead the charge in implementing AI-driven transformation at scale.
  • Global Impact: Shape transformation across continents. Work with diverse teams and clients spanning Americas, Europe, Asia, and beyond.
  • Executive Access: Direct impact on Fortune 500 strategies. Work alongside C-suite leaders and influence major business decisions.
  • Ownership & Entrepreneurial Spirit: Your ideas, our platform, global impact. Zero bureaucracy culture with decision-making autonomy and rapid execution.

Job ID: 149766387

Bengaluru, India

Skills:

Grafana, Dynatrace, AWS CloudWatch, ITOM, AIOps, Moogsoft, Event Management Systems, Azure Monitor, Big Panda, Elastic, Google Cloud Operations Suite