Company Overview
Trianz is an applied AI solutions company that accelerates customer business transformation through an AI-powered Transformation Services as a Software model. With 25+ years of transforming enterprises, we've evolved into a product-led, platform-driven organization serving global enterprises across Financial Services, Insurance, Healthcare, Hi-Tech, Manufacturing, and other industries.
With a global presence across 4 continents, our platform portfolio under the unified Concierto brand delivers end-to-end transformations, including solutions for Migrate, Manage, Maximize, Modernize, Insights & Agentic AI, and SecOps - delivered through strategic partnerships with leading hyperscalers.
We're building the premier innovation-led organization in the digital transformation space through AI-first methodologies and data-driven excellence - RevolutionAIzing Transformations.
Role Overview
We're building the observability backbone application that supports a multi-tenant AIOps platform transforming how enterprises monitor, alert on, and respond to incidents.
You'll own the alert ingestion pipeline, metrics architecture, log management strategy, and distributed tracing framework that power production-grade observability for Fortune 500 customers and MSPs.
This isn't configuring tools. This is engineering observability features - from design to deployment.
KEY RESPONSIBILITIES
- Own the observability architecture for Concierto: define standards for metrics collection, structured logging, distributed tracing, and alerting across cloud and on-premises workloads
- Engineer microservice features in Node.js (and/or Python) that power alert ingestion, deduplication, correlation, and root-cause analysis within the platform's alert-processor service
- Integrate and extend open-source observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo, Alertmanager) within Kubernetes (EKS) environments
- Design and implement alert correlation rules, noise-reduction algorithms, and intelligent routing logic to reduce MTTR for platform customers
- Define SLIs, SLOs, and error budgets for the platform itself, and build dashboards that surface reliability signals to engineering and customer stakeholders
- Collaborate with AI/Bedrock engineering to feed observability telemetry into AIOps models for anomaly detection and predictive alerting
- Evaluate and recommend commercial or open-source tooling additions (e.g., Thanos, VictoriaMetrics, Jaeger, Pixie) and lead POC implementations
- Act as the go-to SME for questions on monitoring gaps, alerting fidelity, and observability integration
- Conduct knowledge-transfer sessions, write technical RFCs, and author platform observability documentation
Ideal Candidate Profile
Experience
- 7+ years of software engineering experience
- 2+ years of hands-on experience in observability and monitoring
- Expert-level knowledge of the three pillars of observability: metrics, logs, and traces
- Deep hands-on experience with Prometheus (PromQL, recording rules, alert rules) and Grafana (dashboards, provisioning, data sources)
- Proficiency with OpenTelemetry (OTel) SDK instrumentation in Node.js and/or Python services
- Experience with log aggregation: Loki, Elasticsearch / OpenSearch, Fluentd/Fluent Bit, or similar
- Distributed tracing with Tempo, Jaeger, Zipkin, or AWS X-Ray
- Alert management: Alertmanager, PagerDuty, or ServiceNow ITSM event routing integration
REQUIRED SKILLS & QUALIFICATIONS
Product Engineering
- Hands-on backend engineering in Node.js (Express / Fastify) or Python (FastAPI / Flask) for building production microservices
- Experience designing event-driven pipelines using Apache Kafka, AWS SQS/SNS, or similar message brokers
- Proficiency with relational databases (PostgreSQL): query optimisation, schema design, and index tuning
- RESTful and/or gRPC API design and integration skills
- Familiarity with containerisation (Docker) and Kubernetes workload management
Cloud & Platform
- Hands-on experience deploying and operating observability stacks on AWS (CloudWatch, Amazon Managed Service for Prometheus, Amazon Managed Grafana) and/or Azure Monitor
- EKS / AKS cluster-level observability: node exporter, kube-state-metrics, cAdvisor, cost metrics
- Understanding of multi-tenant data isolation requirements in observability platforms
NICE TO HAVE
- Experience with AI/ML-enhanced observability: anomaly detection, predictive alerting, or log intelligence (e.g., Amazon Bedrock, Azure AI, or open-source MLflow integration)
- Familiarity with eBPF-based observability tools (Cilium, Pixie, Hubble)
- Exposure to commercial APM tools: Dynatrace, Datadog, New Relic, AppDynamics
- Knowledge of SRE practices: chaos engineering, game days, incident post-mortem facilitation
- Experience in MSP or multi-tenant SaaS product environments
Why choose Trianz
- Personal Growth: Startup agility with enterprise impact. Experience rapid innovation cycles while working on Fortune 500 transformations.
- AI-First Future: Where humans and AI revolutionize business. Lead the charge in implementing AI-driven transformation at scale.
- Global Impact: Shape transformation across continents. Work with diverse teams and clients spanning the Americas, Europe, Asia, and beyond.
- Executive Access: Direct impact on Fortune 500 strategies. Work alongside C-suite leaders and influence major business decisions.
- Ownership & Entrepreneurial Spirit: Your ideas, our platform, global impact. A zero-bureaucracy culture with decision-making autonomy and rapid execution.