Overview
The Agentic AI Observability Senior Engineer is responsible for deploying, integrating, and operating an Agentic AI observability platform at scale across both internal and external agent frameworks. This role focuses on production-ready instrumentation and telemetry pipelines that provide end-to-end visibility across multi-step agent workflows (including planner/executor loops, tool/function calls, RAG retrieval, and memory/state), ensuring reliability, safety, performance, and cost governance at scale.
Responsibilities
- Agentic AI Observability at Scale (0%)
- Platform Deployment & Operations (Agentic AI Observability at Scale)
- Deploy and run the Agentic AI observability platform across dev/uat/prod with HA, resiliency, and controlled rollouts
- Implement release automation (CI/CD), canary deployments, rollback strategies, and configuration management for platform components
- Own operational readiness: on-call runbooks, incident response, and production support for agent observability services
- End-to-End Agent Workflow Tracing (Planner → Tools → Retrieval → Response)
- Implement distributed tracing for full agent execution graphs, correlating prompts, intermediate reasoning steps (where permitted), tool calls, external APIs, retrieval queries, and final responses
- Enforce consistent trace context propagation, correlation IDs, and semantic conventions across agent services
- Build instrumentation patterns to represent agent flows as spans (e.g., plan span, tool span, retrieval span, memory span, response span)
- Agent Framework Integrations & Standardized Instrumentation
- Deploy and maintain integrations for internal agent frameworks and external ecosystems such as CrewAI, LangChain, Semantic Kernel, AutoGen, and custom orchestrators
- Create reusable SDKs/middleware/sidecar patterns for teams to instrument agents with minimal effort
- Define and implement tagging standards for agent name/version, tool name, model provider, prompt template ID, retrieval source, tenant/app, and environment
- Agentic AI Telemetry Pipelines & AI-Specific Signals
- Build scalable pipelines for agent telemetry (logs/metrics/traces) using OpenTelemetry and platform observability tooling
- Capture AI-specific metrics, including token usage, cost per task, tool-call latency, retrieval latency, grounding score proxies, error rates, and agent loop iterations
- Implement sampling and redaction strategies for sensitive agent payloads (prompts, responses, retrieved content) aligned to governance requirements
- Collaboration with Teams (10%)
- Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains.
- Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability.
- Integration & Deployment (10%)
- Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
- Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
- Drive best practices for secure, scalable, and cost-effective agent deployments
- Continuous Learning (10%)
- Stay updated with the latest advancements in AI and machine learning technologies and integrate these into existing or new AI agents.
- Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions.
Qualifications
- Education: Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, or a related field.
- Experience: 4-8+ years of software engineering experience; 2-3+ years building and observing AI/ML or GenAI applications preferred.
- Required Expertise:
- Strong hands-on experience deploying observability solutions (Prometheus/Grafana/Elastic/Splunk/Datadog or equivalent)
- Deep working knowledge of OpenTelemetry instrumentation and telemetry pipeline operations
- Experience observing agentic AI systems: tool/function calls, orchestration, routing, memory/state, and RAG pipelines
- Familiarity with CrewAI, LangChain, Semantic Kernel, AutoGen, or similar agent frameworks
- Experience with evaluation/quality monitoring and safe logging strategies for LLM systems
- FinOps experience for tracking token and GPU spend, chargeback/showback, and cost anomaly detection
- Experience implementing data governance controls for AI telemetry (PII redaction, retention, auditability)
- Strong Kubernetes experience (AKS/EKS/GKE) including Helm, operators, ingress, and service networking
- Strong automation skills (Python/Bash/Go) and CI/CD experience
- Infrastructure-as-Code (Terraform/Bicep/CloudFormation)
- Agent workflow tracing and telemetry correlation
- Production operations and debugging distributed systems
- Observability-as-a-platform enablement and automation
- Strong documentation, collaboration, and stakeholder influence
- Technical Proficiency: Able to implement monitoring for agent failure modes (tool-call failures, infinite loops, timeouts, hallucination risk signals, retrieval misses, and degraded response quality); create alerts aligned to operational SLOs (availability, latency, tool reliability) and AI-specific indicators (cost spikes, loop bursts, retrieval anomalies); support guardrail observability, covering policy blocks, content filtering events, and safety classifier outcomes (where applicable); and build onboarding automation (IaC, templates, CI checks) that makes observability default-on for all agentic services.
- Problem-Solving: Ability to translate business challenges into technical solutions.
- Collaboration Skills: Effective at working within cross-functional teams.
- Agility: Flexibility to adapt to changing requirements and new technologies.
- Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.