AI Solutions and Platforms Engineer

PepsiCo

Hyderabad, India

4-8 Years

Posted 7 hours ago

Job Description

Overview

The Agentic AI Observability Senior Engineer is responsible for deploying, integrating, and operating a scaled Agentic AI observability platform across both internal and external agent frameworks. This role focuses on production-ready instrumentation and telemetry pipelines that provide end-to-end visibility across multi-step agent workflows (including planner/executor loops, tool/function calls, RAG retrieval, and memory/state), ensuring reliability, safety, performance, and cost governance at scale.

Responsibilities

Agentic AI Observability at Scale (70%)
Platform Deployment & Operations (Agentic AI Observability at Scale)
Deploy and run the Agentic AI observability platform across dev/uat/prod with HA, resiliency, and controlled rollouts
Implement release automation (CI/CD), canary deployments, rollback strategies, and configuration management for platform components
Own operational readiness: on-call runbooks, incident response, and production support for agent observability services
End-to-End Agent Workflow Tracing (Planner → Tools → Retrieval → Response)
Implement distributed tracing for full agent execution graphs, including correlation across: prompts, intermediate reasoning steps (where permitted), tool calls, external APIs, retrieval queries, and final responses
Enforce consistent trace context propagation, correlation IDs, and semantic conventions across agent services
Build instrumentation patterns to represent agent flows as spans (e.g., plan span, tool span, retrieval span, memory span, response span)
Agent Framework Integrations & Standardized Instrumentation
Deploy and maintain integrations for internal agent frameworks and external ecosystems such as CrewAI, LangChain, Semantic Kernel, AutoGen, and custom orchestrators
Create reusable SDKs/middleware/sidecar patterns for teams to instrument agents with minimal effort
Define and implement tagging standards for: agent name/version, tool name, model provider, prompt template ID, retrieval source, tenant/app, and environment
Agentic AI Telemetry Pipelines & AI-Specific Signals
Build scalable pipelines for agent telemetry (logs/metrics/traces) using OpenTelemetry and platform observability tooling
Capture AI-specific metrics including: token usage, cost per task, tool-call latency, retrieval latency, grounding score proxies, error rates, and agent loop iterations
Implement sampling and redaction strategies for sensitive agent payloads (prompts, responses, retrieved content) aligned to governance requirements

Collaboration with Teams (10%)

Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains.
Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability.

Integration & Deployment (10%)

Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
Drive best practices for secure, scalable, and cost-effective agent deployments

Continuous Learning (10%)

Stay updated with the latest advancements in AI and machine learning technologies and integrate these into existing or new AI agents.
Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions.

Qualifications

Education: Bachelor's or Master's in Computer Science, AI/ML, Data Science, or a related field.
Experience: 4-8+ years of software engineering experience; 2-3+ years building and observing AI/ML or GenAI applications preferred.
Required Expertise:

Strong hands-on experience deploying observability solutions (Prometheus/Grafana/Elastic/Splunk/Datadog or equivalent)
Deep working knowledge of OpenTelemetry instrumentation and telemetry pipeline operations
Experience observing agentic AI systems: tool/function calls, orchestration, routing, memory/state, and RAG pipelines
Familiarity with CrewAI, LangChain, Semantic Kernel, AutoGen, or similar agent frameworks
Experience with evaluation/quality monitoring and safe logging strategies for LLM systems
FinOps experience for tracking token and GPU spend, chargeback/showback, and cost anomaly detection
Experience implementing data governance controls for AI telemetry (PII redaction, retention, auditability)
Strong Kubernetes experience (AKS/EKS/GKE) including Helm, operators, ingress, and service networking
Strong automation skills (Python/Bash/Go) and CI/CD experience
Infrastructure-as-Code (Terraform/Bicep/CloudFormation)
Agent workflow tracing and telemetry correlation
Production operations and debugging distributed systems
Observability-as-a-platform enablement and automation
Strong documentation, collaboration, and stakeholder influence
Technical Proficiency: Implement monitoring for agent failure modes: tool-call failures, infinite loops, timeouts, hallucination risk signals, retrieval misses, and degraded response quality. Create alerts aligned to operational SLOs (availability, latency, tool reliability) and AI-specific indicators (cost spikes, loop bursts, retrieval anomalies). Support guardrail observability: policy blocks, content filtering events, and safety classifier outcomes (where applicable). Build onboarding automation (IaC, templates, CI checks) that makes observability default-on for all agentic services.
Problem-Solving: Ability to translate business challenges into technical solutions.
Collaboration Skills: Effective at working within cross-functional teams.
Agility: Flexibility to adapt to changing requirements and new technologies.
Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.