AI Solutions and Platforms Engineer


Job Description

Overview

The Agentic AI Observability Senior Engineer is responsible for deploying, integrating, and operating a scaled Agentic AI observability platform across both internal and external agent frameworks. This role focuses on production-ready instrumentation and telemetry pipelines that provide end-to-end visibility across multi-step agent workflows (including planner/executor loops, tool/function calls, RAG retrieval, and memory/state), ensuring reliability, safety, performance, and cost governance at scale.

Responsibilities

  • Agentic AI Observability at Scale (0%)
    • Platform Deployment & Operations
      • Deploy and run the Agentic AI observability platform across dev/uat/prod with HA, resiliency, and controlled rollouts
      • Implement release automation (CI/CD), canary deployments, rollback strategies, and configuration management for platform components
      • Own operational readiness: on-call runbooks, incident response, and production support for agent observability services
    • End-to-End Agent Workflow Tracing (Planner → Tools → Retrieval → Response)
      • Implement distributed tracing for full agent execution graphs, including correlation across prompts, intermediate reasoning steps (where permitted), tool calls, external APIs, retrieval queries, and final responses
      • Enforce consistent trace context propagation, correlation IDs, and semantic conventions across agent services
      • Build instrumentation patterns to represent agent flows as spans (e.g., plan span, tool span, retrieval span, memory span, response span)
    • Agent Framework Integrations & Standardized Instrumentation
      • Deploy and maintain integrations for internal agent frameworks and external ecosystems such as CrewAI, LangChain, Semantic Kernel, AutoGen, and custom orchestrators
      • Create reusable SDKs/middleware/sidecar patterns for teams to instrument agents with minimal effort
      • Define and implement tagging standards for agent name/version, tool name, model provider, prompt template ID, retrieval source, tenant/app, and environment
    • Agentic AI Telemetry Pipelines & AI-Specific Signals
      • Build scalable pipelines for agent telemetry (logs/metrics/traces) using OpenTelemetry and platform observability tooling
      • Capture AI-specific metrics, including token usage, cost per task, tool-call latency, retrieval latency, grounding score proxies, error rates, and agent loop iterations
      • Implement sampling and redaction strategies for sensitive agent payloads (prompts, responses, retrieved content) aligned to governance requirements
  • Collaboration with Teams (10%)
    • Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains.
    • Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability.
  • Integration & Deployment (10%)
    • Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
    • Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
    • Drive best practices for secure, scalable, and cost-effective agent deployments
  • Continuous Learning (10%)
    • Stay updated with the latest advancements in AI and machine learning technologies and integrate these into existing or new AI agents.
    • Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions.
Qualifications

  • Education: Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, or a related field.
  • Experience: 4-8+ years of software engineering experience; 2-3+ years building and observing AI/ML or GenAI applications preferred.
  • Required Expertise:
    • Strong hands-on experience deploying observability solutions (Prometheus/Grafana/Elastic/Splunk/Datadog or equivalent)
    • Deep working knowledge of OpenTelemetry instrumentation and telemetry pipeline operations
    • Experience observing agentic AI systems: tool/function calls, orchestration, routing, memory/state, and RAG pipelines
    • Familiarity with CrewAI, LangChain, Semantic Kernel, AutoGen, or similar agent frameworks
    • Experience with evaluation/quality monitoring and safe logging strategies for LLM systems
    • FinOps experience for tracking token and GPU spend, chargeback/showback, and cost anomaly detection
    • Experience implementing data governance controls for AI telemetry (PII redaction, retention, auditability)
    • Strong Kubernetes experience (AKS/EKS/GKE) including Helm, operators, ingress, and service networking
    • Strong automation skills (Python/Bash/Go) and CI/CD experience
    • Infrastructure-as-Code (Terraform/Bicep/CloudFormation)
    • Agent workflow tracing and telemetry correlation
    • Production operations and debugging distributed systems
    • Observability-as-a-platform enablement and automation
    • Strong documentation, collaboration, and stakeholder influence
    • Technical Proficiency:
      • Implement monitoring for agent failure modes: tool-call failures, infinite loops, timeouts, hallucination risk signals, retrieval misses, and degraded response quality.
      • Create alerts aligned to operational SLOs (availability, latency, tool reliability) and AI-specific indicators (cost spikes, loop bursts, retrieval anomalies).
      • Support guardrail observability: policy blocks, content filtering events, and safety classifier outcomes (where applicable).
      • Build onboarding automation (IaC, templates, CI checks) that makes observability default-on for all agentic services.
    • Problem-Solving: Ability to translate business challenges into technical solutions.
    • Collaboration Skills: Effective at working within cross-functional teams.
    • Agility: Flexibility to adapt to changing requirements and new technologies.
    • Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.

Job ID: 148093373

Hyderabad, India

Skills:

Kubernetes, Splunk, Grafana, Datadog, Prometheus, CloudFormation, Python, Bash, Terraform, EKS, AKS, Bicep, Go, GKE, OpenTelemetry, Elastic
