AI Solutions and Platforms Operations Engineer

  • Posted a day ago

Job Description

Overview

The AI Observability Engineer (Agentic Frameworks & AI Agent Operations Center Developer) builds and operationalizes agentic AI solutions using modern orchestration frameworks and contributes to an AI Agent Operations Center that enables safe, reliable, and observable agent behavior at scale. This role focuses on developing agent workflows (planning, tool execution, memory, and RAG), integrating guardrails and evaluations, and delivering operational capabilities such as run management, telemetry, and incident triage for production agents.


Responsibilities

  1. AI Agent Operations Center (70%)
    • Build operations center capabilities for agent runtime management: agent registry, versioning, deployment tracking, and run histories
    • Enable operational workflows such as incident triage, replay/debug runs, trace correlation, and root-cause analysis across agent steps
    • Implement operational dashboards and views for agent health: success rate, latency, tool failure rate, cost per run, and loop detection
    • Instrument agent flows end-to-end using OpenTelemetry (or equivalent), enabling correlation across prompts, tool calls, retrieval, and responses
    • Implement semantic conventions and tagging standards (agent name/version, tool name, model provider, environment, tenant/app)
    • Partner with SRE/observability teams to ensure production-grade monitoring, alerting, and operational readiness
  2. Collaboration with Teams (10%)
    • Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains.
    • Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability.
  3. Integration & Deployment (10%)
    • Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
    • Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
    • Drive best practices for secure, scalable, and cost-effective agent deployments
  4. Continuous Learning (10%)
    • Stay updated with the latest advancements in AI and machine learning technologies and integrate these into existing or new AI agents.
    • Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions.

Qualifications

Key Skills/Experience Required

Minimum Qualifications:

  • Education: Bachelor's in Computer Science, AI/ML, Data Science, or a related field.
  • Experience: 3-5+ years of software engineering experience; 1+ years building and observing AI/ML or GenAI applications preferred
  • Required Expertise:
    • Hands-on experience with agentic frameworks (Crew.ai, LangChain, Semantic Kernel, AutoGen, or similar)
    • Proficiency in Python (primary) and familiarity with APIs/microservices patterns
    • Strong experience with RAG patterns (embeddings, vector search, retrieval evaluation, chunking strategies)
    • Experience with cloud environments (Azure/AWS/GCP) and containerized deployments (Kubernetes/AKS/EKS)
    • Familiarity with observability fundamentals (logs/metrics/traces) and production troubleshooting
    • Experience building internal developer platforms or operational consoles (agent registry, run tracking, dashboards)
    • Familiarity with OpenTelemetry, distributed tracing, and telemetry pipelines
    • Experience with Azure AI Search / vector databases, prompt/version management, and evaluation frameworks
    • Knowledge of Responsible AI practices: data handling, safety guardrails, audit trails, and redaction strategies
    • FinOps exposure: token/GPU cost optimization and chargeback/showback reporting
  • Technical Proficiency: Agent orchestration design (planning, tool execution, memory, RAG); strong engineering discipline (testing, versioning, CI/CD, automation); operational mindset (reliability, debuggability, and incident response support)
  • Problem-Solving: Ability to translate business challenges into technical solutions.
  • Collaboration Skills: Effective at working within cross-functional teams.
  • Agility: Flexibility to adapt to changing requirements and new technologies.
  • Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.

About Company

PepsiCo, Inc. is an American multinational food, snack, and beverage corporation headquartered in Harrison, New York, in the hamlet of Purchase. PepsiCo's business encompasses all aspects of the food and beverage market. It oversees the manufacturing, distribution, and marketing of its products. PepsiCo was formed in 1965 with the merger of the Pepsi-Cola Company and Frito-Lay, Inc. PepsiCo has since expanded from its namesake product, Pepsi-Cola, to an immensely diversified range of food and beverage brands. The largest and most recent acquisition was Pioneer Foods in 2020 for $1.7bn [3]. Earlier acquisitions include the Quaker Oats Company in 2001, which added the Gatorade brand to the Pepsi portfolio, and Tropicana Products in 1998.

Job ID: 147711469