
AI Solution and Platform Sr. Engineer


Job Description

Overview

The AI Observability Senior Engineer (L9) is a seasoned individual contributor who partners with Jr. AI Observability Architects to deliver high-quality, production-grade observability capabilities across the enterprise AI platform. This role brings deeper technical experience and greater independent execution capability - owning delivery across one or two specialization tracks with reduced need for day-to-day direction - while working as a genuine peer within the team rather than in a supervisory or mentorship capacity.

The Sr. Engineer is expected to be a strong, self-sufficient technical contributor who can take a complex observability requirement from design through implementation and into production operation within their assigned tracks. They bring cross-track awareness that helps the team as a whole, contribute to engineering standard discussions, and collaborate with peer architects on solving shared technical challenges - all without requiring supervisory authority.


Responsibilities

1. Observability Platform Engineering & OTEL Integration (25%)

  • Design and implement OpenTelemetry (OTEL) instrumentation within one or two assigned agent frameworks or platforms - including custom exporters, span enrichers, semantic convention tagging, and distributed trace context propagation - with the ability to work independently from requirements through to production deployment.
  • Build and maintain telemetry pipeline components (collectors, processors, exporters) that reliably route metrics, logs, traces, and semantic signals to observability backends - owning the full lifecycle of assigned pipeline components including testing, deployment, and on-call support.
  • Contribute to the integration of OTEL with enterprise agentic platforms (Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or internal frameworks) within the assigned scope - implementing instrumentation to architecture patterns established by the L11.
  • Develop and maintain observability dashboards, alerting rules, and SLO/SLA definitions for assigned sub-domains - validating signal quality and tuning alert thresholds to achieve low false-positive rates.
  • Participate in on-call rotations and production incident response - contributing to RCA documentation, runbook authoring, and post-incident improvement actions.
  • Write comprehensive unit, integration, and end-to-end tests for all owned telemetry components - maintaining 80% test coverage across assigned services and proactively identifying gaps in existing coverage.
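The span-enrichment work described above can be sketched in simplified form - here as a plain function over a span's attribute dict, with attribute names loosely modeled on OTEL GenAI semantic conventions (`gen_ai.system`, `gen_ai.usage.input_tokens`). The event shape and the `agent.tool.*` keys are illustrative assumptions, not a specific platform API:

```python
# Illustrative sketch of a span enricher: adds semantic-convention-style
# attributes to an agent tool-call span before export. Attribute names
# loosely follow OTEL GenAI semantic conventions; the event shape is assumed.

def enrich_tool_call_span(attributes: dict, event: dict) -> dict:
    """Return a copy of `attributes` tagged with agent/tool metadata."""
    enriched = dict(attributes)
    enriched["gen_ai.system"] = event.get("provider", "unknown")
    enriched["gen_ai.request.model"] = event.get("model", "unknown")
    enriched["agent.tool.name"] = event.get("tool", "unknown")  # assumed key
    enriched["agent.tool.success"] = bool(event.get("ok", False))
    if "input_tokens" in event:
        enriched["gen_ai.usage.input_tokens"] = int(event["input_tokens"])
    return enriched

span_attrs = enrich_tool_call_span(
    {"service.name": "agent-gateway"},
    {"provider": "openai", "model": "gpt-4o", "tool": "search", "ok": True,
     "input_tokens": 512},
)
print(span_attrs["agent.tool.name"])  # search
```

In a real pipeline this logic would live in an OTEL span processor so every exported tool-call span carries the same tags.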

2. Safety, Security & Red Teaming Observability (15%)

  • Implement safety-critical signal capture within assigned telemetry pipelines - building reliable instrumentation for guardrail trigger rates, policy violation events, adversarial detection flags, hallucination indicators, and trust boundary crossing alerts.
  • Build observability components that support red team exercises - instrumenting assigned agent systems to capture adversarial test events, behavioral deviations, and attack surface signals in a measurable, repeatable way.
  • Implement secure trace handling patterns within assigned pipelines - applying data masking, PII redaction, and audit-log retention configurations as specified by the security architecture.
  • Contribute to the Security Observability Playbook - documenting assigned instrumentation patterns, updating escalation procedures based on observed incidents, and maintaining accuracy of the playbook sections within scope.
  • Monitor agent-to-agent protocol traffic (A2A, UCP, AP2) within the assigned domain for anomalous patterns - flagging deviations for review in a timely manner with sufficient diagnostic context.
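The secure trace handling responsibility above (data masking, PII redaction) can be sketched minimally with stdlib regex - the patterns and replacement tokens here are illustrative; a production pipeline would follow the masking specification set by the security architecture:

```python
import re

# Minimal sketch of PII redaction applied to trace payloads before export.
# Patterns and replacement tokens are illustrative, not a prescribed spec.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(text: str) -> str:
    """Mask email addresses and SSN-shaped strings in a span payload."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SSN_RE.sub("[REDACTED_SSN]", text)

sample = "user jane.doe@example.com reported issue; ssn 123-45-6789"
print(redact_pii(sample))
```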

3. Responsible AI (RAI) & Governance Signal Instrumentation (10%)

  • Implement RAI signal collectors within assigned agent workflows - building reliable pipelines that capture fairness indicators, bias detection outputs, explainability scores, and content safety classifications with validated data quality.
  • Maintain RAI telemetry pipelines within scope - ensuring completeness, accuracy, and timeliness of governance signals that feed into compliance dashboards, and resolving data quality issues proactively.
  • Ensure all AI decision traces within the assigned domain include required governance metadata and comply with retention policies - contributing to the audit-readiness of the observability platform.
  • Identify and document RAI signal coverage gaps within the assigned scope - reporting findings to the L11 with sufficient detail to inform remediation planning.

4. Quality Engineering for Agentic Solutions - Post Go-Live & Continuous QE (15%)

  • Build and maintain quality gate components within CI/CD pipelines for assigned agent services - implementing regression detection logic, performance degradation alerts, and SLA breach notifications using production observability data.
  • Instrument and monitor Skill Evaluations (evals) across assigned Memory, Skills, and MCP harness components - collecting eval telemetry, tracking pass/fail trends over time, and alerting on regression thresholds with appropriate context.
  • Implement continuous quality monitoring for post-go-live agentic solutions within scope - tracking agent success rates, tool-call fidelity, latency distributions, and user outcome proxies against defined baselines.
  • Execute structured testing of new agent capabilities using standardized eval harnesses - documenting results clearly, flagging anomalies, and contributing findings to quality improvement cycles.
  • Build and maintain automated quality reports and metric dashboards for assigned areas - ensuring stakeholders have timely, accurate visibility into agent behavioral quality and trend direction.
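The eval-regression alerting described above reduces to comparing the latest pass rate against a rolling baseline - a minimal sketch, assuming an illustrative window size and drop threshold (real values would come from the team's defined baselines):

```python
from collections import deque

# Simplified sketch of eval regression detection: flag a regression when the
# latest pass rate drops more than `max_drop` below the rolling baseline.
# Window size and threshold are illustrative assumptions.

def detect_regression(history, latest, window=5, max_drop=0.05):
    """Return True if `latest` pass rate regressed vs. the rolling baseline."""
    recent = list(history)[-window:]
    if not recent:
        return False
    baseline = sum(recent) / len(recent)
    return (baseline - latest) > max_drop

pass_rates = deque([0.92, 0.93, 0.91, 0.94, 0.92], maxlen=20)
print(detect_regression(pass_rates, 0.93))  # False: within normal variation
print(detect_regression(pass_rates, 0.80))  # True: regression flagged
```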

5. Memory, Skills, MCP & Harness Engineering Observability (10%)

  • Instrument agent memory operations within the assigned scope - building reliable monitoring of read/write latency, cache hit rates, memory staleness, and semantic drift across episodic, semantic, and working memory backends.
  • Add trace instrumentation to MCP server interactions within assigned components - implementing OTEL semantic tagging for tool registrations, skill invocations, context injections, and result returns.
  • Capture telemetry for self-evolving harness and RL system components as assigned - implementing signal capture for reward distributions, policy update events, environment state transitions, and convergence indicators.
  • Monitor eval harness execution within assigned scope - building detection for flaky eval environments, setup failures, and result inconsistencies that could obscure real capability regressions.

6. Python Engineering & Data Science Observability (10%)

  • Write production-quality Python for assigned observability components - custom OTEL exporters, signal aggregators, data transformation pipelines, and anomaly detection logic - consistently meeting team engineering standards for code quality, testing, and documentation.
  • Apply data science methods to assigned telemetry data - time-series analysis, statistical threshold tuning, distribution characterization - to improve signal accuracy and reduce alerting noise within the assigned domain.
  • Contribute to shared Python SDK and library components - implementing well-tested, documented additions that improve OTEL onboarding experience for agent developers.
  • Actively participate in code reviews - both receiving feedback from peers and the L11, and contributing constructive technical review of peer engineers' pull requests within areas of expertise.

7. Agent Fleet, Physical AI & Multi-Modal Observability (5%)

  • Implement telemetry for agent fleet coordination components as assigned - building signal capture for spawn/termination events, inter-agent message traces, load distribution metrics, and fleet-level health indicators.
  • Contribute to observability instrumentation for physical AI or multi-modal pipelines within the assigned scope - focusing on latency, data quality, and reliability signals as directed by the L11 architecture.
  • Document instrumentation patterns for fleet, physical AI, and multi-modal components - ensuring observability approaches are reproducible and transferable to other team members.

8. Agentic Marketplace, Registry & Agent Protocol Observability (5%)

  • Instrument assigned Agentic Marketplace and Agent Registry components with usage telemetry - building signal capture for agent invocations, capability health, adoption patterns, and dependency relationships within scope.
  • Implement protocol observability for assigned A2A, UCP, and AP2 communication flows - capturing message latency, error rates, retry patterns, and trust boundary events with sufficient granularity for incident diagnosis.
  • Contribute to Marketplace Observability Dashboard development - building data connectors, metric calculations, and visualization components for assigned areas as directed.

9. Peer Collaboration, Standards Contribution & Continuous Learning (5%)

  • Collaborate actively and constructively with peer Jr. AI Observability Architects - sharing technical knowledge, co-designing solutions to shared problems, and contributing to a high-quality, high-trust team environment.
  • Contribute to engineering standards discussions - bringing informed technical perspectives on OTEL conventions, instrumentation patterns, and telemetry design decisions based on hands-on experience in assigned tracks.
  • Participate fully in agile ceremonies - sprint planning, stand-ups, retrospectives - contributing accurate estimates, early identification of blockers, and transparent delivery status updates.
  • Stay current with evolving OTEL specifications, agent communication protocols, AI safety research, and observability tooling - proactively applying new knowledge to improve the quality and coverage of assigned work.
  • Contribute to internal documentation, engineering wikis, and instrumentation guides - ensuring that the approaches used in assigned tracks are clearly documented and accessible to the broader team.

Qualifications

  • Bachelor's or Master's degree in Computer Science, Software Engineering, AI/ML, Data Science, or a related technical field.
  • 5-10 years of professional software, AI/ML, or platform engineering experience, with at least 2 years of hands-on observability, distributed systems monitoring, or telemetry pipeline development.
  • Demonstrated experience delivering production-grade software end-to-end - from design through deployment and on-call operation - in a collaborative team environment.
  • Experience working in or adjacent to AI/ML platform, data engineering, or cloud infrastructure roles; exposure to agentic AI systems or LLM pipelines is a strong plus.
    • Technical Proficiency: Implement monitoring for agent failure modes: tool-call failures, infinite loops, timeouts, hallucination risk signals, retrieval misses, and degraded response quality. Create alerts aligned to operational SLOs (availability, latency, tool reliability) and AI-specific indicators (cost spikes, loop bursts, retrieval anomalies). Support guardrail observability: policy blocks, content filtering events, and safety classifier outcomes (where applicable). Build onboarding automation (IaC, templates, CI checks) that makes observability default-on for all agentic services.
    • Problem-Solving: Ability to translate business challenges into technical solutions.
    • Collaboration Skills: Effective at working within cross-functional teams.
    • Agility: Flexibility to adapt to changing requirements and new technologies.
    • Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.
    • Observability & OpenTelemetry: Solid hands-on proficiency with OpenTelemetry (OTEL) SDK instrumentation - custom exporters, collector configuration, semantic conventions, and distributed trace propagation. Able to independently instrument a service and validate signal quality end-to-end.
    • Python Engineering: Strong Python development skills - clean, well-tested, production-ready code. Familiarity with async patterns, type hints, testing frameworks (pytest), and CI/CD integration. Able to build and maintain Python-based telemetry tooling with minimal guidance.
    • Distributed Systems: Good working knowledge of microservices, event streaming (Kafka or equivalent), REST/gRPC API design, and containerized deployment (Docker, Kubernetes). Able to reason about distributed failure modes and their observability implications.
    • Cloud Platforms: Hands-on experience with at least one major cloud provider (Azure, AWS, or GCP) - managed services, IAM, storage, and cost awareness sufficient to make responsible deployment decisions.
    • Data Analysis Applied to Telemetry: Ability to query, analyze, and interpret time-series and log data using Grafana, Datadog, Prometheus, Splunk, or equivalent - including threshold tuning and basic statistical interpretation of signal distributions.
    • CI/CD & DevOps: Working experience with CI/CD pipelines, GitOps practices, automated testing, and infrastructure-as-code concepts sufficient to contribute to and extend existing pipeline configurations.
    • AI/ML Awareness: Familiarity with LLM-based workflows, agentic AI concepts, and common agent patterns (tool/function calling, RAG, memory, multi-step planning) - sufficient to understand observability requirements without needing deep ML expertise.
    • Safety & Security Fundamentals: Basic understanding of AI safety concepts (guardrails, policy enforcement, prompt injection) and data security practices (PII handling, access control, audit logging) as applied to telemetry systems.
    • Quality Engineering Basics: Familiarity with software quality concepts - regression detection, eval frameworks, test harnesses - and the ability to implement quality gate components within CI/CD pipelines using observability data.
    • RAI Awareness: Working knowledge of Responsible AI principles - fairness, explainability, bias - sufficient to implement signal capture pipelines for RAI governance requirements as specified.
    • Direct experience with agentic AI frameworks such as LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI, or Bedrock Agents.
    • Familiarity with MCP (Model Context Protocol), A2A, UCP, or AP2 agent communication protocols.
    • Exposure to reinforcement learning concepts, RL training infrastructure, or self-supervised learning pipelines.
    • Experience contributing to or consuming developer-facing Python SDKs or observability libraries.
    • Background with vector databases (Pinecone, Weaviate, pgvector) or semantic search in the context of RAG pipeline observability.
    • Contributions to open-source observability or AI tooling projects.
    • Familiarity with AI safety frameworks, adversarial ML concepts, or red team tooling applied to LLM systems.
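As an illustration of the telemetry-analysis skills listed above (threshold tuning and statistical interpretation of signal distributions), here is a minimal stdlib sketch that derives a latency alert threshold from observed samples rather than hard-coding one - the percentile choice and safety margin are assumptions, not a prescribed method:

```python
import statistics

# Illustrative sketch of statistical alert-threshold tuning: set the alert
# threshold at the observed p95 latency scaled by a safety margin, so the
# alert tracks the actual signal distribution. Margin and percentile are
# assumed values for the sketch.

def tune_latency_threshold(samples_ms, margin=1.2):
    """Alert threshold = observed p95 latency times a safety margin."""
    p95 = statistics.quantiles(samples_ms, n=20)[18]  # 19th cut point = p95
    return p95 * margin

latencies = [120, 130, 125, 140, 135, 128, 132, 138, 145, 150,
             122, 127, 133, 141, 129, 136, 144, 126, 131, 139]
threshold = tune_latency_threshold(latencies)
```

Re-running this tuning periodically against fresh production samples is one common way to keep false-positive rates low as traffic patterns shift.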

About Company

PepsiCo, Inc. is an American multinational food, snack, and beverage corporation headquartered in Harrison, New York, in the hamlet of Purchase. PepsiCo's business encompasses all aspects of the food and beverage market, overseeing the manufacturing, distribution, and marketing of its products. PepsiCo was formed in 1965 with the merger of the Pepsi-Cola Company and Frito-Lay, Inc., and has since expanded from its namesake product Pepsi-Cola to an immensely diversified range of food and beverage brands. Its largest and most recent acquisition was Pioneer Foods in 2020 for $1.7bn; before that came the Quaker Oats Company in 2001, which added the Gatorade brand to the Pepsi portfolio, and Tropicana Products in 1998.

Job ID: 147338555