Overview
The AI Observability Senior Engineer (L9) is a seasoned individual contributor who partners with Jr. AI Observability Architects to deliver high-quality, production-grade observability capabilities across the enterprise AI platform. This role brings deeper technical experience and greater independent execution capability - owning delivery across one or two specialization tracks with reduced need for day-to-day direction - while working as a genuine peer within the team rather than in a supervisory or mentorship capacity.
The Sr. Engineer is expected to be a strong, self-sufficient technical contributor who can take a complex observability requirement from design through implementation and into production operation within their assigned tracks. They bring cross-track awareness that helps the team as a whole, contribute to engineering standards discussions, and collaborate with peer architects on solving shared technical challenges - all without requiring supervisory authority.
Responsibilities
1. Observability Platform Engineering & OTEL Integration (25%)
- Design and implement OpenTelemetry (OTEL) instrumentation within one or two assigned agent frameworks or platforms - including custom exporters, span enrichers, semantic convention tagging, and distributed trace context propagation - with the ability to work independently from requirements through to production deployment (see the instrumentation sketch after this list).
- Build and maintain telemetry pipeline components (collectors, processors, exporters) that reliably route metrics, logs, traces, and semantic signals to observability backends - owning the full lifecycle of assigned pipeline components including testing, deployment, and on-call support.
- Contribute to the integration of OTEL with enterprise agentic platforms (Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or internal frameworks) within the assigned scope - implementing instrumentation according to the architecture patterns established by the L11.
- Develop and maintain observability dashboards, alerting rules, and SLO/SLA definitions for assigned sub-domains - validating signal quality and tuning alert thresholds to achieve low false-positive rates.
- Participate in on-call rotations and production incident response - contributing to RCA documentation, runbook authoring, and post-incident improvement actions.
- Write comprehensive unit, integration, and end-to-end tests for all owned telemetry components, maintain at least 80% test coverage across assigned services, and proactively identify gaps in existing coverage.
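To make the instrumentation pattern in this track concrete, below is a minimal sketch assuming the opentelemetry-sdk Python package: a custom SpanProcessor acts as the span enricher, and a console exporter stands in for a production OTLP exporter. The service name, agent ID, and the app.agent.* / app.tool.* attribute keys are illustrative placeholders, not established semantic conventions.

```python
# Minimal sketch, assuming the opentelemetry-sdk package is installed.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


class AgentSpanEnricher(SpanProcessor):
    """Span enricher: stamps every span with shared agent metadata at start."""

    def __init__(self, agent_id: str):
        self._agent_id = agent_id

    def on_start(self, span, parent_context=None):
        # Hypothetical attribute key; a real deployment would follow the
        # team's agreed semantic conventions.
        span.set_attribute("app.agent.id", self._agent_id)


provider = TracerProvider(
    resource=Resource.create({"service.name": "agent-service"})  # placeholder name
)
provider.add_span_processor(AgentSpanEnricher(agent_id="agent-123"))
# ConsoleSpanExporter stands in for a production OTLP exporter here.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# A tool call instrumented as a span; trace context propagates automatically
# to any instrumented work done inside the "with" block.
with tracer.start_as_current_span("agent.tool_call") as span:
    span.set_attribute("app.tool.name", "web_search")
    span.set_attribute("app.tool.status", "ok")
```

In a real pipeline the enricher runs ahead of the batch exporter, so every span leaves the process already tagged with shared agent metadata.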
2. Safety, Security & Red Teaming Observability (15%)
- Implement safety-critical signal capture within assigned telemetry pipelines - building reliable instrumentation for guardrail trigger rates, policy violation events, adversarial detection flags, hallucination indicators, and trust boundary crossing alerts (see the signal-capture sketch after this list).
- Build observability components that support red team exercises - instrumenting assigned agent systems to capture adversarial test events, behavioral deviations, and attack surface signals in a measurable, repeatable way.
- Implement secure trace handling patterns within assigned pipelines - applying data masking, PII redaction, and audit-log retention configurations as specified by the security architecture.
- Contribute to the Security Observability Playbook - documenting assigned instrumentation patterns, updating escalation procedures based on observed incidents, and maintaining accuracy of the playbook sections within scope.
- Monitor agent-to-agent protocol traffic (A2A, UCP, AP2) within the assigned domain for anomalous patterns - flagging deviations for review in a timely manner with sufficient diagnostic context.
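As a hedged illustration of the guardrail signal capture described above, the sketch below pairs an OTEL metric counter (feeding trigger-rate dashboards and alerts) with a span event (keeping each trigger attached to its trace). The guardrail.* metric and attribute names are hypothetical, not an agreed convention.

```python
# Sketch of guardrail-trigger capture via the OpenTelemetry metrics/trace APIs.
from opentelemetry import metrics, trace

meter = metrics.get_meter(__name__)

# Counter feeding a guardrail trigger-rate dashboard and alert rule.
guardrail_triggers = meter.create_counter(
    "guardrail.triggers",
    unit="1",
    description="Count of guardrail activations, by policy and severity",
)


def record_guardrail_trigger(policy: str, severity: str) -> None:
    """Emit both a metric (for rates/alerting) and a span event (for trace context)."""
    attrs = {"guardrail.policy": policy, "guardrail.severity": severity}
    guardrail_triggers.add(1, attributes=attrs)
    # Attach the same signal to whatever trace is currently active.
    trace.get_current_span().add_event("guardrail.triggered", attributes=attrs)
```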
3. Responsible AI (RAI) & Governance Signal Instrumentation (10%)
- Implement RAI signal collectors within assigned agent workflows - building reliable pipelines that capture fairness indicators, bias detection outputs, explainability scores, and content safety classifications with validated data quality.
- Maintain RAI telemetry pipelines within scope - ensuring completeness, accuracy, and timeliness of governance signals that feed into compliance dashboards, and resolving data quality issues proactively.
- Ensure all AI decision traces within the assigned domain include required governance metadata and comply with retention policies - contributing to the audit-readiness of the observability platform (see the metadata-check sketch after this list).
- Identify and document RAI signal coverage gaps within the assigned scope - reporting findings to the L11 with sufficient detail to inform remediation planning.
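One simple shape the governance-metadata requirement above could take is a completeness check over decision-span attributes, sketched below; the rai.* keys are hypothetical examples of required metadata, not a real policy.

```python
# Illustrative completeness check for governance metadata on AI decision traces.
REQUIRED_GOVERNANCE_KEYS = {
    "rai.model.version",
    "rai.decision.policy_id",
    "rai.data.retention_class",
}


def missing_governance_metadata(span_attributes: dict) -> set:
    """Return the required governance keys absent from a decision span."""
    return REQUIRED_GOVERNANCE_KEYS - span_attributes.keys()


# Example: a coverage-gap report entry for one sampled decision span.
sampled = {"rai.model.version": "2025-01", "rai.decision.policy_id": "pol-7"}
gaps = missing_governance_metadata(sampled)
if gaps:
    print(f"decision span missing governance metadata: {sorted(gaps)}")
```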
4. Quality Engineering for Agentic Solutions - Post Go-Live & Continuous QE (15%)
- Build and maintain quality gate components within CI/CD pipelines for assigned agent services - implementing regression detection logic, performance degradation alerts, and SLA breach notifications using production observability data (see the quality-gate sketch after this list).
- Instrument and monitor Skill Evaluations (evals) across assigned Memory, Skills, and MCP harness components - collecting eval telemetry, tracking pass/fail trends over time, and alerting on regression thresholds with appropriate context.
- Implement continuous quality monitoring for post-go-live agentic solutions within scope - tracking agent success rates, tool-call fidelity, latency distributions, and user outcome proxies against defined baselines.
- Execute structured testing of new agent capabilities using standardized eval harnesses - documenting results clearly, flagging anomalies, and contributing findings to quality improvement cycles.
- Build and maintain automated quality reports and metric dashboards for assigned areas - ensuring stakeholders have timely, accurate visibility into agent behavioral quality and trend direction.
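A minimal sketch of a CI/CD quality-gate step of the kind described in this track: compare a service's current eval pass rate against its recorded baseline and fail the build on regression. fetch_pass_rate, the baseline, and the tolerance are stand-ins for queries and values that would come from the real observability backend.

```python
# Hedged sketch of a quality gate run as a CI pipeline step.
import sys

BASELINE_PASS_RATE = 0.95   # illustrative recorded baseline
MAX_REGRESSION = 0.02       # illustrative tolerance before the gate fails


def fetch_pass_rate(service: str, window: str = "24h") -> float:
    """Placeholder: in practice this would query Prometheus/Datadog/etc."""
    return 0.91


def quality_gate(service: str) -> int:
    current = fetch_pass_rate(service)
    if current < BASELINE_PASS_RATE - MAX_REGRESSION:
        print(f"FAIL: {service} pass rate {current:.2%} "
              f"regressed past baseline {BASELINE_PASS_RATE:.2%}")
        return 1
    print(f"PASS: {service} pass rate {current:.2%}")
    return 0


if __name__ == "__main__":
    sys.exit(quality_gate("agent-skill-evals"))
```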
5. Memory, Skills, MCP & Harness Engineering Observability (10%)
- Instrument agent memory operations within the assigned scope - building reliable monitoring of read/write latency, cache hit rates, memory staleness, and semantic drift across episodic, semantic, and working memory backends (see the instrumentation sketch after this list).
- Add trace instrumentation to MCP server interactions within assigned components - implementing OTEL semantic tagging for tool registrations, skill invocations, context injections, and result returns.
- Capture telemetry for self-evolving harness and RL system components as assigned - implementing signal capture for reward distributions, policy update events, environment state transitions, and convergence indicators.
- Monitor eval harness execution within assigned scope - building detection for flaky eval environments, setup failures, and result inconsistencies that could obscure real capability regressions.
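As a rough illustration of the memory-operation instrumentation above, the sketch below wraps a memory read with an OTEL latency histogram and a cache hit/miss counter; the agent.memory.* instrument names and the dict-like store interface are assumptions.

```python
# Sketch: latency + cache hit/miss telemetry around an agent memory read.
import time

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# Hypothetical instrument names for memory-operation telemetry.
read_latency = meter.create_histogram(
    "agent.memory.read.latency",
    unit="ms",
    description="Latency of agent memory reads, by memory type and cache result",
)
cache_events = meter.create_counter(
    "agent.memory.cache.events",
    unit="1",
    description="Cache hit/miss events on agent memory reads",
)


def instrumented_read(store, key, memory_type="episodic"):
    """Wrap a memory-store read with latency and cache hit/miss telemetry."""
    start = time.perf_counter()
    value = store.get(key)  # assumes a dict-like memory backend
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    attrs = {
        "memory.type": memory_type,
        "cache.result": "hit" if value is not None else "miss",
    }
    read_latency.record(elapsed_ms, attributes=attrs)
    cache_events.add(1, attributes=attrs)
    return value


# Example usage against an in-memory stand-in for an episodic store.
episodic_store = {"conversation:42": "...summary..."}
instrumented_read(episodic_store, "conversation:42")
```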
6. Python Engineering & Data Science Observability (10%)
- Write production-quality Python for assigned observability components - custom OTEL exporters, signal aggregators, data transformation pipelines, and anomaly detection logic - consistently meeting team engineering standards for code quality, testing, and documentation.
- Apply data science methods to assigned telemetry data - time-series analysis, statistical threshold tuning, distribution characterization - to improve signal accuracy and reduce alerting noise within the assigned domain (see the threshold-tuning sketch after this list).
- Contribute to shared Python SDK and library components - implementing well-tested, documented additions that improve OTEL onboarding experience for agent developers.
- Actively participate in code reviews - both receiving feedback from peers and the L11, and contributing constructive technical review of peer engineers' pull requests within areas of expertise.
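A small sketch of the statistical threshold tuning mentioned above: derive an alert threshold from the empirical distribution of recent latency samples instead of a hand-picked constant. The synthetic data, the p99 quantile choice, and the 10% headroom factor are all illustrative.

```python
# Sketch: data-driven alert threshold from a latency distribution.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a week of per-request latency samples pulled from the backend.
latencies_ms = rng.lognormal(mean=5.0, sigma=0.4, size=10_000)

p99 = np.quantile(latencies_ms, 0.99)
threshold_ms = p99 * 1.10  # 10% headroom above observed p99 to cut false positives

print(f"observed p99 = {p99:.1f} ms -> alert threshold = {threshold_ms:.1f} ms")
```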
7. Agent Fleet, Physical AI & Multi-Modal Observability (5%)
- Implement telemetry for agent fleet coordination components as assigned - building signal capture for spawn/termination events, inter-agent message traces, load distribution metrics, and fleet-level health indicators.
- Contribute to observability instrumentation for physical AI or multi-modal pipelines within the assigned scope - focusing on latency, data quality, and reliability signals as directed by the L11 architecture.
- Document instrumentation patterns for fleet, physical AI, and multi-modal components - ensuring observability approaches are reproducible and transferable to other team members.
8. Agentic Marketplace, Registry & Agent Protocol Observability (5%)
- Instrument assigned Agentic Marketplace and Agent Registry components with usage telemetry - building signal capture for agent invocations, capability health, adoption patterns, and dependency relationships within scope.
- Implement protocol observability for assigned A2A, UCP, and AP2 communication flows - capturing message latency, error rates, retry patterns, and trust boundary events with sufficient granularity for incident diagnosis.
- Contribute to Marketplace Observability Dashboard development - building data connectors, metric calculations, and visualization components for assigned areas as directed.
9. Peer Collaboration, Standards Contribution & Continuous Learning (5%)
- Collaborate actively and constructively with peer Jr. AI Observability Architects - sharing technical knowledge, co-designing solutions to shared problems, and contributing to a high-quality, high-trust team environment.
- Contribute to engineering standards discussions - bringing informed technical perspectives on OTEL conventions, instrumentation patterns, and telemetry design decisions based on hands-on experience in assigned tracks.
- Participate fully in agile ceremonies - sprint planning, stand-ups, retrospectives - contributing accurate estimates, early identification of blockers, and transparent delivery status updates.
- Stay current with evolving OTEL specifications, agent communication protocols, AI safety research, and observability tooling - proactively applying new knowledge to improve the quality and coverage of assigned work.
- Contribute to internal documentation, engineering wikis, and instrumentation guides - ensuring that the approaches used in assigned tracks are clearly documented and accessible to the broader team.
Qualifications
- Bachelor's or Master's degree in Computer Science, Software Engineering, AI/ML, Data Science, or a related technical field.
- 5-10 years of professional software and AI/ML engineering or platform engineering experience, with at least 2 years of hands-on observability, distributed systems monitoring, or telemetry pipeline development.
- Demonstrated experience delivering production-grade software end-to-end - from design through deployment and on-call operation - in a collaborative team environment.
- Experience working in or adjacent to AI/ML platform, data engineering, or cloud infrastructure roles; exposure to agentic AI systems or LLM pipelines is a strong plus.
- Technical Proficiency: Able to implement monitoring for agent failure modes (tool-call failures, infinite loops, timeouts, hallucination risk signals, retrieval misses, and degraded response quality); create alerts aligned to operational SLOs (availability, latency, tool reliability) and AI-specific indicators (cost spikes, loop bursts, retrieval anomalies); support guardrail observability (policy blocks, content filtering events, and safety classifier outcomes where applicable); and build onboarding automation (IaC, templates, CI checks) that makes observability default-on for all agentic services.
- Problem-Solving: Ability to translate business challenges into technical solutions.
- Collaboration Skills: Effective at working within cross-functional teams.
- Agility: Flexibility to adapt to changing requirements and new technologies.
- Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.
- Observability & OpenTelemetry: Solid hands-on proficiency with OpenTelemetry (OTEL) SDK instrumentation - custom exporters, collector configuration, semantic conventions, and distributed trace propagation. Able to independently instrument a service and validate signal quality end-to-end.
- Python Engineering: Strong Python development skills - clean, well-tested, production-ready code. Familiarity with async patterns, type hints, testing frameworks (pytest), and CI/CD integration. Able to build and maintain Python-based telemetry tooling with minimal guidance.
- Distributed Systems: Good working knowledge of microservices, event streaming (Kafka or equivalent), REST/gRPC API design, and containerized deployment (Docker, Kubernetes). Able to reason about distributed failure modes and their observability implications.
- Cloud Platforms: Hands-on experience with at least one major cloud provider (Azure, AWS, or GCP) - managed services, IAM, storage, and cost awareness sufficient to make responsible deployment decisions.
- Data Analysis Applied to Telemetry: Ability to query, analyze, and interpret time-series and log data using Grafana, Datadog, Prometheus, Splunk, or equivalent - including threshold tuning and basic statistical interpretation of signal distributions.
- CI/CD & DevOps: Working experience with CI/CD pipelines, GitOps practices, automated testing, and infrastructure-as-code concepts sufficient to contribute to and extend existing pipeline configurations.
- AI/ML Awareness: Familiarity with LLM-based workflows, agentic AI concepts, and common agent patterns (tool/function calling, RAG, memory, multi-step planning) - sufficient to understand observability requirements without needing deep ML expertise.
- Safety & Security Fundamentals: Basic understanding of AI safety concepts (guardrails, policy enforcement, prompt injection) and data security practices (PII handling, access control, audit logging) as applied to telemetry systems.
- Quality Engineering Basics: Familiarity with software quality concepts - regression detection, eval frameworks, test harnesses - and the ability to implement quality gate components within CI/CD pipelines using observability data.
- RAI Awareness: Working knowledge of Responsible AI principles - fairness, explainability, bias - sufficient to implement signal capture pipelines for RAI governance requirements as specified.
Preferred Qualifications
- Direct experience with agentic AI frameworks such as LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI, or Bedrock Agents.
- Familiarity with MCP (Model Context Protocol), A2A, UCP, or AP2 agent communication protocols.
- Exposure to reinforcement learning concepts, RL training infrastructure, or self-supervised learning pipelines.
- Experience contributing to or consuming developer-facing Python SDKs or observability libraries.
- Background with vector databases (Pinecone, Weaviate, pgvector) or semantic search in the context of RAG pipeline observability.
- Contributions to open-source observability or AI tooling projects.
- Familiarity with AI safety frameworks, adversarial ML concepts, or red team tooling applied to LLM systems.