Overview
The Junior AI Observability Architect is an execution-focused engineer who designs, builds, and operates observability capabilities within a defined domain of the enterprise AI observability platform. Working under the strategic direction of the Senior AI Observability Architect, this role translates architecture blueprints into production-grade instrumentation, telemetry pipelines, dashboards, quality gates, and safety signals across agentic AI systems.
The junior architect is a hands-on engineer who codes, integrates, tests, and iterates - owning feature-level delivery within one or more specialization tracks while developing a growing understanding of the full observability platform. They are a technical practitioner first, with an emerging architect mindset.
Responsibilities
1. Observability Platform Engineering & OTEL Integration (25%)
- Implement OpenTelemetry (OTEL) instrumentation within assigned agent frameworks or platforms - including custom exporters, span enrichers, semantic conventions, and context propagation hooks (see the sketch after this list).
- Build and maintain telemetry pipeline components (collectors, processors, exporters) that route metrics, logs, traces, and semantic signals to central observability backends.
- Integrate OTEL with enterprise agentic platforms as assigned - which may include Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or internal frameworks - following architecture blueprints set by the L11.
- Develop and maintain observability dashboards, alerting rules, and SLO/SLA definitions for the assigned sub-domain, ensuring signal quality and low false-positive rates.
- Participate in on-call rotations and incident response for the observability platform - contributing to RCA documentation and runbook improvement.
- Write unit, integration, and end-to-end tests for all telemetry components, maintaining 80% test coverage across owned services.
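A minimal sketch of this kind of instrumentation, assuming the Python OpenTelemetry SDK; the ai.agent.* and ai.tool.* attribute names and the invoke_tool wrapper are illustrative placeholders rather than an established semantic convention:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter stands in for the real OTLP exporter wired to the backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.observability")

def invoke_tool(agent_id: str, tool_name: str, payload: dict) -> dict:
    """Execute an agent tool call and emit a span describing the interaction."""
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("ai.agent.id", agent_id)
        span.set_attribute("ai.tool.name", tool_name)
        span.set_attribute("ai.tool.input_size", len(str(payload)))
        result = {"status": "ok"}  # placeholder for the real tool invocation
        span.set_attribute("ai.tool.status", result["status"])
        return result
```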
2. Safety, Security & Red Teaming Support (15%)
- Instrument safety-critical signal capture within assigned pipelines - including guardrail trigger rates, policy violation events, prompt injection detections, and hallucination flags (a sketch follows this list).
- Support red team exercises by building observability hooks that capture adversarial test results, attack surface telemetry, and behavioral deviation signals in real time.
- Implement secure trace handling for sensitive AI decision events - applying data masking, PII redaction, and audit-log retention policies as defined by the security architecture.
- Assist in maintaining the Security Observability Playbook - documenting findings, updating escalation paths, and contributing to incident classification procedures.
- Monitor agent-to-agent protocol traffic (A2A, UCP, AP2) for anomalous communication patterns and flag deviations for review by the L11 architect and security team.
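One way such safety signals can be captured, sketched with the Python OTEL metrics API; the metric name, attribute keys, and record_guardrail_event helper are assumptions for illustration only:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Console exporter stands in for the real backend exporter.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("agent.safety")

guardrail_triggers = meter.create_counter(
    "ai.guardrail.triggers",
    description="Count of guardrail activations by policy and severity",
)

def record_guardrail_event(policy: str, severity: str) -> None:
    """Increment the trigger counter with dimensions for downstream dashboards."""
    guardrail_triggers.add(1, {"policy": policy, "severity": severity})
```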
3. Responsible AI (RAI) & Governance Signal Instrumentation (10%)
- Implement RAI signal collectors within assigned agent workflows - capturing fairness indicators, bias detection outputs, explainability scores, and content safety classifications.
- Maintain RAI telemetry pipelines and ensure data quality, completeness, and timeliness of governance signals feeding into compliance dashboards.
- Contribute to audit-readiness work by ensuring all AI decision traces within the assigned domain include required governance metadata and are retained per policy.
- Support gap analyses by comparing current RAI signal coverage against governance framework requirements and flagging coverage gaps to the L11.
4. Quality Engineering for Agentic Solutions - Post Go-Live & Continuous QE (15%)
- Build and maintain quality gate components within CI/CD pipelines - using observability data to detect performance regressions, behavioral drift, and SLA breaches before they reach production (see the sketch after this list).
- Instrument and monitor Skill Evaluations (evals) across the Memory, Skills, and MCP harness stack - collecting eval results, tracking pass/fail trends, and alerting on regression thresholds.
- Implement continuous quality monitoring for post-go-live agentic solutions - tracking agent success rate, tool-call fidelity, latency distributions, and user outcome proxies.
- Conduct structured testing of new agent capabilities using standardized eval harnesses - documenting results and feeding findings into quality improvement cycles.
- Develop automated quality reports and quality metric dashboards for stakeholder review, surfacing trends and anomalies in agent behavior over time.
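A hypothetical example of such a gate: a small check that fails the pipeline when the eval pass rate drops below a regression threshold. The results-file format, field names, and threshold value are assumptions, not a defined standard:

```python
import json
import sys

PASS_RATE_THRESHOLD = 0.95  # assumed SLO; tune per agent capability

def check_eval_results(path: str) -> int:
    """Return a CI exit code based on the eval pass rate in a results file."""
    with open(path) as f:
        results = json.load(f)  # assumed shape: [{"eval": "...", "passed": true}, ...]
    if not results:
        print("no eval results found - failing the gate")
        return 1
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / len(results)
    print(f"eval pass rate: {pass_rate:.2%} ({passed}/{len(results)})")
    return 0 if pass_rate >= PASS_RATE_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(check_eval_results(sys.argv[1]))
```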
5. Memory, Skills, MCP & Harness Engineering Observability (10%)
- Instrument agent memory operations (read/write latency, cache hit rates, memory drift) across episodic, semantic, and working memory backends within the assigned scope.
- Add trace instrumentation to MCP server interactions - tagging tool registrations, skill invocations, context injections, and result returns with semantic OTEL attributes (a sketch follows this list).
- Capture harness execution telemetry for self-evolving and RL systems - logging reward signals, policy update events, environment transitions, and convergence indicators.
- Monitor skill eval harness execution pipelines - detecting flaky evals, environment setup failures, and result inconsistencies that could mask real capability regressions.
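A sketch of what tagging an MCP tool invocation could look like, assuming a tracer provider is configured elsewhere; the mcp.* attribute names and the traced_mcp_call wrapper are illustrative assumptions rather than part of the MCP specification:

```python
from opentelemetry import trace

# get_tracer returns a no-op tracer if no provider is configured, so this is
# safe to import into an MCP client wrapper before the SDK is wired up.
tracer = trace.get_tracer("mcp.observability")

def traced_mcp_call(server_name: str, tool_name: str, arguments: dict):
    """Wrap a single MCP tool invocation in a span with semantic attributes."""
    with tracer.start_as_current_span("mcp.tool_invocation") as span:
        span.set_attribute("mcp.server.name", server_name)
        span.set_attribute("mcp.tool.name", tool_name)
        span.set_attribute("mcp.tool.argument_count", len(arguments))
        result = {"content": []}  # placeholder for the real MCP client call
        span.set_attribute("mcp.tool.success", bool(result))
        return result
```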
6. Data Science & Python Engineering (10%)
- Write production-grade Python for observability tooling - custom OTEL exporters, signal aggregators, anomaly detectors, and data transformation pipelines - adhering to team engineering standards.
- Apply basic statistical and data science methods to telemetry data - time-series analysis, threshold tuning, distribution characterization - to improve signal quality and alerting precision (see the sketch after this list).
- Contribute to Python SDK and library development that simplifies OTEL onboarding for agent developers across the organization.
- Participate in code reviews, apply test-driven development practices, and continuously improve the quality and maintainability of the observability codebase.
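A minimal sketch of statistical threshold tuning on latency telemetry, using only the Python standard library; the window size, minimum baseline, and k multiplier are assumed starting points rather than validated alert settings:

```python
import statistics
from collections import deque

class LatencyAnomalyDetector:
    """Flag latency samples far above a rolling baseline (assumed defaults)."""

    def __init__(self, window: int = 200, k: float = 3.0):
        self.samples: deque[float] = deque(maxlen=window)
        self.k = k

    def observe(self, latency_ms: float) -> bool:
        """Return True when the sample exceeds mean + k * stdev of the window."""
        is_anomaly = False
        if len(self.samples) >= 30:  # require a minimal baseline before alerting
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            is_anomaly = stdev > 0 and latency_ms > mean + self.k * stdev
        self.samples.append(latency_ms)
        return is_anomaly
```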
7. Agent Fleet, Physical AI & Multi-Modal Observability (5%)
- Implement telemetry for agent fleet coordination - capturing spawn/termination events, inter-agent communication traces, load distribution metrics, and fleet health indicators.
- Contribute to observability instrumentation for physical AI pipelines (edge inference, sensor fusion, robotics control loops) as directed - focusing on latency, reliability, and data quality signals.
- Add OTEL instrumentation to multi-modal model pipelines - tracing vision, audio, and text input processing stages and capturing cross-modal alignment quality signals.
8. Agentic Marketplace, Registry & A2A / UCP / AP2 Observability (5%)
- Instrument the Agentic Marketplace and Agent Registry with usage telemetry - tracking agent invocations, capability health scores, adoption trends, and dependency relationships.
- Implement protocol-level observability for A2A (Agent-to-Agent), UCP, and AP2 communication flows - capturing message latency, error rates, retry patterns, and trust boundary crossings (a sketch follows this list).
- Contribute to Marketplace Observability Dashboard development - building data connectors, metric calculations, and visualization components.
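One possible shape for such protocol-level metrics, sketched with the Python OTEL metrics API and assuming a meter provider is configured elsewhere; the metric names and attribute keys are assumptions:

```python
from opentelemetry import metrics

# Assumes a MeterProvider has been configured elsewhere in the collector setup.
meter = metrics.get_meter("a2a.observability")

message_latency = meter.create_histogram(
    "a2a.message.latency",
    unit="ms",
    description="End-to-end latency of agent-to-agent messages",
)
message_retries = meter.create_counter(
    "a2a.message.retries",
    description="Retry attempts by protocol and peer agent",
)

def record_message(protocol: str, peer_agent: str, latency_ms: float, retries: int) -> None:
    """Record latency and retry telemetry for one delivered message."""
    attrs = {"protocol": protocol, "peer.agent": peer_agent}
    message_latency.record(latency_ms, attrs)
    if retries:
        message_retries.add(retries, attrs)
```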
9. Collaboration, Integration & Continuous Learning (5%)
- Collaborate closely with AI platform engineers, data scientists, SRE, and product teams to gather requirements, align on telemetry standards, and resolve integration friction.
- Participate in agile ceremonies - sprint planning, stand-ups, retrospectives - contributing to estimation, dependency identification, and delivery transparency.
- Stay current with emerging observability frameworks, OTEL specifications, agent communication protocols, and AI safety research - sharing learnings with the team regularly.
- Contribute to internal documentation, engineering wikis, and onboarding guides for the observability platform.
Qualifications
- Bachelor's or Master's degree in Computer Science, Software Engineering, AI/ML, Data Science, or a related technical field.
- 11+ years of experience in software engineering, platform engineering, or data engineering - with at least 2 years of hands-on work in observability, monitoring, or distributed systems.
- Demonstrated ability to deliver production-grade software in a team environment, with a track record of completing complex technical features end-to-end.
- Python Proficiency: Strong Python engineering skills - writing clean, testable, maintainable production code; familiarity with async patterns, type hints, and modern Python tooling (Poetry, Ruff, pytest).
- Observability Fundamentals: Solid working knowledge of the three pillars of observability (metrics, logs, traces); ability to instrument services with OpenTelemetry (OTEL) SDKs; understanding of trace context propagation and semantic conventions.
- Distributed Systems: Working knowledge of microservices, event streaming (Kafka or equivalent), REST/gRPC APIs, and containerized deployment (Docker, Kubernetes).
- Cloud Platforms: Hands-on experience with at least one major cloud provider (Azure, AWS, or GCP) - including managed services, IAM basics, and cost awareness.
- CI/CD & DevOps: Experience building or contributing to CI/CD pipelines; familiarity with GitOps, infrastructure-as-code concepts, and automated testing frameworks.
- Data Fundamentals: Ability to query, analyze, and visualize time-series and log data using tools such as Grafana, Datadog, Splunk, Prometheus, or equivalent.
- Hands-on experience with agentic AI frameworks (LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI, or equivalent).
- Contributions to open-source observability projects or the OTEL community.
- Familiarity with reinforcement learning concepts, self-supervised learning, or model fine-tuning workflows.
- Experience with security tooling relevant to AI (adversarial robustness libraries, LLM safety frameworks, or red-team toolkits).
- Exposure to Responsible AI frameworks, fairness evaluation libraries (Arize, Fairlearn, AI Fairness 360), or explainability tools (SHAP, LIME).
- Experience in a fast-paced AI platform, MLOps, or LLMOps role with production deployment responsibilities.