Location
Hybrid / Remote (global, aligned to target customer time zones)
Role Type
Full-time | Principal Level
Role Overview
Centific's DAC (Digital Architecture & Cognitive) Command is expanding its global architecture unit to build and operationalize agentic, AI-driven business automation at production scale.
In this role, you will act as the end-to-end design authority for agentic inference solutions, owning outcomes from blueprint to live operations. You will architect multi-agent systems, runtime orchestration, and operational guardrails that meet demanding non-functional requirements (latency, reliability, cost, and security).
This is a hands-on role. You will prototype reference implementations, tune runtime behavior, and partner with engineering, platform, security, and product stakeholders to deliver production-first agentic systems.
Key Responsibilities
1. Agentic System Architecture & Orchestration
- Design multi-agent architectures (planner-executor, supervisor loops, routing/dispatch, delegation, reflection/verification patterns) aligned to business workflows.
- Define orchestration mechanisms for state/session handling, memory (short/long-term), tool invocation, retrieval/RAG, and structured I/O.
- Establish standards for prompt/agent templates, tool/skill contracts, agent-to-agent messaging, and deterministic fallbacks.
- Create reference implementations that teams can extend safely (agent frameworks, orchestration services, reusable libraries).
2. NFR-Driven Design for Production Inference
- Own non-functional design (latency, throughput, scalability, reliability, availability, cost) as first-class requirements.
- Design for performance and cost: token budgeting, caching strategies, batching, streaming responses, concurrency controls, and adaptive routing.
- Define resilience patterns: timeouts, retries, circuit breakers, idempotency, queue back-pressure, graceful degradation, and safe-mode behavior.
- Drive architecture decisions that balance quality vs. cost vs. speed, documenting trade-offs and expected SLOs/SLAs.
3. Solution Blueprint Ownership & End-to-End Delivery
- Own the end-to-end solution blueprint from concept through production rollout (architecture, integration, testing, operations).
- Translate business intent into system decomposition (services, agents, tools, data flows) with clear ownership boundaries and contracts.
- Collaborate with Solution Blueprint Architects, Platform Architects, Data/Governance, and Security/Compliance to align constraints early.
- Deliver architecture artifacts: sequence diagrams, decision records (ADRs), integration specs, runbooks, acceptance criteria, and launch checklists.
4. Integration Governance & Platform Compatibility
- Set integration standards for APIs/events (versioning, compatibility contracts, error semantics, schema governance).
- Define interfaces for tool invocation (capabilities registry, permissions, rate limits, safe parameterization).
- Ensure agentic systems integrate cleanly with enterprise platforms (IAM, logging, monitoring, workflow engines, data platforms).
- Partner with enterprise architecture to ensure interoperability across domains and prevent fragmentation.
5. Operational Readiness & Reliability
- Design and enforce operational guardrails: monitoring, alerting, evaluation hooks, rollback plans, and safety kill-switches.
- Establish runbooks for incident response, model/agent degradation, and dependency failures (tools, data sources, external APIs).
- Define observability standards for agent traces, tool calls, prompts/responses, evaluation scores, and cost telemetry.
- Lead postmortems and reliability improvements; ensure corrective actions are implemented and verified.
6. Technical Leadership & Enablement
- Act as a principal technical leader, aligning cross-functional teams on architecture, roadmap, and delivery priorities.
- Mentor engineers/architects on agentic design patterns, evaluation, and production hardening.
- Drive reuse: shared components, gold-standard reference flows, and platform primitives that accelerate delivery.
- Contribute to architecture councils/design reviews; influence standards and best practices across DAC Command.
Required Experience & Skills
Core Experience
- 10-15+ years in software/platform engineering with 5+ years in solution/AI/platform architecture roles.
- Proven delivery of production-grade AI/LLM systems (not just prototypes), including operational ownership considerations.
- Strong background in distributed systems, API/event-driven integration, and reliability engineering.
Agentic AI & LLM Runtime Expertise (Hands-On)
- Deep experience with agentic patterns: multi-agent coordination, planning, tool calling, routing, memory, and state management.
- Experience optimizing LLM inference: caching, batching, token/latency management, throughput tuning, and quality-cost trade-offs.
- Strong understanding of evaluation strategies (offline/online), prompt/agent regression testing, and release gates.
- Familiarity with common orchestration frameworks and patterns (e.g., graph-based agent flows, tool registries, function calling).
Platform & Operations
- Strong cloud-native architecture experience (AWS/Azure/GCP), microservices, event streaming, and container/Kubernetes ecosystems.
- Hands-on with observability stacks (logs/metrics/traces), SLO/error budgets, incident response practices, and postmortems.
- Ability to design secure-by-default tool access patterns (least privilege, scoped tokens, auditability).
Soft Skills & Ways of Working
- Production-first mindset: design for operability, safety, and reliability from day one.
- Strong systems thinking: can reason across product, platform, data, security, and cost dimensions.
- Clear communicator: able to explain architecture trade-offs to engineers, product, and executive stakeholders.
- Bias for action: prototypes quickly, then codifies reusable standards and reference implementations.
- Collaborative leadership: aligns teams without relying on formal authority.
Nice-to-Have / Preferred
- Experience with large-scale workflow orchestration and automation platforms (BPM/workflow engines, event-driven pipelines).
- Experience implementing agent observability and evaluation harnesses at scale.
- Background in regulated environments (SOC2, HIPAA, PCI, CJIS) and designing AI systems with audit-ready traces.
- Open-source contributions, talks, or published work in agentic systems, LLM infrastructure, or reliability engineering.
What Success Looks Like (First 12-18 Months)
- Agentic reference architectures and runtime standards are adopted across DAC Command deliveries.
- Production deployments meet defined SLOs for latency, availability, and cost; incident rates decline over time through reliability improvements.
- Reusable orchestration primitives (routing, memory, tool registry, evaluation hooks) accelerate new use cases and reduce duplication.
- Integration governance prevents fragmentation: APIs/events are versioned, compatible, and observable.
- Teams trust the platform: safe rollouts, clear runbooks, and measurable quality/cost improvements are in place.