About the job
Role Overview
As the AI Systems Architect, you'll own the end-to-end design and delivery of production-grade agentic and Generative AI systems. This is a highly hands-on role requiring deep architectural insight, coding proficiency, and an obsession with performance, scalability, and reliability. You'll architect secure, cost-efficient AI platforms on AWS, guide developers through complex debugging and optimization, and ensure all systems are observable, governed, and production-ready.
Key Responsibilities
- Architect Production AI Systems: Design robust architectures for agentic systems (planning, reasoning, tool-calling), GenAI/RAG pipelines, and evaluation workflows. Create detailed design documents, including flow/UML/sequence diagrams and AWS deployment topologies.
- Optimize for Cost & Performance: Model throughput, latency, concurrency, autoscaling, CPU/GPU sizing, and vector index performance to ensure scalable, efficient deployments.
- Lead Debugging & Stability Efforts: Conduct deep-dive debugging, fix critical defects, and resolve production incidents; pair-program with developers to improve code quality and performance.
- Standardize Agentic Frameworks: Build reference implementations using Semantic Kernel (preferred), LangGraph, AutoGen, or CrewAI with strong schema validation, grounding, and memory management.
- Engineer Retrieval & Search Systems: Architect hybrid retrieval solutions including ingestion, chunking, embeddings, ranking, caching, and freshness management while minimizing hallucination risk.
- Productionize on AWS: Deploy and manage systems using Amazon EKS, Bedrock, S3, SQS/SNS, RDS, and ElastiCache. Integrate IAM/Okta, Secrets Manager, and Datadog for observability, enforcing SLIs/SLOs and error budgets.
- Implement Observability & Monitoring: Set up distributed tracing, metrics, and logging via OpenTelemetry and Datadog. Standardize dashboards, alerts, and incident response workflows.
- Govern Evaluation & Rollouts: Build test and evaluation frameworks (golden sets, A/B experiments, regression suites, and controlled rollouts) to ensure consistent quality across releases.
- Embed Security & Safety: Enforce least privilege, PII protection, and policy compliance through threat modeling, sandboxed execution, and prompt-injection defense.
- Establish Engineering Standards: Create reusable SDKs, connectors, CI/CD templates, and architecture review checklists to promote consistency across teams.
- Cross-Functional Leadership: Collaborate with product, data, and SRE teams for capacity planning, DR strategies, and post-incident RCA reviews. Mentor engineers to strengthen design and reliability practices.
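To make the retrieval responsibilities above concrete, here is a minimal, hypothetical sketch of the chunk–embed–rank loop a RAG pipeline involves. The embedding here is a deterministic toy stand-in (a real system would call an embedding model, e.g. via Bedrock), and all names are illustrative rather than from any specific framework.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size character chunks (ingestion step)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(chunk_text: str, dims: int = 8) -> list[float]:
    """Toy deterministic embedding: character counts folded into `dims`
    buckets, then L2-normalized. Stands in for a real embedding model."""
    vec = [0.0] * dims
    for ch in chunk_text.lower():
        vec[ord(ch) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Hit:
    text: str
    score: float


def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[Hit]:
    """Rank chunks by cosine similarity to the query (ranking step)."""
    q = embed(query)
    hits = [Hit(c, sum(a * b for a, b in zip(q, embed(c)))) for c in corpus]
    return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]


docs = chunk("Agentic systems plan, reason, and call tools. "
             "RAG pipelines ground answers in retrieved context.")
results = retrieve("tool calling agents", docs)
```

In production the same shape holds, with the toy pieces swapped for a real chunker, an embedding model, and a vector index (e.g. OpenSearch or Pinecone) behind `retrieve`.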
Must-Have Qualifications
- 7–10 years in software/AI engineering, including 4+ years in GenAI application development and 2+ years architecting agentic AI systems.
- Expert in Python 3.11+ (asyncio, typing, packaging, profiling, pytest).
- Hands-on experience with Semantic Kernel, LangGraph, AutoGen, or CrewAI.
- Proven delivery of GenAI/RAG systems on AWS Bedrock or equivalent vector-based platforms (OpenSearch Serverless, Pinecone, Redis).
- Deep understanding of AWS ecosystem: EKS, Bedrock, S3, SQS/SNS, RDS, ElastiCache, Secrets Manager, IAM/Okta, Kong API Gateway, Datadog.
- Expertise in observability and incident management using OpenTelemetry and Datadog.
- Strong focus on cost, performance, and security engineering: FinOps mindset, autoscaling, caching, and policy enforcement.
- Exceptional communication: clear diagrams, ADRs, and peer review practices.
Nice-to-Have Skills
- Multi-agent orchestration (task decomposition, coordinator-worker, graph-based planning).
- Expertise with vector databases (OpenSearch, Pinecone, pgvector, Redis).
- Experience with AI evaluation, guardrails, and rollout gating.
- Familiarity with frontend agent interfaces, secure APIs, and AuthN/Z best practices.
- Exposure to policy-as-code, multi-tenant architectures, and feature management (Kong, LaunchDarkly, Flipt).
- Experience with CI/CD via GitHub Actions and IaC (Terraform/AWS CloudFormation).