Staff Software Engineer
Overview
About Business Unit:
The Product team forms the crux of our powerful platforms and helps connect millions of customers worldwide with the brands that matter most to them. This team of innovative problem solvers develops and builds products that position Epsilon as a differentiator, encouraging an open and balanced marketplace built on respect for individuals, where every brand interaction holds value. Our full-cycle product engineering and data teams chart the future and set new benchmarks for our products by using industry-standard methodologies and sophisticated capabilities in data, machine learning, and artificial intelligence. Driven by a passion for delivering smart end-to-end solutions, this team plays a key role in Epsilon's success story.
The Opportunity:
This is a founding-engineer-level role in a PMO context. You will be among the first people to define what AI-augmented program management looks like inside a large Global Capability Centre (GCC). The tools you build will be used daily by 300+ engineers, Scrum Masters, and senior leaders. The scope will grow as the team scales.
Epsilon's Agile Centre of Excellence (COE) has already deployed RAG-based backlog analysis and Cursor AI reporting. You are stepping into a team with real tooling ambition, institutional support, and an organization that thinks in agentic architectures.
What You Will Build
As one of the early engineers joining this team, you will own the tooling of our agentic AI platform. Immediate priorities include:
- AI-Enabled Intelligence Dashboard: a real-time program health surface combining delivery predictability, backlog quality scores, risk signals, and AI-generated actionable insights for engineering and product customers.
- RAG-Powered Backlog Analysis: a Retrieval-Augmented Generation pipeline that ingests JIRA/Bitbucket/Confluence artefacts and surfaces requirement intake readiness, backlog gaps, dependency risks, and engineering delivery health.
- Developer Productivity Instruments: integrations with Cursor/AWS AI APIs (commits, /changes) to measure AI-generated code ratios across engineering teams and feed COE objectives and key results.
- Multi-Agent Orchestration Layer: LangChain / LangGraph agent chains that autonomously gather data, call internal APIs, synthesize insights, and report them.
- LLM Gateway & Cost Controls: token budgeting, model routing (GPT-4o, Claude, Gemini), caching, and observability.
Responsibilities
Agentic AI Engineering
- Design and build production-grade agentic SDLC workflows, from prompt design through tool integration to autonomous execution loops.
- Implement RAG pipelines (vector ingestion, chunking strategy, embedding tuning, retrieval evaluation) over enterprise knowledge bases (JIRA, Confluence, internal wikis).
- Integrate LLM APIs (OpenAI, Anthropic, Azure OpenAI, Gemini) with proper rate-limit handling, cost governance, and fallback routing.
- Build and maintain multi-modal agent tool libraries: JIRA query agents, Confluence fetch agents, GitHub/Bitbucket diff agents, and calendar-aware sprint agents.
Full-Stack Dashboard Development
- Develop front-ends for delivery dashboards with real-time data, drill-down charts, and actionable AI insights.
- Build backends that aggregate data from internal planning tools and serve it to the front-end and agent layer.
- Design and maintain database schemas for program metrics, sprint snapshots, and model response logs.
Platform & MLOps
- Containerize and deploy services on AWS (ECS / Lambda / S3) or Azure using Docker and CI/CD pipelines (GitHub Actions, Jenkins).
- Instrument LLM observability: latency P95, token spend, hallucination-flag rates, user feedback loops.
- Maintain version control and A/B testing for prompts and model configurations so that model behaviour can be iterated on without redeployment.
Collaboration & COE Enablement
- Partner with Program Managers, Product Owners, and Scrum Masters to translate delivery challenges into agent-addressable requirements.
- Document agent designs, API contracts, and RAG architecture decisions in Confluence for reuse.
- Contribute to Agile COE metrics strategy: co-define key performance indicators (KPIs) and ensure the data pipelines feeding dashboards are audit-ready.
Qualifications
Must-Have Qualifications:
Agentic AI (Non-Negotiable Core)
- 2+ years of hands-on experience building LLM-powered applications in production, not prototypes.
- Proficiency with at least one agent framework: LangChain, LangGraph, AutoGen, CrewAI, or Semantic Kernel.
- Practical RAG experience: chunking, embedding, vector stores, and retrieval evaluation.
- Working experience integrating OpenAI, Anthropic, or Azure OpenAI APIs — tool/function calling, structured outputs, streaming.
- Experience designing prompt templates, system prompts, and chain-of-thought scaffolding for reliability at scale.
Backend Engineering
- 6-8+ years with Python (preferred) or Java for API and service development.
- REST API design and implementation using FastAPI, Flask, Spring Boot, or Node.js/Express.
- SQL fluency: query optimisation, schema design, and data modelling across PostgreSQL, MySQL, or Snowflake.
- Solid understanding of async patterns, message queues (Kafka, SQS, RabbitMQ), and event-driven architectures.
Frontend
- React + TypeScript with hooks, context, and component composition patterns.
- Data visualisation with Recharts, Chart.js, or D3 for operational dashboards.
- Experience consuming streaming REST or WebSocket APIs for real-time UI updates.
Cloud & DevOps
- AWS or Azure: compute (EC2/ECS/Lambda or AKS), storage (S3/Blob), managed databases (RDS/Cosmos).
- Docker, container orchestration basics, and CI/CD pipeline ownership.
- Monitoring and alerting with CloudWatch, Datadog, or equivalent.
Good to Have:
- Experience with JIRA APIs, Confluence REST API, or GitHub/Bitbucket APIs for programmatic data extraction.
- Familiarity with Cursor AI, GitHub Copilot Metrics, or developer productivity instrumentation tooling.
- Knowledge of LLM evaluation frameworks: LangSmith, PromptFlow, or RAGAS.
- Background in Agile delivery metrics — velocity, cycle time, defect escape rate, DORA metrics.
- Exposure to knowledge graph or graph database approaches for dependency mapping.
- Prior work in a GCC, captive centre, or large multi-geography engineering org.
Behavioural Attributes:
- Builder Attitude: You ship. Prototypes become products. You are not satisfied until it is in production and being used.
- Curious & Self-Directed: The agentic AI landscape changes weekly. You track it, experiment, and bring learnings back to the team.
- Stakeholder Translator: You can walk a Program Manager through what a RAG pipeline does and why the retrieval quality matters to their sprint review.
- High Ownership: Small team. You own your services end-to-end: design, deploy, monitor, iterate. No ticket-and-forget culture.