
Role - Director of AI Data
CiteWorks Studio is hiring a Director of AI Data to lead the development of datasets and data infrastructure used to study how large language models retrieve information, generate answers, and cite sources.
This leadership role focuses on building large-scale data pipelines that collect and analyze AI responses across systems such as ChatGPT, Claude, Gemini, Perplexity, and open-source large language models.
What Is AI Data Infrastructure
AI data infrastructure refers to the systems used to collect, process, organize, and analyze the data that powers machine learning and artificial intelligence models.
For large language models, AI data infrastructure may include:
prompt-response datasets
model evaluation datasets
citation extraction pipelines
retrieval benchmarking datasets
large-scale training data collections
These systems allow researchers to study how AI models generate answers and retrieve knowledge.
What Does a Director of AI Data Do
A Director of AI Data leads the strategy and development of data systems used for machine learning research and AI analysis.
The role focuses on building the datasets and pipelines required to analyze the behavior of large language models.
This includes developing systems that collect and structure:
AI-generated responses
prompt testing datasets
citation data
entity recognition signals
generative search outputs
The Director ensures that researchers and engineers have the data needed to analyze how AI systems retrieve, synthesize, and cite information.
About CiteWorks Studio
CiteWorks Studio is an AI research and generative engine optimization (GEO) firm focused on understanding how large language models retrieve and cite information.
Modern AI systems such as ChatGPT, Gemini, Claude, and Perplexity increasingly function as the primary interface for information discovery. Instead of ranking links like traditional search engines, these systems generate answers by retrieving and synthesizing knowledge from multiple sources.
CiteWorks Studio studies this transformation and helps organizations understand:
how AI systems determine trusted sources
how citation patterns appear inside AI-generated answers
how knowledge graphs influence model responses
how organizations become trusted references in generative search systems
Our research focuses on AI citation intelligence, generative search benchmarking, and LLM retrieval systems.
Key Responsibilities
The Director of AI Data will lead the development of large-scale datasets used to analyze how generative AI systems behave.
Responsibilities include:
building data pipelines that collect AI responses across multiple LLM platforms
designing datasets used to benchmark generative AI systems
developing systems that extract citations from AI-generated answers
creating structured datasets used to analyze retrieval patterns
managing prompt testing datasets used in AI evaluation
collaborating with machine learning researchers and engineers to support AI benchmarking systems
The role also involves developing the data infrastructure needed to analyze AI citation behavior and generative search systems at scale.
Why AI Data Infrastructure Matters
Large language models generate answers from patterns learned from large training datasets and, in retrieval-augmented systems, from external knowledge sources retrieved at query time.
Understanding how these systems behave requires structured datasets that capture:
model responses across prompts
citations included in AI answers
variability between models
hallucination patterns
knowledge retrieval behavior
AI data infrastructure enables researchers to analyze how generative AI systems retrieve and use information.
Data Systems This Role Will Build
The Director will help design data systems used to analyze the behavior of AI models.
Prompt Response Datasets
Large collections of prompts and AI-generated answers used to study model behavior.
Citation Extraction Systems
Pipelines that identify and record sources cited inside AI-generated responses.
Retrieval Benchmark Datasets
Datasets used to analyze how AI models retrieve information from different sources.
Cross-Model Comparison Data
Data used to compare outputs from multiple AI systems.
Knowledge Graph Signal Datasets
Structured datasets used to analyze how entities and sources appear in AI responses.
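As a rough illustration of the first two systems above, the following is a minimal Python sketch of a prompt-response record paired with a simple citation extractor. It is not CiteWorks Studio's actual implementation: the `ResponseRecord` layout, the field names, and the URL-based extraction rule are all assumptions made for illustration, and a production pipeline would likely need to handle footnote-style and platform-specific citation formats as well.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical record layout; the field names are illustrative assumptions.
@dataclass
class ResponseRecord:
    model: str
    prompt: str
    answer: str
    cited_domains: list[str] = field(default_factory=list)

# Matches http(s) URLs, stopping at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

def extract_cited_domains(answer: str) -> list[str]:
    """Return unique domains of URLs found in the answer, in order of first appearance."""
    domains: list[str] = []
    for url in URL_PATTERN.findall(answer):
        # Strip trailing sentence punctuation before parsing the host.
        host = urlparse(url.rstrip(".,;:")).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host not in domains:
            domains.append(host)
    return domains

def build_record(model: str, prompt: str, answer: str) -> ResponseRecord:
    """Bundle one model response with the domains it cites."""
    return ResponseRecord(model, prompt, answer, extract_cited_domains(answer))
```

Storing the extracted domains alongside the raw answer keeps each record self-describing, so later analysis does not need to re-parse the text.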
Qualifications
Required
8+ years of experience in data engineering, machine learning infrastructure, or AI systems
experience building large-scale data pipelines or ML datasets
strong understanding of large language models and AI systems
experience working with distributed data systems and large datasets
ability to lead technical data teams and collaborate with researchers
Preferred
experience building datasets for machine learning evaluation or benchmarking
familiarity with retrieval-augmented generation (RAG) systems
experience analyzing large language model outputs or AI-generated responses
background in NLP or information retrieval systems
Why Join CiteWorks Studio
This role sits at the frontier of AI search research and generative AI systems.
The Director of AI Data will build the infrastructure needed to analyze millions of AI-generated responses and study how models retrieve and cite information.
As generative AI becomes the primary interface for information discovery, understanding AI data pipelines and retrieval behavior will become increasingly important.
Key Terms
Large Language Model (LLM)
A machine learning model trained on massive datasets that can generate text, answer questions, and perform reasoning tasks.
AI Data Infrastructure
The systems used to collect, process, and organize data used by machine learning models and AI research.
Generative Search
A form of search where AI systems generate answers by synthesizing information instead of returning ranked links.
AI Citation Intelligence
The analysis of how frequently specific sources appear in AI-generated responses.
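A minimal Python sketch of the kind of frequency analysis this term describes, assuming responses have already been reduced to (model, cited domains) pairs. The function names and the input shape are hypothetical, and counting each domain at most once per response is one possible convention among several.

```python
from collections import Counter, defaultdict

def citation_frequency(records):
    """Count, per model, how many responses cite each domain.

    `records` is an iterable of (model, cited_domains) pairs; this input
    shape is an assumption made for illustration.
    """
    per_model = defaultdict(Counter)
    for model, domains in records:
        # set() ensures a domain is counted once per response,
        # even if the answer links to it several times.
        per_model[model].update(set(domains))
    return dict(per_model)

def citation_share(counter, domain, total_responses):
    """Fraction of a model's responses that cite the given domain."""
    return counter[domain] / total_responses if total_responses else 0.0
```

Comparing `citation_share` for the same domain across models gives a simple cross-model view of which sources each system tends to reference.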
Job ID: 144563463