CiteWorks Studio

Director of AI Data

8-10 Years
  • Posted 3 hours ago

Job Description

Role: Director of AI Data

CiteWorks Studio is hiring a Director of AI Data to lead the development of datasets and data infrastructure used to study how large language models retrieve information, generate answers, and cite sources.

This leadership role focuses on building large-scale data pipelines that collect and analyze AI responses across systems such as ChatGPT, Claude, Gemini, Perplexity, and open-source large language models.

What Is AI Data Infrastructure?

AI data infrastructure refers to the systems used to collect, process, organize, and analyze the data that powers machine learning and artificial intelligence models.

For large language models, AI data infrastructure may include:

  • prompt-response datasets
  • model evaluation datasets
  • citation extraction pipelines
  • retrieval benchmarking datasets
  • large-scale training data collections

These systems allow researchers to study how AI models generate answers and retrieve knowledge.

What Does a Director of AI Data Do?

A Director of AI Data leads the strategy and development of data systems used for machine learning research and AI analysis.

The role focuses on building the datasets and pipelines required to analyze the behavior of large language models.

This includes developing systems that collect and structure:

  • AI-generated responses
  • prompt testing datasets
  • citation data
  • entity recognition signals
  • generative search outputs

The Director ensures that researchers and engineers have the data needed to analyze how AI systems retrieve, synthesize, and cite information.

About CiteWorks Studio

CiteWorks Studio is an AI research and generative engine optimization (GEO) firm focused on understanding how large language models retrieve and cite information.

Modern AI systems such as ChatGPT, Gemini, Claude, and Perplexity increasingly function as the primary interface for information discovery. Instead of ranking links like traditional search engines, these systems generate answers by retrieving and synthesizing knowledge from multiple sources.

CiteWorks Studio studies this transformation and helps organizations understand:

  • how AI systems determine trusted sources
  • how citation patterns appear inside AI-generated answers
  • how knowledge graphs influence model responses
  • how organizations become trusted references in generative search systems

Our research focuses on AI citation intelligence, generative search benchmarking, and LLM retrieval systems.

Key Responsibilities

The Director of AI Data will lead the development of large-scale datasets used to analyze how generative AI systems behave.

Responsibilities include:

  • building data pipelines that collect AI responses across multiple LLM platforms
  • designing datasets used to benchmark generative AI systems
  • developing systems that extract citations from AI-generated answers
  • creating structured datasets used to analyze retrieval patterns
  • managing prompt testing datasets used in AI evaluation
  • collaborating with machine learning researchers and engineers to support AI benchmarking systems

The role also involves developing the data infrastructure needed to analyze AI citation behavior and generative search systems at scale.

Why AI Data Infrastructure Matters

Large language models generate answers from patterns learned during training and, in retrieval-augmented systems, from external knowledge sources retrieved at query time.

Understanding how these systems behave requires structured datasets that capture:

  • model responses across prompts
  • citations included in AI answers
  • variability between models
  • hallucination patterns
  • knowledge retrieval behavior

AI data infrastructure enables researchers to analyze how generative AI systems retrieve and use information.

Data Systems This Role Will Build

The Director will help design data systems used to analyze the behavior of AI models.

Prompt Response Datasets

Large collections of prompts and AI-generated answers used to study model behavior.
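As an illustration only, one record in such a collection could be sketched as below. The `PromptResponseRecord` schema, its field names, and the JSON Lines storage format are assumptions made for this example, not a description of CiteWorks Studio's actual systems.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PromptResponseRecord:
    """One prompt sent to one model, plus the answer it returned (hypothetical schema)."""
    prompt_id: str
    model: str          # a free-form label such as "model-a", not a real API identifier
    prompt: str
    response: str
    collected_at: str   # ISO 8601 timestamp supplied by the caller


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into records, skipping blank lines."""
    return [
        PromptResponseRecord(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]
```

Keeping one flat record per prompt and model pair makes it straightforward to append new collections and to filter by model or prompt later.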

Citation Extraction Systems

Pipelines that identify and record sources cited inside AI-generated responses.
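A minimal sketch of one such extraction step, assuming sources appear as plain http(s) URLs in the response text. The `extract_citations` function and the URL pattern are hypothetical; real responses may also cite sources through footnote markers or structured metadata, which this sketch does not handle.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_citations(response_text):
    """Return cited URLs with their domains, in order of first appearance."""
    seen = set()
    citations = []
    for match in URL_PATTERN.finditer(response_text):
        url = match.group(0).rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url in seen:
            continue  # record each distinct URL once per response
        seen.add(url)
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        citations.append({"url": url, "domain": domain})
    return citations
```

Normalizing to a bare domain is a design choice for this example: it lets later analysis group citations by source rather than by individual page.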

Retrieval Benchmark Datasets

Datasets used to analyze how AI models retrieve information from different sources.

Cross-Model Comparison Data

Data used to compare outputs from multiple AI systems.
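One simple way to compare two systems is to measure how much their cited sources overlap for the same prompt. The sketch below uses Jaccard similarity over sets of cited domains; the choice of metric and the `jaccard_overlap` name are assumptions for illustration, not a method stated in this posting.

```python
def jaccard_overlap(domains_a, domains_b):
    """Jaccard similarity between two collections of cited domains.

    Returns 1.0 when both are empty (an assumed convention: two answers
    that cite nothing are treated as agreeing).
    """
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A value near 1 suggests the two systems drew on largely the same sources for that prompt, while a value near 0 suggests they cited different ones.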

Knowledge Graph Signal Datasets

Structured datasets used to analyze how entities and sources appear in AI responses.

Qualifications

Required

  • 8+ years of experience in data engineering, machine learning infrastructure, or AI systems
  • experience building large-scale data pipelines or ML datasets
  • strong understanding of large language models and AI systems
  • experience working with distributed data systems and large datasets
  • ability to lead technical data teams and collaborate with researchers

Preferred

  • experience building datasets for machine learning evaluation or benchmarking
  • familiarity with retrieval-augmented generation (RAG) systems
  • experience analyzing large language model outputs or AI-generated responses
  • background in NLP or information retrieval systems

Why Join CiteWorks Studio

This role sits at the frontier of AI search research and generative AI systems.

The Director of AI Data will build the infrastructure needed to analyze millions of AI-generated responses and study how models retrieve and cite information.

As generative AI becomes the primary interface for information discovery, understanding AI data pipelines and retrieval behavior will become increasingly important.

Key Terms

Large Language Model (LLM)

A machine learning model trained on massive datasets that can generate text, answer questions, and perform reasoning tasks.

AI Data Infrastructure

The systems used to collect, process, and organize data used by machine learning models and AI research.

Generative Search

A form of search where AI systems generate answers by synthesizing information instead of returning ranked links.

AI Citation Intelligence

The analysis of how frequently specific sources appear in AI-generated responses.
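To make the idea of source frequency concrete, the sketch below computes the share of responses that cite each domain at least once. The `source_share` function and its input shape (one list of cited domains per response) are hypothetical and chosen only for this example.

```python
from collections import Counter


def source_share(cited_domains_per_response):
    """Fraction of responses that cite each domain at least once."""
    total = len(cited_domains_per_response)
    if total == 0:
        return {}
    counts = Counter()
    for domains in cited_domains_per_response:
        counts.update(set(domains))  # count each domain once per response
    return {domain: n / total for domain, n in counts.items()}
```

Counting a domain once per response, rather than once per mention, keeps a single answer that repeats the same source from inflating that source's share.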


Job ID: 144563463