
CiteWorks Studio

Director of AI Evaluation and Benchmarking

Experience: 8-10 Years

Job Description

Role: Director of AI Evaluation and Benchmarking

CiteWorks Studio is hiring a Director of AI Evaluation and Benchmarking to lead research into how large language models generate answers, retrieve information, and cite sources.

This leadership role focuses on developing evaluation frameworks that analyze the behavior of AI systems such as ChatGPT, Claude, Gemini, Perplexity, and open-source large language models.

What Is AI Evaluation and Benchmarking?

AI evaluation and benchmarking is the process of systematically testing artificial intelligence systems to measure their accuracy, reliability, reasoning ability, and citation behavior.

For large language models, evaluation frameworks measure how well models (a minimal code sketch follows this list):

- generate correct answers
- cite trustworthy sources
- retrieve relevant information
- avoid hallucinations
- maintain consistent responses across prompts
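
As a sketch of what such a framework can look like in practice, the minimal Python harness below scores exact-match accuracy and a trusted-citation rate over a small prompt set. The ask_model stub, the EvalCase schema, and the metric names are illustrative assumptions, not an established tool.

    # Minimal evaluation-harness sketch. `ask_model` is a hypothetical stub;
    # a real harness would call a provider API and use richer scoring than
    # exact match.
    from dataclasses import dataclass

    @dataclass
    class EvalCase:
        prompt: str
        reference_answer: str       # ground-truth answer for exact-match scoring
        trusted_domains: list[str]  # domains we expect citations to come from

    def ask_model(prompt: str) -> tuple[str, list[str]]:
        """Stub standing in for a real model call: returns (answer, cited URLs)."""
        return "Paris", ["https://en.wikipedia.org/wiki/Paris"]

    def evaluate(cases: list[EvalCase]) -> dict[str, float]:
        correct = cited_trusted = 0
        for case in cases:
            answer, citations = ask_model(case.prompt)
            # Accuracy: the simplest possible correctness check, exact match.
            if answer.strip().lower() == case.reference_answer.strip().lower():
                correct += 1
            # Citation reliability: does any citation point at a trusted domain?
            if any(d in url for url in citations for d in case.trusted_domains):
                cited_trusted += 1
        n = len(cases)
        return {"accuracy": correct / n, "trusted_citation_rate": cited_trusted / n}

    cases = [EvalCase("What is the capital of France?", "Paris", ["wikipedia.org"])]
    print(evaluate(cases))  # {'accuracy': 1.0, 'trusted_citation_rate': 1.0}

Real frameworks replace exact match with graded scoring (semantic similarity, LLM-as-judge) and track far more metadata, but the loop structure is the same.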

AI benchmarking helps researchers understand how different AI systems behave and which models perform best across different tasks.

What Does a Director of AI Evaluation and Benchmarking Do?

A Director of AI Evaluation and Benchmarking leads the development of systems used to test and analyze large language models.

This role focuses on measuring how AI systems generate answers, retrieve information, and determine which sources to cite.

The Director designs evaluation frameworks that analyze (a sketch of a combined report follows this list):

- model accuracy
- citation reliability
- hallucination frequency
- reasoning performance
- retrieval consistency
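
A minimal sketch of how those five dimensions might be gathered into a single per-model report; the ModelReport field names are illustrative, not a standard schema.

    # Illustrative container for the five evaluation dimensions listed above.
    from dataclasses import dataclass

    @dataclass
    class ModelReport:
        model_name: str
        accuracy: float               # share of prompts answered correctly
        citation_reliability: float   # share of answers citing trusted sources
        hallucination_rate: float     # share of answers with fabricated claims
        reasoning_score: float        # score on multi-step reasoning tasks
        retrieval_consistency: float  # agreement of retrieved sources across reruns

        def summary(self) -> str:
            return (f"{self.model_name}: acc={self.accuracy:.2f}, "
                    f"cite={self.citation_reliability:.2f}, "
                    f"halluc={self.hallucination_rate:.2f}")

    print(ModelReport("example-model", 0.91, 0.84, 0.06, 0.78, 0.88).summary())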

The role sits at the intersection of machine learning research, information retrieval, and generative AI systems.

About CiteWorks Studio

CiteWorks Studio is an AI research and generative engine optimization (GEO) firm focused on understanding how large language models retrieve and cite information.

Modern AI systems such as ChatGPT, Claude, Gemini, and Perplexity increasingly function as the primary interface for information discovery. Instead of ranking links like traditional search engines, these systems generate answers by retrieving and synthesizing information from trusted sources.

CiteWorks Studio studies this transformation and helps organizations understand:

- how AI systems determine trusted sources
- how citation patterns emerge inside AI-generated answers
- how knowledge graphs influence model responses
- how organizations become trusted references in generative search systems

Our research focuses on AI citation intelligence, generative search benchmarking, and LLM retrieval systems.

Key Responsibilities

The Director of AI Evaluation and Benchmarking will lead the development of systems that analyze how large language models behave across different tasks and prompts.

Responsibilities include:

- designing evaluation frameworks for large language models
- building prompt testing systems that analyze AI responses
- benchmarking AI models across accuracy, reasoning, and citation reliability
- measuring hallucination rates and model reliability
- analyzing how generative AI systems retrieve and synthesize knowledge
- comparing performance across models such as ChatGPT, Claude, Gemini, and open-source LLMs
- publishing research on AI evaluation and generative search behavior

Why AI Benchmarking Matters

Traditional search engines return ranked web pages.

Large language models generate answers.

Because these systems synthesize information rather than simply ranking pages, it becomes essential to measure how reliable and trustworthy the generated answers are.

AI benchmarking frameworks help researchers understand:

- how often models generate correct answers
- which sources models choose to cite
- how frequently hallucinations occur
- how models behave across different prompts and tasks

These insights are essential for improving the reliability and transparency of generative AI systems.
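
As an illustration of that cross-model comparison, the hedged sketch below runs one shared prompt set through several models and tabulates per-model accuracy. The MODELS stubs are hypothetical placeholders; a real harness would wrap each provider's API client.

    # Cross-model benchmarking sketch: same prompts, per-model accuracy.
    def stub_model(fixed_answer: str):
        """Returns a fake model that always gives the same answer."""
        return lambda prompt: fixed_answer

    MODELS = {
        "model-a": stub_model("Paris"),  # placeholder for a real API client
        "model-b": stub_model("Lyon"),
    }

    BENCHMARK = [("What is the capital of France?", "Paris")]

    def run_benchmark() -> dict[str, float]:
        scores = {}
        for name, model in MODELS.items():
            correct = sum(model(q).lower() == ref.lower() for q, ref in BENCHMARK)
            scores[name] = correct / len(BENCHMARK)
        return scores

    print(run_benchmark())  # {'model-a': 1.0, 'model-b': 0.0}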

Evaluation Areas This Role Will Study

The Director will oversee evaluation frameworks that analyze multiple aspects of AI system behavior.

Model Accuracy

Testing how frequently models produce correct answers.

Citation Reliability

Measuring how often models cite trustworthy sources.

Hallucination Detection

Identifying cases where models generate incorrect or fabricated information.

Retrieval Behavior

Studying how AI systems retrieve and synthesize information.

Cross-Model Benchmarking

Comparing performance across different AI platforms.
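
As one deliberately toy example from the areas above, hallucination detection can be prototyped as a support check: flag answer sentences that share no content words with the sources the model cites. Production detectors rely on entailment models rather than the word-overlap heuristic assumed here.

    # Toy hallucination check: flag sentences with no lexical overlap
    # with the cited source passages. Illustrative only.
    def unsupported_sentences(answer: str, sources: list[str]) -> list[str]:
        corpus = " ".join(sources).lower()
        flagged = []
        for sentence in answer.split(". "):
            # Content words only: skip short function words.
            words = [w for w in sentence.lower().split() if len(w) > 4]
            if words and not any(w in corpus for w in words):
                flagged.append(sentence)
        return flagged

    answer = "The tower opened in 1889. It was painted bright green in 2024"
    sources = ["The Eiffel Tower opened to the public in 1889."]
    print(unsupported_sentences(answer, sources))
    # ['It was painted bright green in 2024']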

Qualifications

Required

- 8+ years of experience in machine learning, AI research, or data science
- strong understanding of large language models and transformer architectures
- experience building machine learning evaluation or benchmarking systems
- background in natural language processing (NLP) or information retrieval
- experience designing testing frameworks for complex systems

Preferred

- experience evaluating large language models or generative AI systems
- familiarity with retrieval-augmented generation (RAG) systems
- experience analyzing hallucination and reliability in AI systems
- background in AI safety or model testing

Why Join CiteWorks Studio

This role sits at the frontier of AI search and generative AI research.

The Director of AI Evaluation and Benchmarking will help develop the frameworks used to measure how modern AI systems retrieve knowledge and generate answers.

As generative AI becomes the primary interface for information discovery, evaluation frameworks will become essential for understanding how AI systems determine trusted sources.

Key Terms

Large Language Model (LLM)

A machine learning model trained on massive datasets that can generate text, answer questions, and perform reasoning tasks.

AI Benchmarking

The process of testing artificial intelligence systems using standardized prompts, datasets, and evaluation metrics.

Generative Search

A form of search where AI systems generate answers by synthesizing information instead of returning ranked links.

AI Citation Intelligence

The analysis of how frequently specific sources appear in AI-generated responses.
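
A small sketch of this kind of analysis, counting how often each domain appears among the citations in a set of AI-generated answers; the sample data is invented for illustration.

    # Citation-frequency sketch: tally cited domains across many answers.
    from collections import Counter
    from urllib.parse import urlparse

    answers_citations = [
        ["https://en.wikipedia.org/wiki/LLM", "https://arxiv.org/abs/1234.5678"],
        ["https://en.wikipedia.org/wiki/RAG"],
    ]

    def domain_frequency(citation_lists: list[list[str]]) -> Counter:
        return Counter(
            urlparse(url).netloc for urls in citation_lists for url in urls
        )

    print(domain_frequency(answers_citations))
    # Counter({'en.wikipedia.org': 2, 'arxiv.org': 1})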

Job ID: 144562739