The Opportunity
Evaluate cutting-edge AI-generated code through real-world engineering challenges. Work with major open-source projects — data orchestration engines, ML/AI model libraries, LLM tooling frameworks, and workflow automation platforms. Your structured, evidence-based evaluations help train and benchmark the next generation of AI coding assistants.
About Biz-Tech Analytics
Biz-Tech Analytics (BTA) is a data services and AI solutions company specializing in high-quality labeled datasets, training data, and AI model evaluation. We partner with leading AI research teams and enterprises to deliver the data infrastructure that powers responsible AI development.
What You'll Do
- Analyze Real Engineering Challenges: Work with complex engineering tasks sourced from major open-source projects (e.g., implementing new schedulers, adding model architectures, building tool integrations).
- Write Engineering Prompts: Craft clear, self-contained prompts that describe engineering tasks for AI coding models to attempt.
- Set Up Development Environments: Prepare local development setups using Git, VS Code, tmux, and Python. Clone repos, create branches, and configure dependencies.
- Run Evaluation Workflows: Use provided tooling to generate AI model outputs on engineering tasks and capture the resulting code changes.
- Review Code Changes: Thoroughly examine every file change in diffs. Validate correctness, code quality, adherence to best practices, and alignment with original requirements.
- Complete Structured Evaluations: Write 500–800-word evaluations assessing AI-generated code across multiple dimensions, including correctness, code quality, and engineering judgment. Provide evidence-backed ratings using structured rubrics.
You Must Have
- 3+ years of professional software development experience
- Strong proficiency in Python AND at least one of: JavaScript/TypeScript, Go, Rust, Java, or C++
- Experience reading and reasoning about large, unfamiliar codebases — including tracing cross-file dependencies across 10–50+ file changes
- Solid understanding of software testing: unit/integration tests, test frameworks (pytest, Jest, Mocha), and the ability to assess test completeness and coverage gaps
- Git fluency: branching, PRs, diffs, cherry-picks, and merge conflict resolution
- Ability to set up and troubleshoot development environments (Docker, tmux, VS Code, terminal/CLI workflows)
- Strong written English: ability to compose detailed, structured evaluation reports (500+ words, clear logic, professional tone)
- Code review experience: ability to read code critically, trace architectural impact across files, assess backward compatibility, and identify correctness, performance, and maintainability issues
Bonus Points
- Contributions to open-source projects (data orchestration, ML/AI libraries, LLM tooling, workflow automation, or similar)
- Experience with ML/AI frameworks (PyTorch, TensorFlow, JAX) — strongly preferred as 40% of tasks involve ML-related codebases
- Familiarity with AI code generation tools (GitHub Copilot, Claude, ChatGPT)
- Background in QA/testing roles, formal code review, or experience across diverse domains (backend APIs, data pipelines, ML systems, developer tooling)
Please Don't Apply If
- You've only done data engineering/ETL without application development
- You struggle to read and reason about code you didn't write
- You're uncomfortable with terminal/CLI workflows
- Your written English isn't strong enough for structured 500+ word evaluations
- You're looking for a pure coding/development role (this is evaluation-focused)