Overview:
Design and build challenging, real-world terminal-based tasks for evaluating frontier AI agents. Tasks must be genuinely difficult, clearly specified, and programmatically verifiable.
Responsibilities:
- Design high-quality task ideas rooted in real-world workflows (debugging, infra setup, data pipelines, security, ML training, etc.)
- Write clear, unambiguous task instructions with defined end states
- Build Docker environments and write oracle solutions that pass all tests
- Write deterministic pytest-based verification scripts
- Identify edge cases and ensure tasks can't be shortcut or gamed by AI agents
- Iterate with reviewers based on QC and platform gate feedback
Must-Haves:
- At least 3 years (ideally 5 or more) of hands-on engineering experience in at least one domain (SWE, DevOps, ML, security, data engineering, scientific computing)
- Proficiency in Python and shell scripting (bash)
- Comfortable writing Dockerfiles, building images, and debugging containers
- Experience writing automated tests (pytest, unittest)
- Familiarity with Git workflows (PRs, diffs, branching)
- Strong technical writing: the ability to produce precise, unambiguous specifications
Nice-to-Haves:
- Experience with AI benchmarks (SWE-bench, Terminal-Bench, GPQA)
- Open-source contributions or GitHub PR history in relevant repos
- Experience with the Harbor evaluation framework
- Background in competitive programming or Kaggle
- Domain depth in niche areas (kernel dev, cryptography, HPC, media processing)
- Master's or PhD in CS
Engagement:
- Fully remote
- Fixed rate per accepted task: $40–$60, plus a performance-based bonus