
Senior Python Engineer / LLM Evaluation Part-Time, Remote
We are hiring experienced Python engineers for part-time, task-based work focused on evaluating and testing Large Language Models (LLMs).
This is not traditional QA, and it is not junior AI labeling work. We are looking for senior engineers who can reason deeply about system behavior, ambiguity, and real-world usage, not just write code.
What You'll Do
Design structured test cases that simulate real human workflows
Define gold-standard outputs and expected behaviors
Analyze LLM failure modes such as hallucinations, bias, and context limitations
Work directly with Git repositories and existing codebases
Navigate incomplete documentation and ambiguous requirements
Apply engineering judgment to determine what "good" looks like
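To give a flavor of the work, here is a minimal, hypothetical sketch of a structured test case that pairs a realistic prompt with a gold-standard answer and explicit assertions. The `EvalCase` and `check` names are illustrative, not part of any existing framework.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    prompt: str                                         # simulated user input
    gold: str                                           # gold-standard expected answer
    must_contain: list = field(default_factory=list)    # substrings the output must include
    failure_modes: list = field(default_factory=list)   # e.g. "hallucinated date"

def check(case: EvalCase, model_output: str) -> dict:
    """Score a model output against the case's assertions."""
    missing = [s for s in case.must_contain
               if s.lower() not in model_output.lower()]
    return {
        "exact_match": model_output.strip() == case.gold.strip(),
        "missing_assertions": missing,
        "passed": not missing,
    }

case = EvalCase(
    prompt="What year was Python 3.0 released?",
    gold="Python 3.0 was released in 2008.",
    must_contain=["2008"],
    failure_modes=["hallucinated date"],
)
result = check(case, "Python 3.0 came out in 2008.")
```

Note that the output passes the assertion check without matching the gold answer verbatim: defining which of those signals matters for a given scenario is exactly the kind of judgment call this role involves.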
Who You Are
3+ years of software development experience (Python-focused)
Python is your primary language
Strong hands-on Git experience in real projects
Comfortable reading and debugging code you didn't write
Able to reason about edge cases, trade-offs, and ambiguity
Strong written and spoken English (B2+)
Nice to Have
QA or structured testing experience (must be code-capable)
Experience evaluating AI or LLM systems
Familiarity with evaluation metrics such as precision, recall, coverage
Experience working with Docker
Consulting or freelance engineering background
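For the metrics mentioned above, here is a small illustrative sketch (assumed function and variable names) of computing precision and recall when an evaluator flags model outputs as failures against human gold labels:

```python
def precision_recall(flagged: set, true_failures: set) -> tuple:
    """flagged: cases an evaluator marked as failures;
    true_failures: cases humans confirmed as failures."""
    tp = len(flagged & true_failures)          # true positives
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(true_failures) if true_failures else 0.0
    return precision, recall

flagged = {"case_1", "case_3", "case_4"}       # evaluator's flags
true_failures = {"case_1", "case_2", "case_3"} # gold labels
p, r = precision_recall(flagged, true_failures)
# tp = 2, so precision = 2/3 and recall = 2/3
```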
What We're Looking For
We value engineers who can explain why something fails, not just that it fails. If you naturally think in terms of scenarios, assertions, failure modes, and user expectations, you'll thrive here.
This role suits senior backend Python engineers, ML engineers who still code regularly, and technically strong evaluators with real production experience.
Fully remote. Flexible schedule. Task-based delivery.
If you're interested in applying your engineering judgment to real-world AI system evaluation, we'd love to hear from you.
Job ID: 143833967