Caw Networks

SDET 2

  • Posted 20 hours ago

Job Description

Responsibilities

  • Build evaluation datasets and harnesses - golden sets, regression suites, and LLM-as-judge harnesses (verified against human labels).
  • Design observability so anyone can answer "Did the agent get worse this week?" in under 10 minutes, with a chart, not vibes.
  • Gate every prompt PR on automated eval scores, with quality embedded in design rather than bolted on as a downstream gate.
  • Partner with AI engineering on red-teaming: adversarial datasets for PII, jailbreaks, and prompt injection.
  • Run load and chaos tests on async LLM pipelines.

Requirements

  • 2-5 years in QA / SDET / SET / quality engineering, with at least 1-1.5 years on backend / API / systems testing.
  • Strong fluency in any modern language - TypeScript, Java, Go, or Python. Language is not a barrier.
  • Modern test framework with non-trivial fixtures and plugins: Vitest / Jest / Pytest / JUnit + RestAssured / equivalent.
  • One of the following: contract testing (Pact / Postman / Schemathesis), load testing (k6 / Locust / JMeter), distributed tracing (OpenTelemetry / Datadog / Honeycomb), or CI test infrastructure.
  • Comfort reading backend code in PRs and using test management tooling (JIRA / Zephyr / TestRail).

Bonus (genuinely a bonus, not silent rejects)

  • Hands-on with LLM / RAG / agent / voice systems.
  • Eval tooling: Langfuse, LangSmith, Phoenix, Braintrust, Ragas, DeepEval, OpenAI Evals.
  • Voice/telephony testing: call quality, latency, ASR/TTS evaluation.
  • Regulated-domain QA: PII, audit trails, compliance gates.
  • Hindi or other Indic language testing.
  • Open-source contributions in test or eval tooling.

Stack we use today

  • AI integration: Anthropic SDK in TypeScript, embedded in our existing application.
  • Eval/observability: Langfuse, LangSmith, OpenTelemetry, plus internal harnesses.
  • Languages: TypeScript preferred for AI app code; Python, Go, or Java elsewhere as the problem demands.
  • Coding assistants: Codex and Claude Code are part of normal development.
  • We hire on primitives: evaluation rigour, observability, contract literacy, and failure-mode imagination. Tools turn over; primitives don't.

What We're Looking For

  • Systems thinking over screen thinking. You reason about contracts, retries, latency, streaming, and async, not just what's on the page.
  • Eval-first instinct. Asked to test a chatbot, you reach for a golden dataset, not Selenium.
  • You write code. Not glue scripts, but code that survives a senior engineer's review.
  • You debug from telemetry. You've found the root cause from logs and traces. You've killed a flaky test and have an opinion on why most flaky tests are actually bad tests.
  • You work alongside coding agents (Codex, Claude Code) and review their output as critically as a human would.

This job was posted by S M Nandakishore from CAW Studios.

Job ID: 148326311