Job Description
AI Evaluation & Test Engineer
Job Summary
We are looking for an AI Evaluation & Test Engineer to ensure that our generative AI models and applications are safe, accurate, and trustworthy, and that they deliver an elegant user experience. The test engineer will act as a customer of our AI systems, obsessing over quality, usability, and customer-centric system behavior. This role is ideal for product-minded engineering professionals passionate about shaping the behavior of AI systems in real-world scenarios.
Key Responsibilities
- Build and maintain AI evaluation pipelines to test and measure model behavior and system performance.
- Define AI quality metrics including factuality, faithfulness, grounding precision/recall, toxicity, latency, and cost.
- Develop automation for AI evaluation and regression testing at scale.
- Establish release gates in CI/CD aligned with the BCCI TCOE QA maturity framework.
- Perform functional testing including test case creation, execution, and reporting in ALM and JIRA.
- Design and maintain end-to-end Selenium automation using Java for UI validation.
- Conduct adversarial, negative, and stress testing to identify vulnerabilities.
- Collaborate with product, engineering, ML, and user experience teams to refine AI behavior and interactions.
Qualifications
- 3+ years in manual and automation testing, with at least 1 year in AI/ML evaluation.
- Strong knowledge of functional QA processes with experience writing test plans, test cases, and reports.
- Hands-on experience with JIRA and OpenText ALM for test lifecycle management.
- Strong experience with Selenium and Java for E2E UI automation.
- Working knowledge of generative AI, RAG, prompting, explainability, guardrails, and observability.
- Understanding of the differences between traditional software testing and AI/ML model evaluation.
- Strong debugging, analytical, and problem-solving skills.
- Ability to work with cross-functional teams.
- Familiarity with LLM-as-a-Judge evaluation methodologies.
Required Skills
Java, Selenium