Job Description: AI QA Engineer
Experience: 7+ Years
Location: Gurugram
Work Mode: Hybrid - 3 Days WFO
Job Summary
We are seeking an AI QA Engineer to ensure the quality, accuracy, and performance of our
enterprise-grade Natural Language to SQL (NL2SQL) pipeline. You will be responsible for
validating a complex, multi-stage AI architecture, including semantic routing, LLM-based
disambiguation, and query generation, and ensuring it securely and accurately translates
user intent into valid queries within the BFSI domain.
Key Responsibilities
- LLM & Pipeline Evaluation: Design and execute automated evaluations for a 4-stage
NL2SQL pipeline using LangSmith. Monitor metrics such as structural F1, execution
accuracy, latency, and token cost.
- Dataset Management: Create, curate, and maintain benchmark/golden datasets for
continuous regression testing of LLM prompts and model outputs.
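Golden-dataset regression runs of this kind are often structured as a parametrized Pytest suite. A minimal sketch follows; the case data, the `generate_sql` entry point, and the normalization rule are illustrative assumptions, not details of this role's actual pipeline:

```python
import pytest

# Illustrative inline golden cases; in practice these would be curated in a
# version-controlled JSONL file (schema and names here are assumptions).
GOLDEN_CASES = [
    {"question": "total loans booked in 2023",
     "expected_sql": "SELECT COUNT(*) FROM loans WHERE book_year = 2023"},
    {"question": "average balance per branch",
     "expected_sql": "SELECT branch_id, AVG(balance) FROM accounts GROUP BY branch_id"},
]

def normalize_sql(sql: str) -> str:
    """Collapse case and whitespace so cosmetic diffs don't fail the run."""
    return " ".join(sql.lower().split())

def generate_sql(question: str) -> str:
    """Placeholder for the real NL2SQL pipeline call (assumed interface)."""
    # Echo the expected answer so the sketch is self-contained and runnable.
    lookup = {c["question"]: c["expected_sql"] for c in GOLDEN_CASES}
    return lookup[question]

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["question"])
def test_nl2sql_regression(case):
    predicted = generate_sql(case["question"])
    assert normalize_sql(predicted) == normalize_sql(case["expected_sql"])
```

Because each golden case is a separate parametrized test, a prompt or model change that regresses one question fails only that case, which makes triage faster.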
- Search & Retrieval Testing: Validate precision and recall trade-offs in semantic search and
schema discovery, ensuring optimal candidate selection for downstream query generation.
- Failure Analysis & Debugging: Perform root cause analysis across pipeline stages (routing,
disambiguation, query generation, execution), identifying issues such as schema mismatches,
type/coercion errors, runtime incompatibilities, and query structure failures.
- E2E & API Automation: Develop automated test scripts using Python (Pytest) for
backend API testing and Playwright for the React frontend, validating end-to-end user
workflows.
- Observability & Debugging: Utilize Grafana and structured JSONL logs to identify
pipeline bottlenecks, LLM hallucinations, or prompt degradation.
- Compliance & Security: Ensure the AI pipeline meets strict BFSI data security standards by
validating execution safety mechanisms (e.g., runtime capability probing, injection
prevention). Design validation rules and guardrails for AI pipelines to prevent invalid
query generation and runtime failures.
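A guardrail of this kind can be sketched as a simple pre-execution validator for generated SQL. The rules below (read-only, single statement, keyword denylist) are illustrative assumptions, not the project's actual policy:

```python
import re

# Denylist of statement-altering keywords; real policies would be broader
# and likely AST-based rather than regex-based.
BANNED = re.compile(r"\b(insert|update|delete|drop|alter|truncate|grant|exec)\b", re.I)

def is_safe_select(sql: str) -> bool:
    """Reject generated SQL that is not a single read-only SELECT statement."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:  # reject multi-statement payloads (classic injection vector)
        return False
    if not stmt.lower().startswith("select"):
        return False
    if BANNED.search(stmt):
        return False
    return True
```

Running every generated query through such a validator before execution converts an entire class of runtime failures and injection attempts into cheap, testable rejections.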
Required Skills
- AI/LLM Testing: Experience testing LLM applications, RAG (Retrieval-Augmented
Generation) pipelines, or NLP models. Familiarity with AI evaluation frameworks (e.g.,
LangSmith, DeepEval, or similar).
- Languages: Strong proficiency in Python 3.12+ (crucial for integrating with the existing AI
backend and Pytest suite). Secondary experience with JavaScript/TypeScript.
- Test Automation: Expertise in API testing (REST) and, optionally, UI automation using
Playwright.
- Data & Search: Understanding of Vector Databases (e.g., Milvus, Pinecone) and semantic
search concepts (embeddings, hybrid search).
- Data & SQL Validation: Solid understanding of SQL and data validation techniques to
verify correctness of complex query outputs.
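Correctness of complex query outputs is often verified by executing both the gold and the generated SQL against a fixture database and comparing result sets. The schema, data, and queries below are made-up examples of that pattern, not artifacts of this role:

```python
import sqlite3

def result_set(conn: sqlite3.Connection, sql: str) -> list:
    """Return query rows sorted, so row order does not affect the comparison."""
    return sorted(conn.execute(sql).fetchall())

# Small in-memory fixture database (illustrative schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, branch TEXT, balance REAL);
    INSERT INTO accounts VALUES (1, 'N', 100.0), (2, 'S', 250.0), (3, 'N', 50.0);
""")

gold = "SELECT branch, SUM(balance) FROM accounts GROUP BY branch"
predicted = "SELECT branch, SUM(balance) AS total FROM accounts GROUP BY branch ORDER BY branch"

# Execution accuracy: the queries agree if they return the same rows,
# even though the SQL text differs (alias, explicit ORDER BY).
assert result_set(conn, gold) == result_set(conn, predicted)
```

Comparing executed results rather than SQL strings is what distinguishes execution accuracy from structural metrics like structural F1.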
- Tools & Infrastructure: Git, Docker, CI/CD pipelines, and observability tools
(Prometheus/Grafana).
Education
- BE / BTech / MCA / BSc in Computer Science, Data Science, or a related field.
Nice to Have
- Familiarity with Graph Databases (Neo4j) and LangGraph orchestration.
- Experience evaluating foundational LLM models (OpenAI, Anthropic, Google).
- Prior exposure to query languages such as SQL or PURE, or to functional programming
languages.
- Experience testing workflows across multiple services or pipelines, with an understanding
of failure handling, retries, and system reliability concepts.
- Experience in Banking, Financial Services, or Insurance domains.
- Understanding of data security, compliance, and enterprise