
Exp - 8 Years
Data Scientist
Design and implement end-to-end evaluation frameworks to assess performance, reliability, and safety of multi-agent AI systems
Lead experimentation and A/B testing efforts to systematically test hypotheses, validate model improvements, and track performance across agent iterations
Curate and maintain high-quality ground truth datasets to enable accurate, reproducible evaluation of multi-agent outputs
Identify and address reliability and accuracy gaps across agent workflows, failure modes, and edge cases in production-like environments
Stay current on emerging research in agentic AI, LLM evaluation, and multi-agent coordination to continuously improve framework design
Technical Skills
Proficiency in Python and ML frameworks
Hands-on experience with LLM APIs and agentic frameworks (LangChain, LlamaIndex, Semantic Kernel)
Familiarity with evaluation tooling (Ragas, DeepEval, LangSmith, or similar)
Experience with data pipelines, experiment tracking (MLflow, W&B), and CI/CD for ML workflows
Strong foundation in statistics, NLP, prompt engineering, experimental design, and A/B testing methodology
Proficiency in Azure ML, Azure OpenAI Service, and Azure AI Foundry for model deployment, evaluation, and orchestration
Familiarity with Azure Monitor and Application Insights for tracking reliability and performance of deployed agent systems
Job ID: 149183791