AI Evaluation Engineer

Apple

Hyderabad, India

4-6 Years

This job is no longer accepting applications

Posted 2 months ago

Job Description

Summary

Imagine what you could do here. At Apple, new ideas have a way of becoming outstanding products, services, and customer experiences very quickly. Bring passion and dedication to your job, and there's no telling what you could accomplish.

Apple's Sales organization generates the revenue needed to fuel our ongoing development of products and services. This, in turn, enriches the lives of hundreds of millions of people around the world. We are, in many ways, the face of Apple to our largest customers.

Apple's US Decision Intelligence (DI) team is looking for a talented individual who is passionate about crafting, implementing, and operating AI solutions that have a direct and measurable impact on Apple Sales and its customers.

Description

We're seeking a visionary AI Evaluation Engineer to own the end-to-end evaluation pipeline for our AI products. This role will focus on implementing and maintaining evaluation frameworks, instrumentation, and workflows that help us understand how well our AI systems perform, where they fail, and how they improve over time.

This role will operate in two capacities: augmenting the existing AI roadmap and pioneering new frontier tech projects, crafting AI experiences that reduce time to insight and catalyze decision-making.

Responsibilities

Build and operate AI evaluation workflows that measure the quality of LLM outputs across chat, summarization, recommendations, and agent-based features.
Implement LLM-as-a-Judge and rubric-based evals to score outputs for correctness, relevance, grounding, and consistency.
Instrument LLM and agent workflows to capture traces, prompts and responses, metadata, and user feedback.
Support release readiness by running evals before launches and highlighting regressions or quality risks.
Help define agent-specific evals (task completion, tool correctness, error recovery).
Partner with AI engineers and AI platform teams to translate product requirements into eval criteria.
Review eval results and recommend improvements.
Contribute to system design for observability, retries, and logging.

Minimum Qualifications

4+ years of experience in data and AI-related fields such as AI engineering, software development, ML engineering, data science, or QA roles.
Eagerness and ability to learn new skills and solve dynamic problems in an encouraging and expansive environment.
Experience working across global teams to ensure alignment of product development.
Strong Python skills.
Applied knowledge of GenAI and RAG strategies, micro-services, recommendation systems, and context engineering.
Familiarity with AI evaluation techniques, such as golden datasets, LLM-as-a-judge, or rubric-based scoring.
Experience with different LLM ecosystems (OpenAI, Anthropic, Gemini, etc.), RAG pipelines, and vector databases (e.g., Pinecone, FAISS, Milvus, PostgreSQL).
Proficiency in SQL and experience with at least one major data analytics platform, such as Hadoop, Spark, or Snowflake.
Experience with CI/CD or release validation workflows.
Familiarity with telemetry and evaluation frameworks for AI agents.
Experience working with data science teams on insights generation leveraging LLMs.
Knowledge of project management and productivity tools such as Wrike and Miro.
Strong time management skills with the ability to collaborate across multiple teams.
Able to balance competing priorities, long-term projects, and ad hoc requirements.
Ability to work in a fast-paced, dynamic, constantly evolving business environment.
B.S. degree in Computer Science/Engineering, or equivalent work experience.

Preferred Qualifications

Hands-on experience with Langfuse or similar tools for LLM observability.
Sound communication skills: expert at conveying domain and technical content at a level appropriate for the audience. Strong ability to gain trust with stakeholders and senior leadership.
Familiarity with embeddings, retrieval algorithms, agents, and data modeling for vector development graphs.
Experience with complementary technologies for distributed systems architecture, asynchronous messaging, agent communication, and caching, such as RabbitMQ, Redis, and Valkey.
Advanced Degree (MS or Ph.D.) in Economics, Electrical Engineering, Statistics, Data Science, or a similar quantitative field is preferred.