Software Engineer - AI and Data Platforms

  • Posted 3 hours ago

Job Description

Summary
The Applied Machine Learning team within the AI and Data Platform organization is at the forefront of driving digital transformation through machine learning across Apple's enterprise ecosystem. We build and operate large-scale ML, GenAI, inference, and data platforms that power business-critical workflows across Apple.

Our systems sit on the critical path of real-time decisioning: every transaction across Apple Online Store, Retail, Media, and Support systems depends on our platform's ability to make fast, accurate fraud decisions. This requires solving complex challenges in distributed systems, extreme scale (up to 100x traffic bursts), and low-latency processing, using a diverse set of open-source and cutting-edge technologies.

This role is part of the Reliability & Platform Engineering (SRE) team, but it is not a traditional or legacy operations role. Instead of reactive support, the focus is on building the platform itself—designing scalable systems, creating intelligent automation, and developing tools that redefine how reliability is engineered.

You will work on developer-first platform capabilities, apply AI/GenAI-driven approaches to observability and operations, and take ownership of systems end-to-end. This role combines software engineering, distributed systems, and platform architecture, with a strong emphasis on building solutions—not just operating them.

Description
We are looking for engineers with strong coding skills and solid computer science fundamentals who are passionate about building resilient, high-performance distributed systems and platform infrastructure.

As a Software Engineer in AI & Data Platform Reliability Engineering, you will work on systems powering GenAI, ML inference, and real-time fraud decisioning at scale. This is a hands-on engineering role focused on system design, platform development, and intelligent automation.

You Will
  • Design and build developer-first platform components that enable seamless onboarding and execution of ML workflows
  • Develop automation, internal tools, and AI-driven solutions to enhance observability, reliability, and operational efficiency
  • Build and operate multi-tenant, distributed systems handling high-throughput and highly concurrent workloads
  • Work on systems that scale to extreme traffic spikes (up to 100x BAU) with strict latency and availability requirements
  • Collaborate with cross-functional teams to deliver impactful platform capabilities and customer-facing features
  • Lead projects end-to-end, from architecture and design to deployment and production excellence
  • Continuously improve system performance, scalability, and resilience
  • Proactively identify, diagnose, and solve complex system and production challenges

We are looking for engineers who enjoy going deep into systems, understanding how they behave at scale, and building smart, scalable solutions on cloud-native infrastructure (Kubernetes, hybrid cloud).

Responsibilities
  • Design, develop, and own core platform components for fraud decisioning and ML inference systems (Athena platform)
  • Build automation frameworks and developer tooling to improve platform usability and operational efficiency
  • Identify recurring operational challenges (toil) and design thoughtful, scalable solutions to address them
  • Architect and implement advanced observability solutions, including AI/GenAI-driven insights and automation
  • Ensure high availability, resilience, and scalability of critical distributed systems
  • Drive performance optimization, reliability engineering, and capacity planning
  • Collaborate closely with ML engineers and application teams to enable seamless workflow onboarding
  • Contribute to infrastructure as code, CI/CD pipelines, and platform standardization
  • Take ownership across the full system lifecycle—design, implementation, deployment, and production operations
  • Continuously improve the platform by identifying opportunities for automation, abstraction, and simplification

Minimum Qualifications
  • Bachelor's degree in Computer Science, Computer Engineering, or an equivalent technical field
  • 3+ years of strong programming experience in Python or similar languages
  • Solid foundation in data structures, algorithms, operating systems, and distributed systems fundamentals
  • Experience with cloud-native technologies (Kubernetes, containers, AWS or similar platforms)
  • Familiarity with infrastructure as code and automation tools (e.g., Terraform, Ansible)
  • Ability to read, understand, and work effectively with large open-source codebases

Preferred Qualifications
  • Excellent analytical and problem-solving skills
  • Exposure to Machine Learning and GenAI technologies
  • Exposure to dataset management and cost optimization in the cloud
  • Exposure to Ray and Ray Serve for building scalable, distributed model-serving platform components

At Apple, we believe accessibility is a fundamental human right. You'll find that idea reflected in everything here — in our culture, our benefits and our digital tools. By welcoming as many perspectives as possible, we help you build a career where you feel like you belong.

Learn about accessibility in Apple's workplace

Role Number: 200609512-1052

Job ID: 147492261
