Site Reliability Engineering Manager, Apple Data Platform

Apple

Bengaluru, India

10-12 Years

Posted 18 hours ago

Job Description

Summary

At Apple, we believe that innovation flourishes in an environment where ideas are challenged, collaboration is encouraged and technology is pushed to its limits. This environment is only possible when diverse minds come together, bringing unique perspectives and experiences. Our people and their ideas inspire innovation in everything we do. Imagine what you could accomplish here! Join Apple and help us make the world a better place.

As the engineering manager, you will help hire and build our SRE teams in Bangalore for Apple Data Platform (ADP). Our ADP Infra SRE teams are expanding to provide multi-site, follow-the-sun on-call and engineering support, scaling our tools and coverage for our partner teams. You will use your knowledge of SRE principles and technical experience to find the best engineers to support our portfolio, including Jupyter Notebooks, Spark, Flink, and other open-source products used by our platform teams. You will be a hands-on manager, making your own technical contributions, partnering with your peers in the US, and driving reliability through consistent execution.

Description

Apple Service Engineering (ASE) teams build and scale the platforms and infrastructure behind many of Apple's services (such as iCloud, iTunes, Siri, and Maps). We are the foundation on which Apple's software developers build the products that our customers love. We are looking for a passionate and dedicated Site Reliability Engineering Manager to provide technical leadership and build our team to help ensure our customers have the highest quality Apple Services experience.

You will be responsible for building, scaling, and mentoring a high-performing SRE team that champions SRE and SWE best practices, release engineering, and data-driven decision-making. You will establish strong cross-functional partnerships to ensure reliability and resiliency are embedded throughout the system lifecycle, from design and development to deployment and operations. Your leadership will help ensure Apple's Data Platform services meet demanding availability, latency, resilience, and security requirements while continuously improving operational maturity. We are looking for a leader who is deeply passionate about operating mission-critical, globally distributed systems, preventing outages, learning from failures, and driving long-term reliability improvements.

Responsibilities

Partner with US SRE counterparts and product engineering to define and execute the reliability engineering vision, strategy, and roadmap for Apple Data Platform managed services.
Lead day-to-day execution of reliability initiatives, including sprint planning, prioritization, and retrospectives with a strong focus on operational outcomes.
Establish and own SLIs, SLOs, and error budgets, using metrics and observability to guide engineering trade-offs and reliability investments.
Promote automation and operational efficiency through tooling, testing, deployment pipelines, and self-healing systems.
Mentor and develop engineers through regular one-on-ones, career planning, and performance feedback, fostering a culture of ownership and continuous improvement.
Collaborate proactively and present effectively to communicate ideas and represent the deliverables and needs of the SRE team to ASE leadership.
Collaborate with recruiting to attract and hire top reliability engineering talent.
Advocate for reliability, resilience, and operational excellence across multiple product and platform teams.
Participate in production on-call and incident management.

Minimum Qualifications

10+ years of experience in software engineering, systems engineering, or infrastructure engineering.
5+ years of experience in a management role focused on leading, hiring, developing and building teams.
Ability to weigh in on architectural decisions and align engineering execution with product and business needs.
Hands-on experience with reliability engineering, SRE, or large-scale production operations.
Practical experience in Python, Golang, and/or Java.
Knowledge of the Linux operating system, containers and virtualization, and standard networking protocols and components.
Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts.
Experience with cloud computing technologies (particularly Kubernetes).
Ability to lead cross-functional collaboration and influence technical decisions across teams.
Excellent written and verbal communication skills.

Preferred Qualifications

Experience in defining and operating SLO-based reliability and resiliency programs.
Strong knowledge of observability systems (metrics, logging, tracing) and qualification engineering.
Proficiency with the architecture, deployment, performance tuning, and troubleshooting of open-source data analytics and processing technologies, especially Apache Spark, Flink, Trino, Druid, and/or other related software.
Working experience with AI, large language models, and other efficiency or automation tools.