Oracle Cloud Infrastructure (OCI) is a strategic growth area for Oracle. OCI delivers a comprehensive cloud platform spanning Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). As OCI continues to scale its AI Infrastructure footprint globally, maintaining fleet availability, repair efficiency, and operational excellence is critical to customer success.
The Fleet Health organization is responsible for improving AI Infrastructure availability and repair process efficiency by bringing together engineering, operations, and program management under a unified operating model. The organization owns strategic customer fleet tracking, repair coordination, spares management, platform issue tracking, tooling requirements, and executive reporting.
We are seeking a Principal Technical Program Manager to join the AI Infrastructure Repair Program team in Bangalore. In this role, you will lead complex, cross-functional programs focused on AI Infrastructure repair operations, fleet health, customer availability, and operational governance. You will work across engineering, SRE, operations, supply chain, hardware partners, and executive stakeholders to drive measurable improvements in fleet availability, repair execution, and operational efficiency.
This is a high-visibility role requiring strong technical program management capabilities, operational rigor, and the ability to influence across organizational boundaries while operating in a rapidly scaling environment.
Program Ownership and Execution
- Lead one or more major AI Infrastructure Repair domains, including strategic customer availability, partner repair execution, RMA/spares governance, repair workflow optimization, data and reporting frameworks, or engineering platform tooling.
- Translate availability and operational gaps into structured programs with clearly defined scope, milestones, KPIs, owners, dependencies, risks, and success criteria.
- Establish and drive operational mechanisms that ensure accountability, visibility, and execution excellence across repair programs.
- Own program reviews, executive reporting, and operational readiness assessments.
Fleet Health and Availability Management
- Drive programs focused on improving AI Infrastructure fleet availability and repair performance.
- Monitor operational health indicators and identify systemic risks affecting customer availability.
- Lead cross-functional recovery efforts for high-priority fleet health issues and customer-impacting events.
- Partner with operations and engineering teams to ensure durable corrective actions are implemented.
KPI Governance and Executive Reporting
- Define, implement, and govern key performance indicators for repair health and fleet operations, including:
  - Fleet Availability
  - Unavailable Host Backlog
  - Repair Cycle Time
  - Repair Success Rate
  - Reopen Rate
  - Spare Availability
  - RMA Performance
  - Partner Responsiveness
  - SLA/SLO Compliance
- Build scalable reporting frameworks and dashboards that provide leadership visibility into repair execution, operational risk, and customer impact.
- Translate data into actionable insights and executive recommendations.
Cross-Functional Leadership
- Drive alignment across engineering, SRE, data center operations, supply chain, hardware vendors, and partner organizations.
- Partner with globally distributed teams across India, Morocco, Mexico, and the United States to execute critical repair and fleet health initiatives.
- Influence stakeholders and drive decisions in a highly matrixed environment without direct authority.
- Facilitate discussions that align priorities, remove blockers, and accelerate execution.
Incident, Risk, and Change Management
- Support incident-style escalation management for clusters whose availability is at risk.
- Establish governance mechanisms for issue prioritization, risk management, and escalation handling.
- Drive root cause analysis efforts and ensure effective corrective and preventive actions are implemented.
- Identify operational risks early and proactively develop mitigation strategies.
Continuous Improvement and Automation
- Lead process optimization initiatives that improve repair efficiency, reduce manual effort, and increase operational scalability.
- Standardize repair playbooks, escalation paths, governance mechanisms, and reporting processes.
- Partner with engineering teams to identify and prioritize automation opportunities that improve operational productivity.
- Advocate for tooling investments that improve fleet visibility, triage efficiency, and repair execution.
Minimum Qualifications
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related technical field.
- 8+ years of experience in Technical Program Management, Program Operations, Release Management, Infrastructure Operations, or related disciplines.
- 5+ years leading large-scale, cross-functional technical programs from conception through execution.
- Demonstrated experience managing complex programs involving multiple stakeholders, dependencies, and operational risks.
- Strong analytical skills with experience developing metrics, dashboards, KPIs, and executive reporting.
- Proven ability to drive execution across engineering, operations, and business teams.
- Excellent verbal and written communication skills, including executive-level communication.
- Demonstrated ability to influence and drive accountability across matrixed organizations.
- Strong organizational skills with the ability to manage multiple competing priorities simultaneously.
Preferred Qualifications
- Master's degree in Engineering, Computer Science, Business Administration, or a related field.
- Experience with cloud infrastructure, AI/ML infrastructure, GPU operations, fleet management, or large-scale distributed systems.
- Experience supporting AI training or inference environments.
- Familiarity with GPU platforms including NVIDIA and AMD ecosystems.
- Experience with hardware repair operations, supply chain programs, spares management, or RMA processes.
- Experience working with SRE, Infrastructure Engineering, or Data Center Operations organizations.
- Experience building operational reporting, telemetry, dashboards, or automation solutions.
- Knowledge of incident management, operational readiness, change governance, and service reliability practices.
- PMP, Scrum Master, SAFe, ITIL, or similar program management certifications.
Success Measures
Success in this role will be measured through:
- Improvements in AI Infrastructure fleet availability and repair efficiency.
- Reduction in repair backlog and cycle time.
- Increased visibility into customer fleet health and operational risk.
- Effective execution of strategic repair and fleet health programs.
- Increased automation and reduction of operational toil.
- Strong stakeholder satisfaction and executive confidence in reporting and governance mechanisms.
Career Level - IC4