Job Description Summary
The Director, Cloud Operations provides leadership, innovation, and oversight for SRE and CloudOps across PCS. The role establishes the operating foundations, metrics, and automation needed to run mission-critical, greenfield applications with high reliability and security, and is accountable for meeting product SLAs while scaling Cloud Operations and institutionalizing modern SRE practices in close partnership with product, platform, and security teams.
Job Description
Essential Responsibilities
Serve as the functional leader for the PCS Digital Cloud Operations team. Define the operating model, governance, and KPIs; drive automation and observability; and ensure secure, reliable deployments across environments with continuous improvement and tight collaboration with security. This role reports to the VP of Engineering PCS Apps & Platform. Key responsibilities include:
- Own Cloud Operations for PCS cloud applications; stand up and scale CloudOps capabilities to support multiple products while adhering to committed SLAs.
- Institutionalize SRE practices: implement SLI/SLO/SLA frameworks, error budgets, incident/postmortem processes, and reliability runbooks; champion automation to reduce toil and improve service health and monitoring.
- Build end-to-end observability (APM/RUM, logs, metrics, traces, health dashboards, proactive alerting) and evolve toward auto-healing and AIOps for anomaly detection and closed-loop remediation.
- Drive change, incident, and problem management with clear RACI and stakeholder communications; reduce MTTR through streamlined L1-L4 escalation.
- Establish and test DR/BCP posture; conduct AWS Well-Architected and operational readiness reviews for services (AWS-first, with multi-cloud considerations as needed).
- Lead FinOps practices: cost allocation and accountability, rightsizing, savings plans/reserved instances, spend governance, and unit-economics optimization.
- Evolve the operating model in partnership with platform and application teams; standardize CI/CD templates and everything-as-code for speed and repeatability.
- Build and develop a high-performing team: hire, coach, and grow CloudOps/SRE talent and the next set of leaders; uphold high standards for quality and customer satisfaction.
Core KPIs & Outcome Metrics
- Service availability versus SLA/SLO and error-budget burn rate.
- MTTD/MTTR and incident recurrence; % incidents with postmortems completed.
- Change failure rate and lead time for changes in production deployments.
- % automated runbooks/toil reduction; % services with complete SLI/SLO coverage.
Basic Qualifications
- Bachelor's degree in computer science or a STEM field.
- A minimum of 10 years of experience leading technical teams in complex, fast-paced environments, including 5+ years in Cloud Ops and SRE leadership roles.
- Proven expertise in DevSecOps, Day-2 Ops, APM/RUM, and Cloud Operations.
- Proficiency building and operating services on public cloud (AWS-first) with CI/CD and Infrastructure-as-Code (e.g., Terraform/CloudFormation).
- Track record establishing SLIs/SLOs/SLAs, observability, and incident/change management at scale.
- Strong leadership and team management skills, with the ability to inspire and motivate a team of engineers.
- Excellent project management skills, with the ability to manage multiple complex projects simultaneously.
- In-depth knowledge of SaaS technologies, cloud computing, and medical device development processes.
Desired Characteristics
Technical Competencies
- Experience scaling CloudOps/SRE for multiple products and customer deployments.
- Deep fluency in SLI/SLO/SLA design, error budgets, runbooks, and auto-healing patterns.
- Strong AWS architecture and operations; Well-Architected reviews; capacity and cost optimization (FinOps).
- Modern observability (APM/RUM/logs/metrics/traces) and AIOps for predictive analytics/anomaly detection.
- Security by design (DevSecOps, policy-as-code) and DR/BCP planning/testing.
Leadership Competencies
- Clear, decisive communicator able to influence across product, platform, and security stakeholders.
- Builder-coach mindset: hire, mentor, and grow managers and ICs; create leaders of leaders.
- Change agent who challenges the status quo while maintaining high standards for quality and customer satisfaction.
- Operates with ownership, bias for action, and strong judgment in an ambiguous, high-growth environment.
Top 5 Critical Competencies & Skills
- SRE & Reliability Leadership: SLI/SLO/SLA management, error budgets, disciplined postmortems.
- Cloud Operations at Scale (AWS-first): operational readiness, DR/BCP, change/incident/problem management, and Well-Architected operations.
- Observability & AIOps: end-to-end telemetry, APM/RUM, automated remediation to reduce MTTR and toil.
- DevSecOps & Policy-as-Code: secure-by-default pipelines and vulnerability management with measurable SLAs.
- FinOps & Cost Governance: cost allocation, rightsizing, and spend optimization to improve unit economics while scaling.
Additional Information
Relocation Assistance Provided: No