Evaluate and ensure the availability of components within their teams, and identify how to bring all services within SLO (99.XX).
- Monitor systems for implemented automation and set SLIs/SLOs together with the respective stakeholders.
- Implementation of an observability platform.
- Review all ownership data and ensure it is current and complete.
- Review volume and accuracy of bugs assigned to the team and identify opportunities to improve automated triage.
- Identify CFBT (Customer Flow Based Testing) eligible flows, develop CFBT tests and train the team on how to write and maintain them.
- Lead postmortems for any P1 or greater incidents during the rotation. Train the team on the distributed problem management process.
- Operations and design consultation for driving high reliability.
- Emergency incident response with action-oriented postmortem, RCA, and incident debriefs.
- Driving continuous improvement through toil reduction and automation.
- Application performance and availability analysis.