Partner with application developers and solution architects to ensure services are built for scale and performance.
Lead the definition of service-level objectives, agreements, and indicators (SLOs, SLAs, and SLIs) for the underlying services in collaboration with Application Development, Product, and Business Owners.
Design and develop scripts, software, and tools that improve the reliability of production systems; fix issues, respond to incidents, and take on-call responsibilities.
Improve the overall resilience of systems and provide visibility into the health and performance of services across all applications and infrastructure.
Improve service performance metrics such as latency, page load speed, and ETL, and help proactively identify performance issues across the system.
Implement monitoring solutions and create dashboards and alerts based on the four golden signals of SRE (latency, traffic, errors, and saturation), providing a single source for determining the overall performance and availability of supported services.
Write, update, and use documentation, including runbooks and playbooks.
Automate work, including infrastructure needs, testing, failover solutions, and failure mitigation.
Use chaos engineering to test what you build under real-world conditions.
Share information across DevOps and business teams, encouraging a blameless culture focused on workflow visibility and collaboration.
Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance.
Serve as technical owner to ensure delivery of SRE initiatives.
Perform deliverable reviews and coach the team in SRE areas of expertise.
Conduct ongoing competitive and best-practices research, drawing on industry resources and market trends, and liaise with internal stakeholders.