What You'll Be Doing:
- Serve as SRE lead, executing the SRE lifecycle and automation processes.
- Discover, design, and implement changes to existing IT infrastructure with a focus on improved reliability, performance, and standardization.
- Collaborate with Engineering and business units to translate customer, business, and technical requirements into SRE architectural designs and enhancements.
- Develop and analyze business and technical scenarios to inform executive-level decisions about infrastructure resources. Drive consensus and decisions with stakeholders.
- Troubleshoot production issues, provide root cause analysis, and design solutions to prevent recurrence.
- Build automated, scalable, and rigorous solutions to infrastructure problems by leveraging or developing state-of-the-art automation, mathematical optimization, and/or AI models.
- Monitor services and create intelligent alerting for faster incident detection and resolution.
- Identify opportunities to invent and simplify processes, surface business risks, and implement resolutions and scalable mechanisms.
- Ensure efficient resource utilization and continuously improve processes through automation and internal tools, resulting in enhanced service delivery, maturity, and scalability.
- Mentor and coach other SRE team members.
The Impact You Will Have:
- Enhance the reliability and performance of Synopsys IT infrastructure.
- Standardize and automate processes to increase operational efficiency.
- Translate complex requirements into actionable SRE designs and solutions.
- Provide critical insights and drive decision-making for infrastructure improvements.
- Prevent future production issues through meticulous root cause analysis and proactive solutions.
- Contribute to the scalability and robustness of our infrastructure through innovative solutions.
- Reduce incident detection and resolution times, ensuring minimal disruption.
- Streamline processes to mitigate business risks and improve scalability.
- Optimize resource utilization, ensuring cost-effective and efficient operations.
- Develop the next generation of SRE talent through mentorship and coaching.
What You'll Need:
- Extensive experience with a wide range of infrastructure technologies, such as Linux, Windows, high-performance computing, storage platforms, networking, cloud computing, cloud services (IaaS, PaaS, SaaS), virtualization, OpenStack, and containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Expertise in HPC components such as NFS and other shared file systems and grid schedulers (IBM Spectrum LSF, Univa Grid Engine, SLURM).
- Deep understanding of IT infrastructure-related services and their dependencies required to troubleshoot issues and define mitigations.
- Strong command of statistical concepts, models, and analysis and how they relate to product reliability and lifecycle analysis.
- Experience developing quantitative and qualitative analyses and metrics to solve business problems.
- Experience with developing service level indicators and objectives, instrumenting software, and building alerts.
- Hands-on experience with one or more of the following languages or frameworks: Java, Python, Go, AngularJS, Node.js.
- Implementation experience with infrastructure automation tools and frameworks such as GitHub, Maven/Gradle, Jenkins, Terraform (IaC), Ansible, and shell scripting.