Pearson is looking for a dynamic and experienced Manager - Site Reliability Engineering (SRE) to join our team. This individual will play a critical role in ensuring the stability, performance, and scalability of our infrastructure. If you have excellent leadership skills, deep technical expertise, and the ability to thrive in a fast-paced, collaborative environment, we encourage you to apply.
Key Responsibilities
Leadership and Team Management
- Lead, mentor, and develop a team of highly skilled Site Reliability Engineers.
- Promote a culture of continuous improvement and high performance.
- Foster collaboration and communication within the team and with other departments.
- Monitor team performance and provide constructive feedback.
Technical Expertise
- Oversee the design, implementation, and maintenance of reliable and scalable infrastructure.
- Develop and enforce best practices for system reliability, monitoring, and incident management.
- Ensure the availability, performance, and security of our services.
- Collaborate with software engineering teams to design and implement solutions that improve system reliability and performance.
- Use automation and DevOps practices to streamline operations and improve productivity.
- Experience with Terraform is required.
- Extensive knowledge of multi-cloud environments is an advantage.
Collaboration and Communication
- Work closely with cross-functional teams, including engineering, product management, and operations, to ensure alignment and successful project execution.
- Communicate effectively with stakeholders at all levels, providing regular updates on SRE initiatives and performance metrics.
- Facilitate incident response and post-mortem meetings, ensuring thorough analysis and follow-up on action items.
Qualifications
- Education: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Experience: Proven experience in a leadership role within a Site Reliability Engineering or DevOps team.
- Strong technical background with extensive knowledge of cloud infrastructure, containerization, automation, and monitoring tools.
- Proficiency in scripting languages such as Python, Bash, or similar.
- Excellent problem-solving skills and a proactive approach to identifying and mitigating risks.
- Exceptional communication and interpersonal skills.