Description
We are seeking a Site Reliability Engineer (Incident Management) to join our team in India. The ideal candidate will have a strong background in managing incidents and ensuring the reliability of our services. You will play a crucial role in monitoring systems, responding to incidents, and implementing processes that enhance our incident management capabilities.
Responsibilities
- Monitor systems and respond to incidents promptly to minimize downtime and service impact.
- Analyze incidents to identify root causes and develop solutions to prevent recurrence.
- Implement and maintain incident management processes and tools.
- Collaborate with development and operations teams to ensure smooth incident resolution.
- Conduct post-incident reviews to assess the effectiveness of the response and improve response strategies.
- Create and maintain documentation related to incident management procedures.
- Develop and deliver training on incident management best practices.
Skills and Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 5-10 years of experience in Site Reliability Engineering or related roles.
- Strong knowledge of incident management frameworks and methodologies.
- Proficiency in monitoring and incident response tools (e.g., PagerDuty, Opsgenie).
- Experience with cloud platforms such as AWS, Azure, or Google Cloud.
- Familiarity with scripting languages (e.g., Python, Bash) for automation tasks.
- Understanding of the software development lifecycle and deployment processes.
- Excellent problem-solving skills and the ability to work under pressure.
- Strong communication skills to collaborate effectively with cross-functional teams.