Key Skills: Splunk, New Relic, Site Reliability Engineer, GCP, Azure, Monitoring
Roles and Responsibilities:
- Collaborate with Product, Engineering, Security, Operations, and Infrastructure teams, as well as external vendors, to improve system reliability.
- Design, develop, and implement monitoring and alerting for infrastructure and applications to ensure high availability and performance.
- Monitor production environments and respond to incidents to reduce downtime and service impact.
- Analyze recurring issues and recommend long-term solutions to improve system stability.
- Drive automation initiatives to reduce manual effort and operational risk.
- Proactively identify opportunities to optimize performance, reliability, and cost efficiency.
- Maintain and enhance documentation, runbooks, and operational procedures for SRE teams and stakeholders.
- Participate in post-incident reviews and contribute to continuous improvement initiatives.
Skills Required:
- Strong experience in Site Reliability Engineering practices, including availability, performance, and incident management.
- Hands-on expertise in monitoring and observability tools such as Splunk and New Relic.
- Experience designing and implementing infrastructure and application monitoring solutions.
- Good understanding of cloud platforms such as Azure and GCP.
- Ability to analyze system behavior and troubleshoot complex production issues.
- Experience improving operational metrics such as Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
- Strong problem-solving skills with a focus on automation and continuous improvement.
- Ability to collaborate effectively with cross-functional technical and non-technical teams.
- Strong documentation skills for maintaining operational runbooks and knowledge repositories.
Education: Bachelor's degree in Computer Science, Engineering, or a related field.