Site Reliability Engineer
About the Role
As a Site Reliability Engineer, you will be responsible for supporting and maintaining various Cloud Infrastructure Technology Tools in our hosted production/DR environments. You will serve as the subject matter expert for specific tool(s) or monitoring solution(s), taking charge of testing, verifying, and implementing upgrades, patches, and new implementations. You will also partner with other service functions to investigate and/or improve monitoring solutions. This role offers opportunities to mentor team members or provide training to cross-functional teams as required, and you may be assigned to produce regular and ad-hoc management reports.
Key Responsibilities
- Support and maintain various Cloud Infrastructure Technology Tools in hosted production/DR environments.
- Act as the subject matter expert for specific tool(s) or monitoring solution(s).
- Test, verify, and implement upgrades, patches, and new tool implementations.
- Partner with other service functions to investigate and/or improve monitoring solutions.
- Mentor one or more tools team members or provide training to other cross-functional teams as required.
- May motivate, develop, and manage the performance of individuals and teams while on shift.
- May be assigned to produce regular and ad-hoc management reports in a timely manner.
- Design, develop, and maintain observability tools and infrastructure.
- Collaborate with other teams to ensure observability best practices are followed.
- Develop and maintain dashboards and alerts for monitoring system health.
- Troubleshoot and resolve issues related to observability tools and infrastructure.
Qualifications
- Bachelor's degree in Information Systems, Computer Science, or a related discipline.
- 5-8 years of relevant experience.
- Proficiency in Splunk/ELK and Datadog.
- Experience with observability tools such as Prometheus/InfluxDB and Grafana.
- Strong knowledge of at least one scripting language, such as Python, Bash, PowerShell, or another relevant language.
- Experience with Enterprise Software Implementations for Large Scale Organizations.
- Extensive experience with prevalent market technologies such as SaaS, Cloud, Hosting Services, and Application Management Services.
- Experience deploying application and infrastructure clusters in a Public Cloud environment using a Cloud Management Platform.
- Professional and positive with outstanding customer-facing practices.
- Can-do attitude, willing to go the extra mile.
- Consistently follows up and follows through on delegated tasks and actions.