A Site Reliability Engineer is a professional who acts as a warrior for customer applications, monitoring and protecting them and taking charge of operational tasks to ensure the efficient functioning of the system.
They are responsible for monitoring applications, automating operational tasks, and improving application reliability, performance, and availability.
Working 24x7 rotational shifts is mandatory, as is retail domain knowledge.
Must have knowledge of production application support from Level 2 support experience.
Experience on the Shopify support side is preferred.
Hands-on experience in monitoring, logging, alerting, dashboarding, and report generation with monitoring tools such as AppDynamics/Splunk/Dynatrace/Datadog/CloudWatch/ELK/Prometheus/New Relic.
The customer for this engagement uses New Relic and PagerDuty, so expertise in both is good to have.
Should know how to write SQL queries to fetch data from databases and from observability tools.
Must have knowledge of the ITIL framework, specifically alerts, incident management, change management, CAB, production deployments, and risk and mitigation plans.
Should be able to lead P1 calls, brief the customer on the P1, and proactively bring leads and customers into the P1 calls through to RCA.
Experience working with Postman.
Should have knowledge of building and executing SOPs and runbooks, and of working with ITSM platforms (JIRA/ServiceNow/BMC Remedy).
Should know how to work with the Dev team and cross-functional teams across time zones.
Should be able to generate WSRs/MSRs by extracting tickets from ITSM platforms.