Lead, mentor and grow a team of 3 to 4 site reliability engineers
Manage chaos engineering* and DR schedules
Crisis management
Define, implement and advocate site reliability engineering (SRE) best practices such as SLAs, SLOs, SLIs and error budgets
Validate capacity planning
Own the observability stack for applications and infrastructure (monitoring, alerting, logging, and tracing back to root cause)
Baseline performance and identify bottlenecks; automate responses wherever feasible
Manage the incident response and problem management practice, including rosters, on-call rotations, runbooks and ruthlessly objective postmortems.
Contribute to EA NFRs from a performance perspective
Requires an engineering graduate with hands-on support and troubleshooting experience across infrastructure and applications, a logical and analytical approach, skill in correlation and elimination, and good stakeholder communication.
12 to 15 years of experience in a demanding setup (banking, ecommerce). Avoid candidates from small startups, as they could be low on process and compliance and inclined to make things work anyhow.
Team size is 3 to 4 people to start with; additionally, a few resources from the current incident management team and even the DR management team can roll into this role.