Understanding project KPIs, SLIs, SLOs, MTTD, MTTR, error budgets, and chaos engineering, and eliminating toil through automation
Exploring observability tools and creating/implementing dashboards
Running the production environment by monitoring availability and taking a holistic view of system health
Incident management: knowledge of handling incidents, participating in blameless postmortems, performing root cause analysis, and implementing post-incident reviews.
What you bring:
4 to 6 years of experience with Site Reliability Engineering (SRE) practices and production support, working with Sybase, SQL, Unix, and shell scripting commands. Management, SVN, Jenkins.
Experience supporting Unix/Linux/Windows-based application environments
Experience with system and application monitoring and observability tools: Splunk, Prometheus, Grafana, Dynatrace.
Hands-on experience writing PowerShell/Python/shell automation scripts.
Exposure to the latest SRE, cloud, and DevOps technologies, plus knowledge of containers, Docker, and Kubernetes/OpenShift tools.
Skills in using tools like Terraform and Ansible to automate infrastructure management.
Added bonus if you have:
Knowledge of FIS products and services
Knowledge of the business goals, objectives and business operations for the appropriate FIS organization