Opportunity
We are looking for SREs who want to define what reliability means for the next generation of industrial software. The work spans defining SLIs/SLOs, building observability platforms, and establishing incident management processes.
Responsibilities
- Define and implement SLI/SLO frameworks for complex engineering systems across manufacturing and industrial clients
- Design and deploy observability platforms using Prometheus, Grafana, and Datadog
- Establish incident management processes and lead blameless post-mortems
- Implement chaos engineering practices to proactively identify system weaknesses
- Drive toil elimination through automation and platform improvements
- Build reliability engineering capabilities within the practice and client organisations
Essential Skills
- SLI/SLO definition and implementation at enterprise scale
- Observability: Prometheus, Grafana, Datadog, New Relic
- Incident management and post-mortem facilitation
- Chaos engineering: Gremlin, Chaos Monkey, Litmus
- Python testing for reliability validation and automated runbooks
- Automation and scripting: Python, Go, Bash
- Cloud platforms: AWS, Azure, GCP
Experience
5-10 years in SRE or Production Engineering roles, with experience in enterprise or industrial environments