The Role
As a Senior Site Reliability Engineer, you will own and operate cloud-native infrastructure and Kubernetes platforms that power customer-facing services at scale. You will design and optimize CI/CD workflows, improve deployment reliability, and drive observability, incident management, and performance improvements across the organization. You will build platform tooling to improve developer velocity, enforce security guardrails, and standardize best practices. This role expects strong ownership, architectural thinking, and mentorship of junior engineers.
What makes this role special
- Full Platform Exposure: Work across DevOps, infrastructure, observability, performance, and reliability
- Architecture Ownership: Influence platform and tooling decisions using benchmarks and metrics
- High Impact: Build systems that reduce deployment TAT, improve p99s, and scale across teams
- Flexibility: Freedom to work across stacks, tools, and evolving platforms
What skills & experience do you need
- 25 years of experience operating customer-facing services at scale
- Strong hands-on experience with Kubernetes cluster operations and workload optimization
- Experience with service mesh and distributed tracing tools (e.g., Istio, Jaeger)
- Comfortable with at least one cloud provider (AWS preferred; GCP or Azure acceptable)
- Hands-on experience with monitoring and alerting stacks (Prometheus, Grafana, Thanos, Datadog, New Relic)
- Proven experience designing robust CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
- Proficiency in Infrastructure as Code (Terraform or Pulumi)
- Strong programming skills in Python, Go, or Java/Kotlin, plus shell scripting
- Experience with databases such as MySQL and MongoDB, including application and query profiling
- Solid understanding of security best practices and compliance
- High-ownership mindset with the ability to proactively identify and resolve platform issues