
Job Description
We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This hands-on role focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, preferably on Google Cloud Platform (GCP), with AWS as an alternative.
Roles & Responsibilities
Reliability & Operations
- Design, implement, and maintain highly available and resilient systems in Kubernetes-based environments
- Define and enforce SLOs, SLIs, and error budgets
- Lead incident response, RCA, and postmortems
- Drive reliability improvements through automation
Observability (Core Focus)
- Architect and operate observability platforms for metrics, logging, tracing, and alerting
- Work with Prometheus, Alertmanager, OpenTelemetry, Grafana, Loki / ELK / OpenSearch
- Implement cloud-native monitoring (GCP Cloud Monitoring & Logging preferred)
- Establish actionable alerting standards
Cloud & Platform Engineering
- Build and manage infrastructure on GCP (preferred) or AWS
- Operate Kubernetes clusters (GKE preferred)
- Deploy services using Helm
- Manage containerized workloads using Docker
Automation & Tooling
- Develop automation, reliability, and observability tooling in Python (strong Python skills required)
- Create internal reliability and monitoring tools
- Integrate CI/CD pipelines with observability and reliability checks
Collaboration & Leadership
- Mentor junior engineers
- Influence architecture decisions
- Collaborate across engineering teams
Job ID: 147202815
Skills:
Kubernetes, GitHub, Jira, Grafana, AWS, Prometheus, Bash, Python, Docker, Terraform, Confluence, Helm, Jenkins, Git, GitHub Actions, Loki, Go, GitOps, CircleCI, Infrastructure as Code, PagerDuty, CI/CD systems
Skills:
Red Hat, Golang, Perforce, Prometheus, Datadog, SVN, Docker, Terraform, GitLab, Python, AWS, CloudFormation, Ubuntu, Jenkins, CloudWatch, GCP, Linux, Ansible, ECS, CentOS, Kubernetes, Alertmanager, Deployment Manager, Rancher, Thanos, GKE, Amazon Linux, EKS
Skills:
Performance Testing, Microservices, Jenkins, Terraform, Docker, Automation Frameworks, Helm, Kubernetes, Azure DevOps, observability frameworks, IaC, CI/CD, GitHub Actions, chaos engineering
Skills:
Docker, Kubernetes, SRE practices, DevOps practices
Skills:
CLI, Programming, Grafana, Puppet