Site Reliability Engineer
What you will do:
You'll be a key part of our Infrastructure Platform team, focusing on the critical infrastructure that powers . Beyond core infrastructure work, you'll also collaborate closely with a product development team, offering your expertise to coach and guide them on infrastructure and architectural decisions.
In your day-to-day, you will:
- Build and maintain our production infrastructure to ensure scalability and high availability, while maximizing development team efficiency.
- Troubleshoot and debug issues related to both product and infrastructure.
- Automate everything! If something's worth doing, it's definitely worth automating.
- Improve and extend our Kubernetes platform, which leverages EKS.
- Provide crucial insights into scalability for our developers.
- Participate in an on-call rotation to support our production systems.
Who you are:
- You're someone who loves ownership: you design it, you build it, you own it! You're a self-motivated individual and a strong team player within the Infrastructure Platform team.
- You have at least 2 years of experience working as a DevOps Engineer (or a similar role like Software Engineer or Cloud Engineer).
- You have proven experience in architecting systems based on both functional and non-functional requirements.
Your qualifications
You should be proficient in, or have solid knowledge of:
Observability & Reliability
- SLO/SLI Management: Experience defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure service health.
- Modern Observability: Proficiency with high-cardinality observability platforms; Honeycomb experience is a major plus, but experience with similar tools (e.g., New Relic, Datadog) is welcome.
- Proactive Monitoring: Proven ability to move beyond basic threshold alerts toward trend-based, proactive alerting and distributed tracing.
- Incident Response: Experience with blameless post-mortems and a focus on reducing toil through automation.
Infrastructure & Orchestration
- Containerization: Proficiency in container orchestration and related technologies such as Kubernetes and Docker.
- Service Mesh: Experience with Istio for traffic management, security, and microservices observability.
- Public Cloud: Strong hands-on experience with AWS.
- Linux: Deep knowledge of Linux-based systems.
Automation & Data
- CI/CD: Experience with Jenkins or GitHub Actions.
- Cloud Orchestration: Proficiency in Terraform and Ansible for automation and service configuration.
- Data Engines: Familiarity with SQL, NoSQL, OpenSearch, and AWS S3.
- Programming: Proficiency in at least one of our core languages: Python, TypeScript, or Java.