
About the role
We're hiring a Lead SRE to set the reliability bar for a multi-sided
marketplace serving global users (primarily US). You'll define how we run production
across multiple clouds and four environments, lead incident response, and build the
practices the rest of the SRE team operates by.
What you'll own
Reliability strategy: SLOs, error budgets, and uptime targets across all critical
services
Incident command for Sev-1s — triage, communication, resolution, and RCAs
that actually prevent repeats
Multi-cloud architecture decisions (Azure primary, GCP secondary) and the
operational contract between hosted frontend/backend platforms and our own
container workloads
Release engineering across web and mobile (Android + iOS) — staged rollouts,
OTA updates, rollback discipline, per-environment previews across dev, QA,
staging, prod
Monitoring and alerting strategy end-to-end: dashboards, signal-to-noise, paging
policy, and on-call health
Third-party reliability posture — payments (Stripe), comms (Twilio, LiveKit), and
AI/LLM relays — with monitoring and graceful-degradation playbooks
DR strategy, backup validation, and compliance-relevant operations for a global
marketplace
Hiring, mentoring, and on-call rotation design for the SRE team
Must have
7–10 years in production SRE roles, with time spent leading incident response at
scale
Deep experience operating SaaS on at least one major cloud (Azure preferred);
working knowledge of a second
Track record running services on hosted platforms (e.g., Vercel-class) alongside
owned container infrastructure
Experience with mobile release pipelines for Android + iOS across multiple environments
Strong opinions on observability, alerting hygiene, and postmortem culture
Calm incident command; clear writing; comfort with cross-time-zone operations
Nice to have
Marketplace or two-sided-platform experience, real-time communications
operations, LLM/AI service reliability, SOC 2 or CCPA exposure.
Job ID: 147491001