Read before applying
Required to apply: Include answers to the 8 screening questions below at the top of your application.
We won't be able to review applications that don't include answers.
About the role
We are hiring a senior Go software architect to design and build our microservices, their communication patterns, and our public/internal APIs. Your primary mandate is to make service-to-service communication fast (p95/p99 latency), stable, and evolvable as the system grows.
What you'll do
Own microservices architecture and communication
- Define service boundaries, responsibilities, and evolution strategy.
- Design service-to-service communication patterns (sync/async), including failure behavior and backpressure.
- Define API contracts (gRPC/REST), protobuf conventions, and backwards-compatible versioning rules.
Build performance + reliability into the system
- Drive measurable improvements in p95/p99 latency, throughput, and stability.
- Establish patterns for timeouts, retries/backoff, idempotency, rate limiting, circuit breaking, and graceful degradation.
- Identify bottlenecks via profiling/tracing and lead remediation across services.
Establish shared foundations (golden path for services)
- Build/standardize shared Go libraries and templates: logging, metrics/tracing, auth middleware, config, error semantics, request context, and client/server wrappers.
- Set standards for compatibility testing and contract validation (e.g., proto/API breaking-change checks).
Production readiness (in partnership with Platform/SRE)
- Define runtime requirements for services on Kubernetes (resource sizing, probes, rollout needs, concurrency limits).
- Collaborate on SLOs, dashboards, alerting expectations, and incident learnings.
Technical leadership
- Lead design reviews and ADRs, mentor engineers, and set quality bars for distributed systems engineering.
What we're looking for
- Strong experience designing and building distributed systems / microservices in production.
- Deep Go proficiency (concurrency, performance profiling, memory/CPU optimization, code quality).
- Proven track record designing APIs and service contracts that evolve safely over time.
- Strong understanding of reliability patterns (timeouts/retries, idempotency, consistency tradeoffs, failure modes).
- Practical experience using observability to debug and improve systems (tracing/metrics/logging).
- Working knowledge of Kubernetes runtime concepts (deployments, rollouts, probes, resources) to design services that operate well in production.
Nice to have
- High-performance networking experience (gRPC streaming, connection pooling, load shedding, service mesh concepts).
- Experience with event-driven architectures (Kafka/NATS/etc.) and schema evolution.
- Multi-tenant SaaS architecture and isolation patterns.
- Security foundations for services (mTLS, authN/authZ, secrets handling).
What you will not be primarily responsible for
- Day-to-day Kubernetes cluster administration.
- Owning CI/CD pipeline implementation end-to-end.
- Acting as the main on-call/SRE function (unless explicitly agreed).
What success looks like (first 3 months)
- Clear, documented microservices architecture (service boundaries + communication standards).
- Stable API/versioning rules adopted across services.
- A shared Go service template (golden path) used by the team.
- Measurable improvements in p95/p99 latency and reliability (fewer communication-related incidents, faster debugging).
Screening questions (required)
- Go concurrency: Describe a production issue you debugged involving goroutines/channels/locks. Root cause + fix.
- p95/p99 latency: Example of improving cross-service latency. Baseline, changes, measurement.
- Communication choice: When do you choose gRPC vs REST vs async events? Example + tradeoffs.
- Retries/idempotency: How do you make retries safe? Example strategy.
- Failure modes: Example of preventing/handling cascading failure. Patterns used.
- API evolution: How do you version/evolve APIs/protobufs without breaking clients?
- Observability: How would you find latency in a 10-service call chain? First signals you check.
- Consistency: Example tradeoff (strong vs eventual, transactions vs saga/outbox). What and why.