About the Role
We are looking for a Senior Engineer to join our Cloud Platform team and contribute to the development, operation, and reliability of our multi-tenant SaaS platform. You will work on backend systems and cloud infrastructure — building features, fixing real production problems, and growing into broader ownership over time.
This is a hands-on role with real ownership. You are expected to take a feature or problem end-to-end, write production-quality code, and work closely with more senior engineers on design and architecture.
What You Will Do:
Engineering & Development
- Build and maintain platform components across control plane and data plane
- Implement features for tenant provisioning, configuration management, and cluster operations
- Write clean, well-tested, production-grade code and participate actively in code reviews
- Debug and resolve issues in cloud-native, distributed production environments
Operations & Reliability
- Partner with SRE on observability, alerting, and incident response for services you own
- Improve reliability and operability of platform systems — reduce toil, improve monitoring, fix recurring issues
- Contribute to on-call and develop strong instincts for production system behaviour
Collaboration & Growth
- Work closely with senior and principal engineers on design discussions and RFCs
- Collaborate with cross-functional teams across geographies
- Participate in hiring — conduct interviews and contribute feedback
Must Have:
Cloud & Infrastructure
- Solid hands-on experience with at least one of AWS, GCP, or Azure — compute, networking, IAM, and managed services
- Working knowledge of infrastructure as code — Terraform or equivalent — for provisioning and managing cloud resources
- Comfortable working with containerized workloads — Docker, Kubernetes basics (deployments, services, config maps, RBAC, namespaces)
Backend Engineering
- Strong programming skills in at least one backend language — Go, Java, Python, or equivalent
- Experience building and operating REST or gRPC APIs in production
- Good understanding of databases — relational and NoSQL — and how to use them reliably at scale
Distributed Systems
- Practical understanding of distributed systems — service dependencies, failure modes, retries, timeouts, and basic HA patterns
- Hands-on with observability — structured logging, metrics dashboards (Grafana, Datadog, or equivalent), and basic alerting; able to diagnose production issues using these tools
Security Fundamentals
- Awareness of cloud security basics — IAM least-privilege, secrets management, and network access controls
AI-Augmented Engineering
- Uses AI coding assistants — Claude, Cursor, Copilot — as a regular part of daily workflow for writing, debugging, and reviewing code
- Comfortable using AI tools to understand unfamiliar codebases, generate boilerplate, draft documentation, and speed up routine tasks
- Knows to review AI output carefully and apply judgment before committing or deploying
Good to Have:
- Exposure to Kubernetes operators, controllers, or CRDs
- Familiarity with GitOps workflows — ArgoCD, Flux, or equivalent
- Basic understanding of multi-tenancy concepts — isolation, resource quotas, or tenant lifecycle
- Experience contributing to or building internal observability dashboards and alerting pipelines
What Success Looks Like
In 3 months:
- Ramped up, productive, and shipping features with guidance
- Comfortable with the codebase, deployment process, and team workflow
- Engaging actively in code reviews and team discussions
In 6 months:
- Owning features end-to-end with increasing independence
- Reliable on-call contributor — able to diagnose and resolve common production issues
- Proactively improving the quality and observability of systems you work on
In 12 months:
- Trusted, high-output contributor on the team
- Beginning to mentor interns or junior engineers
- Taking on broader problems that span multiple components or systems