About Abacus.AI:
Abacus.AI is an AI research and SaaS company helping enterprises build, deploy, and operate real-time deep learning systems in production. We work with some of the most data-driven teams globally, and we're scaling fast.
The Role:
We're looking for a Cloud Infra Support Engineer to own production reliability, work closely with customers, and act as an on-call operations support engineer during US (PST) night hours.
What You'll Do:
- Own and support production cloud infrastructure (AWS / GCP / Azure)
- Act as on-call ops support during PST night hours (rotational, not continuous)
- Monitor systems, alerts, and dashboards; respond to incidents and outages
- Troubleshoot customer escalations and production issues
- Work hands-on with Kubernetes-based platforms and customer onboarding
- Build and improve monitoring, alerting, and automation
- Represent customer needs and influence platform and infra improvements
What We're Looking For:
- 2+ years in Cloud Infra / DevOps / SRE / Ops roles
- Strong experience managing production environments
- Hands-on experience with Kubernetes, Terraform (IaC), and CI/CD
- Solid monitoring & observability experience (Prometheus, Grafana, Datadog, CloudWatch, etc.)
- Proven on-call / incident response experience
- Comfortable working night shifts aligned with US (PST) time
- Familiarity with Spark, TensorFlow, GPUs, or MLOps (nice to have)
- Strong troubleshooting and customer-facing communication skills
What We Offer:
- Competitive salary and equity package
- Opportunity to work with cutting-edge AI technology
- Collaborative and innovative work environment
- Professional development and learning opportunities
Culture:
We believe in giving everyone autonomy and ownership and don't believe in over-management. We have a hands-off work-from-home environment where each individual has personal responsibility.