Role Summary
Owns the operational backbone: observability (OTel / Prometheus / logs / traces), CI/CD for both application services and infra code, disaster recovery runbooks, and the delivery cadence that makes the platform operable day-to-day.
What You'll Do
- Stand up the observability stack: OpenTelemetry, Prometheus (tenant-labelled), Loki / Elastic, Grafana, alerting.
- Build CI/CD pipelines for application services (container build, test, sign, push) and for IaC (plan, policy-check, apply).
- Define and own the SLOs, error budgets, and alert routing for each service.
- Author and drill DR / failover runbooks: cloud region flip, VPN primary-to-secondary, DC orchestrator failure.
- Own the release train: environment promotion, feature-flag rollout, canary playbook, rollback drills.
- Lead production incidents, post-mortems, and follow-up action tracking.
- Define capacity and cost dashboards; partner with the backend engineer on the metering-to-cost pipeline.
Must Have
- Strong Kubernetes ops: Argo CD, GitOps, Helm, operator patterns.
- Terraform plus one of Pulumi or Ansible for IaC; opinionated about policy-as-code (OPA / Conftest / Sentinel).
- OpenTelemetry + Prometheus in production; distributed tracing.
- On-call experience, incident command, post-mortem culture.
- Comfortable with Go or Python for ops tooling.
Nice To Have
- SLO-driven autoscaling experience on GPU workloads.
- Multi-region active/passive (A/P) failover orchestration background.
- Sovereign-cloud / air-gapped CI/CD experience.
(ref:hirist.tech)