Spydra - Senior Site Reliability Engineer

  • Posted an hour ago

Job Description

Role Summary

Owns the operational backbone: observability (OpenTelemetry / Prometheus / logs / traces), CI/CD for both application services and infrastructure code, disaster recovery runbooks, and the delivery cadence that keeps the platform operable day to day.

What You'll Do

  • Stand up the observability stack: OpenTelemetry, Prometheus (tenant-labelled), Loki / Elastic, Grafana, alerting.
  • Build CI/CD pipelines for application services (container build, test, sign, push) and for IaC (plan, policy-check, apply).
  • Define and own the SLOs, error budgets, and alert routing for each service.
  • Author and drill DR / failover runbooks: cloud region flip, VPN primary-to-secondary, DC orchestrator failure.
  • Own the release train: environment promotion, feature-flag rollout, canary playbook, rollback drills.
  • Lead production incidents, post-mortems, and follow-up action tracking.
  • Define capacity and cost dashboards; partner with the backend engineer on the metering-to-cost pipeline.

Must Have

  • Strong Kubernetes ops: Argo CD, GitOps, Helm, operator patterns.
  • Terraform + one of (Pulumi / Ansible) for IaC; opinionated policy-as-code (OPA / Conftest / Sentinel).
  • OpenTelemetry + Prometheus in production; distributed tracing.
  • On-call experience, incident command, post-mortem culture.
  • Comfortable with Go or Python for ops tooling.

Nice To Have

  • SLO-driven autoscaling experience on GPU workloads.
  • Multi-region active/passive (A/P) failover orchestration background.
  • Sovereign-cloud / air-gapped CI/CD experience.

(ref:hirist.tech)

Job ID: 147305047