Spydra - Senior Site Reliability Engineer

spydra

Bengaluru, India

Fresher

Posted an hour ago

Job Description

Role Summary

Owns the operational backbone: observability (OTEL / Prometheus / logs / traces), CI/CD for both application services and infra code, disaster recovery runbooks, and the delivery cadence that makes the platform operable day-to-day.

What You'll Do

Stand up the observability stack: OpenTelemetry, Prometheus (tenant-labelled), Loki / Elastic, Grafana, alerting.
Build CI/CD pipelines for application services (container build, test, sign, push) and for IaC (plan, policy-check, apply).
Define and own the SLOs, error budgets, and alert routing for each service.
Author and drill DR / failover runbooks: cloud region flip, VPN primary-to-secondary, DC orchestrator failure.
Own the release train: environment promotion, feature-flag rollout, canary playbook, rollback drills.
Lead production incidents, post-mortems, and follow-up action tracking.
Define capacity and cost dashboards; partner with the backend engineer on the metering-to-cost pipeline.

Must Have

Strong Kubernetes ops: Argo CD, GitOps, Helm, operator patterns.
Terraform plus one of Pulumi or Ansible for IaC; opinionated about policy-as-code (OPA / Conftest / Sentinel).
OpenTelemetry + Prometheus in production; distributed tracing.
On-call experience, incident command, post-mortem culture.
Comfortable with Go or Python for ops tooling.

Nice To Have