Senior DevOps Engineer - Docker, Terraform, Real-Time Infra
Employment type: Full-time
Seniority level: Mid-Senior
Workplace: In-person / Hybrid (India preferred, IST hours)
Industry: Software Development
Job function: Engineering, Information Technology
About the role
You will own Docker, CI/CD, infrastructure as code (IaC), monitoring, cloud deployment, and secrets management.
The product is a Python FastAPI app that spawns Node.js subprocesses, juggles WebRTC signaling and WebSocket audio, and handles long-lived real-time voice sessions.
Your job: make the software deployable, observable, and reliable, starting from scratch. If the phrase "connection draining for WebRTC during rolling deploys" makes you smile instead of wince, we should talk.
What you'll do
- Containerize the app with multi-stage Docker builds: Python 3.12 + Node.js 20 (for Pipedream MCP via npx) + FAISS CPU + audio deps
- Build the CI/CD pipeline: ruff → pytest → integration → Docker build → registry → staging → smoke → manual gate → production
- Write Terraform for the full cloud stack: ECS Fargate or Cloud Run, ALB with WebSocket upgrade, managed Redis, managed Postgres, S3/GCS, CDN, DNS, ACM
- Build the observability stack: structured JSON logs, Prometheus metrics (call latency, LLM TTFB, tool execution, concurrent connections), Grafana dashboards, PagerDuty alerts with runbooks
- Migrate secrets from .env to AWS Secrets Manager / Vault, with key rotation and per-tenant credential storage
- Configure networking: TLS, WebSocket upgrade through ALB, CORS, infra-level rate limiting, DDoS protection, VPC with private subnets
- Build load testing (k6 or Locust) simulating concurrent voice calls, chat, and MCP tool invocations
- Write operational runbooks: incident response, DR, rollback, on-call rotation, post-incident reviews
Tech you'll work with
Docker · docker-compose · GitHub Actions · Terraform · AWS (ECS Fargate, ALB, ElastiCache, RDS, S3, CloudFront, Route53, ACM, Secrets Manager, CloudWatch) or GCP equivalents · Prometheus · Grafana · k6 / Locust · Python 3.12 · Node.js 20 · WebSocket · WebRTC
What you bring
- 3+ years in DevOps / SRE / Platform Engineering
- Strong Docker: multi-stage, multi-runtime, security hardening
- Production CI/CD pipeline design (GitHub Actions, GitLab CI, or Jenkins)
- Terraform on real infrastructure (not tutorial-scale)
- AWS or GCP at production scale
- WebSocket / real-time app deployment: sticky sessions, connection draining, stateful health checks
- Prometheus / Grafana or equivalent observability stack
- Strong Linux and networking fundamentals
Nice to have
- Load testing experience (k6, Locust)
- Python app deployment (uv/pip, FastAPI, Uvicorn)
- WebRTC operational experience
Skills
Docker · Terraform · AWS · GCP · CI/CD · GitHub Actions · Kubernetes · Prometheus · Grafana · Infrastructure as Code · DevOps · Site Reliability Engineering (SRE) · WebSocket · WebRTC · Linux · Networking · Observability · ECS · Fargate · PostgreSQL · Redis
Data: SQL / NoSQL databases, message queues
Observability: CloudWatch, logging, and metrics