Greetings from Peoplefy!
We're looking for an SRE who can own reliability for mission-critical services on Azure, shape standards, lead incidents with calm clarity, and drive engineering excellence across teams.
Experience: 10+ years
Location: Trivandrum
Requirements:
- Strong site reliability experience
- Prior experience as a DevOps engineer; currently working as an SRE
- Strong experience with Azure
- Strong experience with AKS
- Experience working with Docker
- Experience with observability (any tool)
- Experience working with PostgreSQL
Responsibilities:
SLIs/SLOs & Error Budgets
- Define SLIs/SLOs for Tier-0/Tier-1 services & review quarterly
- Implement multi-window, multi-burn-rate alerts
- Gate changes in CI/CD based on error budgets
- Maintain Azure Monitor / Grafana / Prometheus / App Insights dashboards
- Conduct weekly SLO reviews & drive reliability roadmap
Incident Management
- Lead SEV1/SEV2 incidents, own communication & postmortems
- Ensure corrective actions are implemented
Reliability Engineering
- Implement DR, multi-AZ/region patterns, HPA/VPA/KEDA, resilient rollouts
- Harden clusters (network, identity, policy) and optimize density
- Ingress: AGIC / NGINX
Observability
- Metrics, traces, logs via Azure Monitor, App Insights, Log Analytics, Prometheus, Grafana, OpenTelemetry
- Alerts on symptoms, not noise
Automation & IaC
- Terraform / Bicep, GitOps (Flux/Argo), Azure Policy/OPA Gatekeeper
- Automate toil & build self-service runbooks/ChatOps
CI/CD Reliability
- Azure DevOps / GitHub Actions with canary, blue-green, rollback
- Key Vault-backed secrets
Performance & Capacity
- Load testing, autoscaling, FinOps collaboration
Disaster Recovery
- Define RTO/RPO, run chaos drills & game days
Security
- Entra ID, Key Vault rotation, VNets/NSGs, shift-left security in CI
Documentation
- Runbooks, SLOs, postmortems, architectures kept current & accessible
Interested candidates, please share your updated resume at [Confidential Information].