Senior SRE Observability & Surrounding Systems
Location - Gurugram (On-site)
Responsibilities
- Own the end-to-end observability stack (Prometheus, Apache SkyWalking, Elasticsearch, Grafana), from ingestion to alerting.
- Operate and maintain critical surrounding systems: MongoDB, Kafka, Redis, Vault, WSO2.
- Provide L2/L3 support for platform stability and incident resolution.
- Automate monitoring, alerting, and recovery workflows using Bash/Python.
- Troubleshoot cross-layer issues spanning applications, Kubernetes, nodes, networks, and storage.
- Collaborate with DevSecOps engineers to harden platform resilience.
- Ensure observability coverage for all production services.
Profile
- 3+ years in SRE/DevOps with a focus on observability and infrastructure.
- Proven hands-on experience with Prometheus, Elasticsearch, and Apache SkyWalking or another APM tool.
- Operational expertise in MongoDB, Kafka, Redis, Vault, and WSO2 APIM, or similar systems.
- Strong scripting in Bash or Python for automation.
- Deep understanding of distributed systems and failure modes.
- Must have a track record of incident ownership.