Role Overview
We are looking for a skilled Spark on Kubernetes Support Engineer to provide L1/L2 support for large-scale data platforms. This role involves monitoring, troubleshooting, and optimizing Spark workloads running on Kubernetes to ensure high availability and performance of data pipelines.
Key Responsibilities
- Serve as the first level of escalation for 24×7 monitoring of Spark (batch & streaming) workloads on Kubernetes
- Troubleshoot Spark job failures, performance issues, and resource bottlenecks
- Diagnose Kubernetes issues (pod failures, OOMKilled, evictions, DiskPressure, scaling issues)
- Monitor Spark UI, cluster health, and resource utilization
- Collaborate with development teams to debug and optimize pipelines
- Handle Sev1/Sev2 incidents, including root-cause analysis (RCA) and war-room coordination
- Build and maintain monitoring dashboards and alerting frameworks (Prometheus/Grafana/ELK)
- Support CI/CD pipelines and deployment automation using Azure DevOps
- Maintain SOPs and runbooks, and drive continuous improvement
Required Skills
- 3–10 years of experience in Big Data, Distributed Systems, or Cloud Support
- Strong expertise in Apache Spark (Core, SQL, Structured Streaming)
- Hands-on experience with Spark on Kubernetes
- Good understanding of Kubernetes architecture & troubleshooting
- Experience with Azure DevOps (CI/CD pipelines, Git, deployments)
- Strong knowledge of Linux, SQL, and scripting (Python/Shell)
- Familiarity with monitoring tools: Prometheus, Grafana, ELK
Good to Have
- Experience with Kafka / streaming ecosystems
- Exposure to cloud platforms (Azure/AWS/GCP)