Job Title : SRE Data Engineer
Experience : 3 to 6 Years
Location : Pune
Background
We are seeking a proactive and technically strong Site Reliability Engineer (SRE) to ensure the stability, performance, and scalability of our Data Engineering Platform. You will work on cutting-edge technologies including Cloudera Hadoop, Spark, Airflow, NiFi, and Kubernetes, ensuring high availability and driving automation to support massive-scale data workloads, especially in the telecom domain.
Key Responsibilities
- Ensure platform uptime and application health as per SLOs/KPIs
- Monitor infrastructure and applications using ELK, Prometheus, Zabbix, etc.
- Debug and resolve complex production issues, performing root cause analysis
- Automate routine tasks and implement self-healing systems
- Design and maintain dashboards, alerts, and operational playbooks
- Participate in incident management, problem resolution, and RCA documentation
- Own and update SOPs for repeatable processes
- Collaborate with L3 and Product teams for deeper issue resolution
- Support and guide the L1 operations team
- Conduct periodic system maintenance and performance tuning
- Respond to user data requests and ensure timely resolution
- Address and mitigate security vulnerabilities and compliance issues
Technical Skillset
- Hands-on experience with Spark, Hive, Cloudera Hadoop, Kafka, Ranger
- Strong Linux fundamentals and scripting (Python, Shell)
- Experience with Apache NiFi, Airflow, YARN, and ZooKeeper
- Proficient in monitoring and observability tools : ELK Stack, Prometheus, Loki
- Working knowledge of Kubernetes, Docker, Jenkins CI/CD pipelines
- Strong SQL skills (Oracle/Exadata preferred)
- Familiarity with DataHub, Data Mesh, and security best practices is a plus
Working Arrangements : Rotating 24/7 Shifts, 100% from Pune Office
(ref:hirist.tech)