Site Reliability Engineer (SRE) – Data Platform

4-10 Years

15 - 21 LPA

Job Description

Greetings from Techwurkz!

Your skills and experience match a US-offshore project requirement of our key client.

Do let us know of your interest and availability to connect.

We would also welcome connecting with you through our LinkedIn page.

Team Techwurkz

[Confidential Information]

Position: Site Reliability Engineer (SRE) - Data Platform (Kubernetes, Linux, Networking, Docker, Python, Spark/Airflow)

Experience: 4+ years

Shift Timing: CST time zone, i.e., 7:30 pm IST to 4:30 am IST

Annual CTC Budget: INR 25 Lakh

Work location: Remote (India)

Immediate Joiner

Key Responsibilities

Troubleshoot and resolve complex application issues, providing detailed root cause analysis and preventative measures.
Provide direct support to individual users and cross-functional teams, diagnosing and resolving their Spark job-related problems with a focus on understanding the core issue.
Maintain the reliability and performance of the critical infrastructure that powers Spark and Airflow.
Automate operational tasks and improve system efficiency through scripting, always looking for opportunities to enhance stability and reduce manual intervention.
Collaborate with development and other SRE teams to identify root causes of issues and implement robust, long-term solutions.
Participate in on-call rotations to ensure continuous availability and rapid response to incidents.
Ensure the stability, performance, and reliability of the company's internal platforms.
Troubleshoot and maintain applications built on a complex, distributed, cloud-native infrastructure that relies heavily on Apache Spark and Apache Airflow.
Support cross-functional teams across Apple by ensuring their Spark jobs run smoothly and efficiently, providing essential operational support and expertise.
Diagnose and resolve complex issues related to Spark applications and workflows running on the internal platform.

Required Qualifications

Strong understanding and hands-on experience troubleshooting applications deployed on Kubernetes.
Basic proficiency in Python for scripting and automation tasks.
Deep and practical knowledge of networking principles, with the ability to diagnose network issues using standard command-line tools.
Experience with containerization technologies (e.g., Docker) and their orchestration.
Proficient in Linux operating systems, including:
Advanced command-line tools for system diagnostics and troubleshooting (e.g., for inspecting network routes, open files, process information).
Scripting and system administration.
A strong desire to understand the internals of the OS.

Preferred Qualifications

Familiarity with Apache Spark and Apache Airflow, given their central role in day-to-day troubleshooting.
Basic understanding of Java, which may occasionally be required for specific jobs, though deeper Java expertise is usually handled by cross-functional teams.

Tech Stack

Python scripting, Linux, Networking, Kubernetes, Docker, Previous SRE experience