Greetings from Techwurkz!
Your skills and experience match a US-offshore project requirement of our key client.
Do let us know of your interest and availability to connect.
We would also welcome connecting with you through our LinkedIn page.
Team Techwurkz
[Confidential Information]
https://www.linkedin.com/company/techwurkz/
Position: Site Reliability Engineer (SRE) - Data Platform (Kubernetes, Linux, Networking, Docker, Python, Spark/Airflow)
Experience: 4+ years
Shift Timing: CST time zone, i.e., 7:30 pm IST to 4:30 am IST
Annual CTC Budget: INR 25 Lakh
Work location: Remote (India)
Immediate Joiner
Key Responsibilities
- Troubleshoot and resolve complex application issues, providing detailed root cause analysis and preventative measures.
- Provide direct support to individual users and cross-functional teams, diagnosing and resolving their Spark job-related problems with a focus on understanding the core issue.
- Maintain the reliability and performance of the critical infrastructure that powers Spark and Airflow.
- Automate operational tasks and improve system efficiency through scripting, always looking for opportunities to enhance stability and reduce manual intervention.
- Collaborate with development and other SRE teams to identify root causes of issues and implement robust, long-term solutions.
- Participate in on-call rotations to ensure continuous availability and rapid response to incidents.
- Ensure the stability, performance, and reliability of the company's internal platforms.
- Troubleshoot and maintain applications built on a complex, distributed, cloud-native infrastructure that relies heavily on Apache Spark and Apache Airflow.
- Support cross-functional teams across Apple by ensuring their Spark jobs run smoothly and efficiently, providing essential operational support and expertise.
- Diagnose and resolve complex issues related to Spark applications and workflows running on our internal platform.
Required Qualifications
- Strong understanding and hands-on experience troubleshooting applications deployed on Kubernetes.
- Basic proficiency in Python for scripting and automation tasks.
- Deep, practical knowledge of networking principles, with the ability to diagnose network issues using standard command-line tools.
- Experience with containerization technologies (e.g., Docker) and their orchestration.
- Proficient in Linux operating systems, including:
  - Advanced command-line tools for system diagnostics and troubleshooting (e.g., for inspecting network routes, open files, process information).
  - Scripting and system administration.
  - A strong desire to understand the internals of the OS.
Preferred Qualifications
- Familiarity with Apache Spark and Apache Airflow, given their central role in day-to-day troubleshooting.
- Basic understanding of Java, which may occasionally be required for specific jobs, though deeper Java expertise is often handled by cross-functional teams.
Tech Stack
Python scripting, Linux, Networking, Kubernetes, Docker, Previous SRE experience