Key Skills & Responsibilities
- Strong expertise in PySpark and Apache Spark for batch and real-time data processing.
- Experience in designing and implementing ETL pipelines, including data ingestion, transformation, and validation (see the ETL sketch after this list).
- Proficiency in Python for scripting, automation, and building reusable components.
- Hands-on experience with scheduling tools such as Airflow or Control-M to orchestrate workflows (see the DAG sketch after this list).
- Familiarity with the AWS ecosystem, especially S3 and related file-system operations.
- Strong understanding of Unix/Linux environments and Shell scripting.
- Experience with Hadoop, Hive, and platforms like Cloudera or Hortonworks.
- Ability to handle CDC (Change Data Capture) operations on large datasets (see the CDC sketch after this list).
- Experience in performance tuning, optimizing Spark jobs, and troubleshooting.
- Strong knowledge of data modeling, data validation, and writing unit test cases.
- Exposure to real-time and batch integration with downstream/upstream systems.
- Working knowledge of Jupyter Notebook, Zeppelin, or PyCharm for development and debugging.
- Understanding of Agile methodologies, with experience in CI/CD tools (e.g., Jenkins, Git).
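For illustration, a minimal PySpark batch ETL sketch of the ingest-transform-validate-load pattern described above; the paths, schema, and column names are hypothetical placeholders, not a prescribed design:

```python
# Minimal PySpark batch ETL sketch: ingest CSV, transform, validate, write Parquet.
# All paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-etl").getOrCreate()

# Ingest: read raw CSV from a landing zone (an s3a:// path works the same way).
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/landing/orders/")  # hypothetical input path
)

# Transform: normalize types and derive a partition column.
orders = (
    raw
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_ts"))
)

# Validate: keep only rows with a key and a positive amount; quarantine the rest.
valid = orders.filter(F.col("order_id").isNotNull() & (F.col("amount") > 0))
rejected = orders.exceptAll(valid)

# Load: write curated data partitioned by date; park the rejects for review.
valid.write.mode("overwrite").partitionBy("order_date").parquet("/data/curated/orders/")
rejected.write.mode("overwrite").parquet("/data/quarantine/orders/")

spark.stop()
```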
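A minimal Airflow DAG sketch showing how such a job might be scheduled; the DAG id, schedule, and spark-submit command are assumptions, not a required setup:

```python
# Minimal Airflow 2.x DAG sketch that submits the hypothetical ETL job above
# on a daily schedule. DAG id, schedule, and job path are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_etl_daily",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="spark_submit_orders_etl",
        bash_command="spark-submit /opt/jobs/orders_etl.py",  # hypothetical job path
    )
```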
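And a sketch of one common CDC-apply pattern in PySpark (latest change per key, then merge into the current snapshot); the key, sequence, and op-flag columns are hypothetical:

```python
# Minimal CDC-apply sketch: given a full snapshot and a batch of change records
# (insert/update/delete flags), produce the updated snapshot. The table layout
# and the "op" flag convention are hypothetical.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-cdc").getOrCreate()

snapshot = spark.read.parquet("/data/curated/customers/")  # hypothetical current state
changes = spark.read.parquet("/data/cdc/customers/")       # hypothetical change feed

# Keep only the latest change per key, ordered by a change-sequence column.
w = Window.partitionBy("customer_id").orderBy(F.col("change_seq").desc())
latest = (
    changes
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# Apply: drop keys present in the change feed, then append non-delete records.
merged = (
    snapshot.join(latest.select("customer_id"), "customer_id", "left_anti")
    .unionByName(latest.filter(F.col("op") != "D").drop("op", "change_seq"))
)

merged.write.mode("overwrite").parquet("/data/curated/customers_v2/")
spark.stop()
```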
Preferred Skills
- Experience in building or integrating APIs for data provisioning.
- Exposure to ETL or reporting tools such as Informatica, Tableau, Jasper, or QlikView.
- Familiarity with AI/ML model development using PySpark in cloud environments (see the MLlib sketch at the end of this section).
- Mandatory Key Skills: Apache Spark, Python, ETL, Unix, Linux, data engineering, Agile methodologies, CI/CD, data modeling, data validation, PySpark.
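For the AI/ML item above, a minimal PySpark MLlib sketch; the dataset path, feature columns, and label column are hypothetical placeholders:

```python
# Minimal PySpark MLlib sketch: train a logistic-regression model on a curated
# dataset. Input path, feature columns, and label column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("example-ml").getOrCreate()

df = spark.read.parquet("/data/curated/orders/")  # hypothetical curated dataset

# Assemble numeric feature columns into the single vector column MLlib expects,
# skipping rows with invalid (null/NaN) feature values.
assembler = VectorAssembler(
    inputCols=["amount", "item_count"], outputCol="features", handleInvalid="skip"
)
train = assembler.transform(df).select("features", "label")

model = LogisticRegression(labelCol="label").fit(train)
print(model.coefficients)

spark.stop()
```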