Design, develop, and maintain ETL pipelines using Python, PySpark, and SQL on distributed data platforms.
Write clean, efficient, and scalable PySpark code for big data transformation and processing.
Develop reusable scripts and tools for data ingestion, cleansing, validation, and aggregation (see the sketch after this list).
Work with structured and semi-structured data (JSON, Parquet, Avro, etc.).
Optimize SQL queries for performance and cost-efficiency in data lakes or warehouses.
Collaborate with data architects, analysts, and BI developers to deliver end-to-end data solutions.
Participate in code reviews, peer programming, and unit/integration testing.
Support and troubleshoot issues in development, test, and production environments.
Document technical processes, data flows, and pipeline designs.
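The following is a minimal, illustrative PySpark sketch of the ingestion-to-aggregation flow described above. The S3 paths, column names, and validation rules are assumptions made for the example, not part of any actual pipeline for this role.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl_sketch").getOrCreate()

# Ingest semi-structured input (hypothetical path and schema).
orders = spark.read.json("s3://example-bucket/raw/orders/")

# Cleanse and validate: drop rows missing keys, cast types, filter bad amounts.
clean = (
    orders
    .dropna(subset=["order_id", "customer_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount") > 0)
)

# Aggregate: daily revenue and order count per customer.
daily_revenue = (
    clean
    .groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("total_amount"),
         F.count("*").alias("order_count"))
)

# Persist as Parquet, partitioned for downstream query efficiency.
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/curated/daily_revenue/"
)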
Required Skills and Qualifications:
Bachelor's degree in Computer Science, Engineering, or related technical field.
3-6 years of hands-on experience in Python, PySpark, and SQL.
Proficiency in Apache Spark (RDD/DataFrame APIs), Spark performance tuning, and distributed computing concepts (see the tuning sketch after this list).
Strong experience with relational databases such as PostgreSQL, SQL Server, or Oracle, including writing complex SQL queries with CTEs, joins, and window functions (see the SQL sketch after this list).
Familiarity with cloud platforms such as AWS, Azure, or GCP (e.g., EMR, Databricks, BigQuery, Synapse).
Experience with data lake and data warehouse concepts.
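As an illustration of the Spark performance-tuning levers referenced above (not a prescribed approach), the sketch below uses a broadcast join, key-based repartitioning, and caching; the datasets and partition count are assumed for the example.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning_sketch").getOrCreate()

# Hypothetical large fact table and small dimension table.
facts = spark.read.parquet("s3://example-bucket/curated/daily_revenue/")
customers = spark.read.parquet("s3://example-bucket/reference/customers/")

# Broadcast the small dimension table so the join avoids a full shuffle.
enriched = facts.join(F.broadcast(customers), on="customer_id", how="left")

# Repartition on the key used by later aggregations and cache if reused.
enriched = enriched.repartition(200, "customer_id").cache()

enriched.groupBy("region").agg(F.sum("total_amount").alias("revenue")).show()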
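The Spark SQL sketch below shows the kind of query pattern mentioned above, combining a CTE, a join, and a window function; the orders and customers tables are hypothetical and assumed to already be registered as views.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_sketch").getOrCreate()

# Latest order per customer, assuming `orders` and `customers` views exist.
latest_orders = spark.sql("""
    WITH ranked_orders AS (               -- CTE
        SELECT o.order_id,
               o.customer_id,
               c.region,
               o.amount,
               ROW_NUMBER() OVER (        -- window function
                   PARTITION BY o.customer_id
                   ORDER BY o.order_ts DESC
               ) AS rn
        FROM orders o
        JOIN customers c                  -- join
          ON o.customer_id = c.customer_id
    )
    SELECT order_id, customer_id, region, amount
    FROM ranked_orders
    WHERE rn = 1
""")
latest_orders.show()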