- Design, implement, and optimize ETL pipelines and data processing workflows using PySpark (an illustrative sketch follows this list)
- Develop large-scale data processing workloads on distributed computing frameworks such as Apache Spark
- Work with Databricks and other cloud platforms for data storage and transformation
- Perform data analysis, validation, and integration from multiple sources
- Troubleshoot and resolve data pipeline and processing issues
- Maintain proper documentation of data workflows, pipelines, and processes
- Ensure adherence to best practices for performance, scalability, and data governance
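
The following is a minimal, illustrative sketch of the kind of PySpark ETL workflow described above; the source paths, column names, and table layout are hypothetical and would differ per project.

```python
# Illustrative sketch only: hypothetical paths, columns, and schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-etl").getOrCreate()

# Extract: read raw records from a hypothetical landing zone
orders = spark.read.json("/data/raw/orders/")

# Transform: validate keys, deduplicate, and derive a partition column
cleaned = (
    orders
    .filter(F.col("order_id").isNotNull())    # drop records missing the key
    .dropDuplicates(["order_id"])              # enforce uniqueness on the key
    .withColumn("order_date", F.to_date("created_at"))
)

# Load: write partitioned Parquet to a hypothetical curated zone
cleaned.write.mode("overwrite").partitionBy("order_date").parquet("/data/curated/orders/")
```
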
Key Performance Indicators
- Timely delivery of data pipelines and ETL workflows
- Accuracy, consistency, and integrity of processed data
- Performance and scalability of data processing solutions
- Effective collaboration with cross-functional teams