Design, develop, and maintain scalable and efficient cloud-based data infrastructure using SQL and PySpark.
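To illustrate the kind of SQL-driven transformation step this responsibility involves, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a cloud warehouse engine; the `events` table and its columns are hypothetical:

```python
import sqlite3

# Minimal sketch of one SQL transformation stage of a data pipeline.
# sqlite3 stands in for a cloud warehouse engine; the "events" table
# and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Aggregate raw events into a per-user summary, as a pipeline stage might
# before loading the result into an analytical table.
rows = conn.execute(
    "SELECT user_id, SUM(amount) AS total FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 15.0), (2, 7.5)]
```

In a production pipeline the same aggregation would typically be expressed as a PySpark DataFrame operation or a warehouse SQL job; the shape of the step is the same.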
Collaborate with data scientists, analysts, and other stakeholders to understand data requirements and deliver appropriate data solutions.
Identify potential data sources, define data ingestion architecture, and implement efficient data pipelines.
Ensure smooth flow of data from sources to data lakes, warehouses, and analytical platforms.
Troubleshoot and resolve issues related to data processing, pipeline performance, and data quality.
Implement automated testing frameworks for data validation, integrity, and pipeline performance.
Integrate automated tests into CI/CD pipelines to enable continuous testing and safe, repeatable deployment.
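As a sketch of the automated data-validation checks described above, the following framework-agnostic Python functions could run as a step in a CI/CD job; the sample batch and required field names are hypothetical, and in practice such checks might be wrapped in pytest or a dedicated data-quality tool:

```python
# Minimal sketch of automated data-quality checks suitable for a CI/CD
# job. Framework-agnostic on purpose; the sample records and required
# fields are hypothetical.

def check_no_nulls(records, required_fields):
    """Fail if any required field is missing or None in any record."""
    for i, rec in enumerate(records):
        for field in required_fields:
            if rec.get(field) is None:
                raise AssertionError(f"record {i}: null {field!r}")

def check_row_count(records, minimum):
    """Fail if the batch is suspiciously small (possible upstream outage)."""
    if len(records) < minimum:
        raise AssertionError(f"expected >= {minimum} rows, got {len(records)}")

batch = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": 7.5},
]
check_no_nulls(batch, ["user_id", "amount"])
check_row_count(batch, minimum=1)
print("all checks passed")
```

A CI runner would execute these checks on each pipeline change and fail the build on any assertion error, which is what ties data validation into the deployment gate.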
Work with streaming data and real-time APIs/services, performing automated validation using custom scripts and assertions.
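The per-record validation of a live stream mentioned above can be sketched as follows; the generator is a hypothetical stand-in for a real streaming source (e.g. a Kafka consumer or a polling loop over a real-time API), and the field names and value range are assumptions for illustration:

```python
# Minimal sketch of validating a real-time stream with custom assertions.
# event_stream() stands in for a live source such as a Kafka consumer or
# a polling loop over a real-time API; field names and the value range
# are hypothetical.

def event_stream():
    """Stand-in for a live event source."""
    yield {"sensor": "a", "value": 0.5}
    yield {"sensor": "b", "value": 0.9}

def validate(event):
    """Custom assertions applied to each record as it arrives."""
    assert "sensor" in event, "missing sensor id"
    assert 0.0 <= event["value"] <= 1.0, "value out of range"
    return event

# Validate records one at a time, as a streaming job would.
validated = [validate(e) for e in event_stream()]
print(f"validated {len(validated)} events")
```

In a real streaming job, records that fail validation would typically be routed to a dead-letter queue rather than raised as errors, but the assertion logic itself is the same.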
Document data infrastructure, ETL processes, and pipelines to ensure knowledge transfer and maintainability.
Stay updated with emerging cloud technologies, data engineering tools, and best practices.
Ensure compliance with data governance, security, and quality standards.
Support DevOps activities, including the installation, configuration, and integration of automation scripts into CI/CD pipelines on platforms such as GitHub Actions.