We are seeking a skilled Data Engineer with hands-on experience in Databricks and PySpark to design and implement scalable data pipelines and analytics solutions. The ideal candidate will work closely with data architects, analysts, and business stakeholders to transform large volumes of raw data into clean, usable datasets that drive business decisions.
Key Responsibilities:
- Design, develop, and maintain ETL/ELT pipelines using Databricks, PySpark, and Apache Spark (a brief illustrative sketch follows this list).
- Build data lakes and data warehouses on cloud platforms (preferably Azure, AWS, or GCP).
- Implement scalable and optimized data transformation processes for structured and semi-structured data.
- Collaborate with data analysts and data scientists to understand data requirements and provide clean, curated datasets.
- Perform data quality checks, validation, and error handling within pipelines.
- Optimize data pipelines for performance, cost-efficiency, and resilience.
- Monitor and troubleshoot data jobs using tools such as Databricks Jobs, Apache Airflow, or Azure Data Factory.
- Ensure compliance with data security, privacy, and governance standards.
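For context only, a minimal sketch of the kind of PySpark batch pipeline this role involves: extract semi-structured data, apply a simple data-quality gate, and append the curated output to a Delta table. All paths, table names, and thresholds here are hypothetical placeholders, not part of the requirements.

```python
# Illustrative sketch only: a minimal PySpark batch job with a data-quality gate.
# Paths, table names, and the 5% threshold are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Extract: read semi-structured JSON landed in cloud storage.
raw = spark.read.json("/mnt/landing/raw_orders/")
raw_count = raw.count()

# Transform: normalize types and drop obviously invalid rows.
clean = (
    raw
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("order_id").isNotNull() & (F.col("amount") >= 0))
)

# Quality gate: fail the job if more than 5% of rows were rejected.
rejected = raw_count - clean.count()
if raw_count > 0 and rejected / raw_count > 0.05:
    raise ValueError(f"Data quality gate failed: {rejected} of {raw_count} rows rejected")

# Load: append curated rows to a Delta table for analysts and data scientists.
clean.write.format("delta").mode("append").saveAsTable("curated.orders")
```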
Required Skills:
- 4-5 years of experience as a Data Engineer.
- Strong hands-on experience with Databricks, PySpark, and Apache Spark.
- Proficiency in Python and SQL for data manipulation and processing.
- Experience working with big data platforms and cloud ecosystems (Azure, AWS, or GCP).
- Knowledge of Delta Lake, Lakehouse architecture, and Parquet/Avro file formats (see the illustrative upsert sketch after this list).
- Familiarity with Git, CI/CD pipelines, and Agile development practices.
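As a further illustration of the Delta Lake knowledge referenced above, here is a short, hedged sketch of an idempotent upsert using the delta-spark Python API (available on Databricks runtimes). Table and path names are placeholders.

```python
# Illustrative sketch only: upsert staged Parquet records into a Delta table.
# Assumes a Databricks runtime or delta-spark; names and paths are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-upsert").getOrCreate()

# Staged updates written by an upstream job (hypothetical path).
updates = spark.read.format("parquet").load("/mnt/staging/orders_updates/")

# Merge into the curated Delta table keyed on order_id.
target = DeltaTable.forName(spark, "curated.orders")
(
    target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```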
Preferred Skills:
- Experience with Airflow, Azure Data Factory, or other orchestration tools.
- Understanding of data modeling techniques (star and snowflake schemas).
- Knowledge of DevOps and infrastructure-as-code tools (Terraform, ARM templates) is a plus.
- Exposure to machine learning pipelines or MLflow is a bonus.