Description:
We are looking for an experienced Data Engineer with proven expertise in building and optimizing ETL pipelines on Databricks, leveraging Delta Lake and Spark SQL. The ideal candidate will have a strong foundation in Python and SQL, a solid understanding of data storage formats such as Parquet and Delta, and experience in performance optimization, testing, and automated workflows.
Responsibilities:
- ETL Development: Design and implement well-structured Databricks notebooks for ETL workflows, following best practices
- Data Storage: Utilize Delta Lake for data storage, demonstrating understanding of its benefits such as ACID transactions, schema enforcement, and time travel
- Data Transformation: Apply Spark SQL for complex data transformations and aggregations (a minimal end-to-end sketch follows this list)
- Delta Live Tables (DLT): Design and manage declarative, incremental pipelines on top of Delta Lake using Delta Live Tables. Leverage built-in orchestration, dependency management, and data quality checks for reliable ETL workflows (see the DLT sketch below)
- File Formats: Work with a range of storage formats, including Parquet, ORC, Avro, and JSON, to ensure versatility in handling different data sources
- Delta Sharing: Configure and manage Delta Sharing for secure, governed data distribution, integrating with Unity Catalog for access control, auditing, and automation as part of the data delivery process
- Data Governance: Leverage Unity Catalog for data lineage, tagging, and access control, enhancing data discoverability and ensuring compliance
- Error Handling & Validation: Implement proper exception handling, logging, and data validation checks to ensure data quality (illustrated in the error-handling sketch below)
- Automation: Develop automated triggers and job orchestration for pipeline execution
- Documentation: Maintain comprehensive documentation covering the project, its dependencies, execution steps, and recommendations for stakeholders
- Test cases & Validation: Develop and maintain test cases to validate data transformations, schema consistency, and business rules, ensuring data accuracy and reliability across all pipeline stages
- Performance Optimization: Optimize ETL processes for scalability and reduced processing time (typical maintenance commands are sketched below)
- Collaboration: Work closely with business analysts, data scientists, and stakeholders to deliver actionable insights
- Security Best Practices: Apply encryption, masking, and role-based access control in Databricks and cloud storage
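To give candidates a concrete sense of the day-to-day work, here is a minimal PySpark sketch of an ETL step against Delta Lake. The paths and table names (e.g. /mnt/landing/orders/, analytics.daily_orders) are illustrative placeholders, not part of our environment:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks notebooks

# Extract: read raw Parquet files from a landing zone (path is a placeholder)
raw = spark.read.parquet("/mnt/landing/orders/")

# Transform: aggregate with Spark SQL
raw.createOrReplaceTempView("orders_raw")
daily = spark.sql("""
    SELECT order_date,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM orders_raw
    GROUP BY order_date
""")

# Load: write as a managed Delta table; Delta enforces the schema on write
daily.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_orders")

# Time travel: query an earlier version of the table for audit or rollback
previous = spark.read.option("versionAsOf", 0).table("analytics.daily_orders")
```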
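A hedged sketch of what a Delta Live Tables pipeline with declarative quality checks can look like; the dataset names and quality rules are assumptions for illustration:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested from cloud storage (path is a placeholder)")
def orders_bronze():
    # the spark session is provided by the DLT runtime
    return spark.read.format("json").load("/mnt/landing/orders/")

@dlt.table(comment="Cleaned orders with declarative quality gates")
@dlt.expect_or_drop("valid_amount", "amount > 0")    # rows failing this expectation are dropped
@dlt.expect("has_order_id", "order_id IS NOT NULL")  # violations are recorded in pipeline metrics
def orders_silver():
    # dlt.read() declares the dependency on orders_bronze for DLT's orchestration
    return dlt.read("orders_bronze").withColumn("ingested_at", F.current_timestamp())
```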
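A minimal sketch of the exception handling, logging, and validation gating we expect in production pipelines; the helper name run_load and the row-count threshold are hypothetical:

```python
import logging

logger = logging.getLogger("etl.orders")

def run_load(spark, source_path: str, target_table: str, min_rows: int = 1) -> None:
    try:
        df = spark.read.parquet(source_path)
        # Validation gate: refuse to load an empty or truncated extract
        if df.count() < min_rows:
            raise ValueError(f"Validation failed: fewer than {min_rows} rows in {source_path}")
        df.write.format("delta").mode("append").saveAsTable(target_table)
        logger.info("Loaded %s into %s", source_path, target_table)
    except Exception:
        logger.exception("Load failed for %s", source_path)
        raise  # re-raise so the Databricks job run is marked as failed
```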
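Typical Delta maintenance commands used in performance tuning, shown with placeholder table and column names and assuming a notebook or job context where spark is available:

```python
# Compact small files and co-locate related data for faster reads
spark.sql("OPTIMIZE analytics.daily_orders ZORDER BY (order_date)")

# Remove files no longer referenced by the table, past the retention window
spark.sql("VACUUM analytics.daily_orders RETAIN 168 HOURS")
```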
Requirements:
- 5+ years in Data Engineering, with strong expertise in Databricks, PySpark, and Python.
- Dynamic, self-motivated engineer with strong logical reasoning and problem-solving skills
- Strong experience in Python and SQL, with extensive debugging skills
- Version control & DevOps: Git (GitHub/GitLab) for versioning and integration with CI/CD pipelines
- Hands-on experience with Databricks and Delta Lake
- Solid understanding of Spark SQL and distributed computing concepts
- Experience in ETL design, data modeling, and pipeline automation
- Knowledge of error handling, logging, and data validation techniques
- Experience with unit testing and integration testing in data pipelines (a brief testing sketch follows this list)
- Proven track record in performance tuning of large-scale data processing jobs
- Strong problem-solving and analytical skills
- Excellent written and verbal communication skills
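A short sketch of the style of unit testing we look for, assuming pytest and a local SparkSession; the transformation and business rule are illustrative:

```python
import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="session")
def spark():
    # Local session for tests; no cluster required
    return SparkSession.builder.master("local[2]").appName("etl-tests").getOrCreate()

def add_order_flag(df):
    # Transformation under test: flags large orders (illustrative business rule)
    return df.withColumn("is_large", F.col("amount") > 100)

def test_add_order_flag(spark):
    df = spark.createDataFrame([(1, 50.0), (2, 150.0)], ["order_id", "amount"])
    result = {r["order_id"]: r["is_large"] for r in add_order_flag(df).collect()}
    assert result == {1: False, 2: True}
```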
Preferred Skills:
- Experience with cloud platforms (Azure, AWS, or GCP) in a data engineering context.
- Knowledge of data governance and compliance best practices.