Synopsis of the role
As a Lead Data Engineer, you will serve as the technical backbone of our data ecosystem, spearheading the design and implementation of high-performance data architectures using Azure Databricks and PySpark. You will be responsible for orchestrating complex, scalable ETL/ELT pipelines within Azure Data Factory, ensuring seamless data integration. Leveraging your mastery of SQL and distributed computing, you will optimize the processing of large-scale datasets to drive advanced analytics and business intelligence initiatives.
What you'll do
As a Lead Data Engineer, you are expected to drive the technical roadmap and execution of our data strategy. Your role will encompass the following core responsibilities:
- Medallion Architecture Implementation: Design and maintain a multi-layered data lakehouse (Bronze, Silver, Gold) to ensure data quality, lineage, and structural refinement from raw ingestion to business-ready assets (a Bronze-to-Silver sketch follows this list).
- Delta Lake Development: Build and optimize high-performance tables using Delta Lake, leveraging features like ACID transactions, schema enforcement, and time travel to ensure data reliability.
- Star Schema Data Modeling: Architect robust dimensional models and Star Schemas in the Gold layer to simplify data access and optimize query performance for downstream BI tools.
- Data Governance with Unity Catalog: Implement and manage centralized Unity Catalog configurations to enforce fine-grained access control, data discovery, and comprehensive lineage across the Azure workspace.
- Scalable PySpark Engineering: Develop, test, and deploy complex data transformation logic using PySpark, ensuring efficient distributed processing and resource utilization within Databricks clusters.
- End-to-End Pipeline Orchestration: Create and monitor sophisticated ETL/ELT workflows using Azure Data Factory (ADF), integrating diverse data sources into a unified cloud ecosystem.
- Data Mart Development: Build specialized, high-performance Data Marts tailored to specific business domains, enabling self-service analytics and rapid decision-making for stakeholders.
- Advanced SQL Optimization: Write and tune complex SQL queries for data analysis and validation, ensuring that data processing logic is both performant and cost-effective.
- Performance Tuning & CI/CD: Lead efforts in cluster configuration, partition tuning, and the automation of deployment pipelines using Azure DevOps to ensure high availability and continuous delivery.
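To ground the Medallion and Delta Lake items above, here is a minimal PySpark sketch of a Bronze-to-Silver promotion with an idempotent Delta MERGE. The table names (`bronze.orders`, `silver.orders`), the quality rule, and the merge key are illustrative assumptions, not a prescribed design.

```python
# Minimal Bronze-to-Silver promotion sketch; names are illustrative.
# `spark` is the ambient SparkSession in a Databricks notebook.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

bronze = spark.read.table("bronze.orders")  # raw, as-ingested data

silver_updates = (
    bronze
    .filter(F.col("order_id").isNotNull())  # basic quality gate
    .dropDuplicates(["order_id"])            # assumed natural key
    .withColumn("processed_at", F.current_timestamp())
)

# A Delta MERGE gives an ACID, idempotent upsert: re-running the job after a
# failure updates existing Silver rows instead of duplicating them.
(
    DeltaTable.forName(spark, "silver.orders").alias("t")
    .merge(silver_updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```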
What experience you need
- Total Data Engineering Experience: 8+ years in data engineering, data warehousing, or database development roles.
- Azure Ecosystem Expertise: 4+ years of hands-on experience building and deploying production-grade solutions on Azure.
- Databricks & PySpark Mastery: 3+ years leading projects that use Azure Databricks and PySpark for large-scale distributed data processing.
- Lead/Architectural Experience: 2+ years in a Lead or Senior capacity, with documented experience designing end-to-end data architectures (e.g., transitioning a legacy system to a Medallion architecture).
- SQL Proficiency: 6+ years of advanced SQL development, including performance tuning, complex window functions, and stored procedure optimization.
- Production Pipeline Delivery: Proven track record of taking at least three enterprise-scale data pipelines built with Azure Data Factory (ADF) from inception to production.
- Education/Certifications: Bachelor's or Master's degree in Computer Science or a related field, plus at least one relevant professional certification, such as Microsoft Certified: Azure Data Engineer Associate (DP-203) or Databricks Certified Data Engineer Professional.
What could set you apart
Advanced Databricks Optimization & Lakehouse Features
- Databricks SQL & Serverless: Experience migrating traditional SQL workloads to Databricks SQL Warehouses to reduce latency and overhead.
- Delta Live Tables (DLT): Proven ability to implement declarative data pipelines that handle task orchestration, monitoring, and quality constraints automatically (see the sketch after this subsection).
- Liquid Clustering: Familiarity with Liquid Clustering, Databricks' successor to Z-Ordering, for optimizing data layout and query performance without manual partition management.
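As a sketch of the declarative style DLT enables, the snippet below assumes a hypothetical JSON `orders` feed landing in cloud storage; the path, table names, and expectations are illustrative. DLT infers the dependency graph and applies the declared quality constraints automatically.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw orders ingested as-is via Auto Loader.")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/orders/")           # hypothetical landing path
    )

@dlt.table(comment="Silver: validated, typed orders.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # quality rule
@dlt.expect_or_drop("positive_amount", "amount > 0")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")        # DLT resolves the dependency
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
        .withColumn("ingested_at", F.current_timestamp())
    )
```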
DevOps & Infrastructure as Code (IaC)
- Terraform/Bicep: Experience deploying entire Azure environments (Resource Groups, Storage Accounts, Databricks Workspaces) using IaC to ensure environment parity across Dev, QA, and Production.
- Unit Testing for Spark: Experience using frameworks like pytest or chispa to validate PySpark logic, ensuring a robust CI/CD cycle rather than testing in production.
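For the unit-testing item, a minimal pytest-plus-chispa sketch; `clean_orders` is a hypothetical transformation standing in for real pipeline logic, and the local SparkSession lets the suite run in CI without a cluster.

```python
import pytest
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality

def clean_orders(df):
    """Hypothetical transformation under test: drop null IDs and dedupe."""
    return df.dropna(subset=["order_id"]).dropDuplicates(["order_id"])

@pytest.fixture(scope="session")
def spark():
    # Small local session so tests run in CI without a Databricks cluster.
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_clean_orders_drops_nulls_and_dupes(spark):
    source = spark.createDataFrame(
        [(1, "a"), (1, "a"), (None, "b")], ["order_id", "item"]
    )
    expected = spark.createDataFrame([(1, "a")], ["order_id", "item"])
    assert_df_equality(clean_orders(source), expected, ignore_row_order=True)
```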
Comprehensive Data Governance & Security
- Unity Catalog Migration: Experience leading a migration from a legacy metastore to Unity Catalog, including managing identity federation and cross-workspace sharing.
- Fine-Grained Security: Implementation of Row-Level Security (RLS) and Column-Level Masking directly within Databricks to comply with strict privacy regulations (GDPR/CCPA).
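One way to realize the fine-grained controls above is Unity Catalog row filters and column masks; the sketch below issues the Databricks SQL through `spark.sql`, and the catalog, schema, table, group names, and redaction rule are illustrative assumptions.

```python
# Row filter: members of an admin group see all rows; others only one region.
spark.sql("""
CREATE OR REPLACE FUNCTION main.governance.region_filter(region STRING)
RETURN is_account_group_member('data_admins') OR region = 'IN'
""")
spark.sql("""
ALTER TABLE main.sales.orders
SET ROW FILTER main.governance.region_filter ON (region)
""")

# Column mask: redact emails for non-privileged users (GDPR/CCPA-style control).
spark.sql("""
CREATE OR REPLACE FUNCTION main.governance.mask_email(email STRING)
RETURN CASE WHEN is_account_group_member('pii_readers') THEN email
            ELSE '***redacted***' END
""")
spark.sql("""
ALTER TABLE main.sales.customers
ALTER COLUMN email SET MASK main.governance.mask_email
""")
```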
Real-Time & Hybrid Processing
- Structured Streaming: Building production-grade, low-latency pipelines that process data in real time or near-real time from Azure Event Hubs or IoT Hub (a streaming sketch follows this subsection).
- Change Data Capture (CDC): Implementing efficient CDC patterns (using tools like Debezium or ADF's built-in CDC) to sync on-premises relational databases with the Delta Lake in near-real time.
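For the Structured Streaming item, one common approach reads Azure Event Hubs through its Kafka-compatible endpoint and appends to a Bronze Delta table; the namespace, event hub name, secret scope, and checkpoint path below are placeholder assumptions.

```python
from pyspark.sql import functions as F

# Connection string pulled from a (hypothetical) Databricks secret scope.
connection = dbutils.secrets.get("kv-scope", "eventhubs-connection")

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "my-namespace.servicebus.windows.net:9093")
    .option("subscribe", "orders")  # the event hub name acts as the Kafka topic
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        # Databricks ships a shaded Kafka client, hence the class prefix.
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
        f'required username="$ConnectionString" password="{connection}";',
    )
    .load()
)

(
    raw.select(F.col("value").cast("string").alias("payload"),
               F.col("timestamp").alias("event_time"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/chk/orders_bronze")  # hypothetical path
    .trigger(availableNow=True)  # or a processingTime trigger for continuous runs
    .toTable("bronze.orders_events")
)
```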
Cost Governance & FinOps
- DBU & Cost Management: A track record of implementing Databricks cluster policies and tagging strategies to monitor and reduce DBU (Databricks Unit) consumption (see the policy sketch below).
- Optimization of ADF Triggers: Knowing when to use Tumbling Window vs. Schedule triggers, and how to tune Integration Runtimes to minimize execution costs.
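To illustrate the DBU-governance item, here is a hedged sketch of a cost-guardrail cluster policy created with the `databricks-sdk` Python package (one option among several; Terraform or the REST API work equally well). The limits and tag value are illustrative assumptions.

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up auth from the environment or a config profile

# Policy definition follows the Databricks cluster-policy format:
# cap idle time and autoscaling, and force a cost-attribution tag.
policy = {
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}

w.cluster_policies.create(
    name="cost-guardrails-standard",
    definition=json.dumps(policy),
)
```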
#India