Synopsis of the role
As a Lead Data Engineer, you will serve as the technical backbone of our data ecosystem, spearheading the design and implementation of high-performance data architectures using Azure Databricks and PySpark. You will be responsible for orchestrating complex, scalable ETL/ELT pipelines within Azure Data Factory, ensuring seamless data integration. Leveraging your mastery of SQL and distributed computing, you will optimize the processing of large-scale datasets to drive advanced analytics and business intelligence initiatives.
What you'll do
As a Lead Data Engineer, you are expected to drive the technical roadmap and execution of our data strategy. Your role will encompass the following core responsibilities:
- Medallion Architecture Implementation: Design and maintain a multi-layered data lakehouse (Bronze, Silver, Gold) to ensure data quality, lineage, and structural refinement from raw ingestion to business-ready assets (a Bronze-to-Silver sketch follows this list).
- Delta Lake Development: Build and optimize high-performance tables using Delta Lake, leveraging features like ACID transactions, schema enforcement, and time travel to ensure data reliability.
- Star Schema Data Modeling: Architect robust dimensional models and Star Schemas in the Gold layer to simplify data access and optimize query performance for downstream BI tools.
- Data Governance with Unity Catalog: Implement and manage centralized Unity Catalog configurations to enforce fine-grained access control, data discovery, and comprehensive lineage across the Azure workspace.
- Scalable PySpark Engineering: Develop, test, and deploy complex data transformation logic using PySpark, ensuring efficient distributed processing and resource utilization within Databricks clusters.
- End-to-End Pipeline Orchestration: Create and monitor sophisticated ETL/ELT workflows using Azure Data Factory (ADF), integrating diverse data sources into a unified cloud ecosystem.
- Data Mart Development: Build specialized, high-performance Data Marts tailored to specific business domains, enabling self-service analytics and rapid decision-making for stakeholders.
- Advanced SQL Optimization: Write and tune complex SQL queries for data analysis and validation, ensuring that data processing logic is both performant and cost-effective.
- Performance Tuning & CI/CD: Lead efforts in cluster configuration, partition tuning, and the automation of deployment pipelines using Azure DevOps to ensure high availability and continuous delivery.
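To ground the Medallion and Delta Lake items above, here is a minimal PySpark sketch of a Bronze-to-Silver promotion with an idempotent Delta MERGE. The table names (`bronze.orders`, `silver.orders`), the quality rule, and the merge key are illustrative assumptions, not a prescribed design.

```python
# Minimal Bronze-to-Silver promotion sketch; names are illustrative.
# `spark` is the ambient SparkSession in a Databricks notebook.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

bronze = spark.read.table("bronze.orders")  # raw, as-ingested data

silver_updates = (
    bronze
    .filter(F.col("order_id").isNotNull())  # basic quality gate
    .dropDuplicates(["order_id"])            # assumed natural key
    .withColumn("processed_at", F.current_timestamp())
)

# A Delta MERGE gives an ACID, idempotent upsert: re-running the job after a
# failure updates existing Silver rows instead of duplicating them.
(
    DeltaTable.forName(spark, "silver.orders").alias("t")
    .merge(silver_updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```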
What experience you need
- Total Data Engineering Experience: 8+ years in data engineering, data warehousing, or database development roles.
- Azure Ecosystem Expertise: 4+ years of hands-on experience building and deploying production-grade solutions on Azure.
- Databricks & PySpark Mastery: 3+ years leading projects that use Azure Databricks and PySpark for large-scale distributed data processing.
- Lead/Architectural Experience: 2+ years in a Lead or Senior capacity, with documented experience designing end-to-end data architectures (e.g., transitioning a legacy system to a Medallion architecture).
- SQL Proficiency: 6+ years of advanced SQL development, including performance tuning, complex window functions, and stored procedure optimization.
- Production Pipeline Delivery: Proven track record of taking at least three enterprise-scale data pipelines built with Azure Data Factory (ADF) from inception to production.
- Education/Certifications: Bachelor's or Master's degree in Computer Science or a related field, plus at least one relevant professional certification, such as Microsoft Certified: Azure Data Engineer Associate (DP-203) or Databricks Certified Data Engineer Professional.
What could set you apart
Advanced Databricks Optimization & Lakehouse Features
- Databricks SQL & Serverless: Experience migrating traditional SQL workloads to Databricks SQL Warehouses to reduce latency and overhead.
- Delta Live Tables (DLT): Proven ability to implement declarative data pipelines that handle task orchestration, monitoring, and quality constraints automatically (see the sketch after this subsection).
- Liquid Clustering: Familiarity with Liquid Clustering, Databricks' successor to Z-Ordering, for optimizing data layout and query performance without manual partition management.
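As a sketch of the declarative style DLT enables, the snippet below assumes a hypothetical JSON `orders` feed landing in cloud storage; the path, table names, and expectations are illustrative. DLT infers the dependency graph and applies the declared quality constraints automatically.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw orders ingested as-is via Auto Loader.")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/orders/")           # hypothetical landing path
    )

@dlt.table(comment="Silver: validated, typed orders.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # quality rule
@dlt.expect_or_drop("positive_amount", "amount > 0")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")        # DLT resolves the dependency
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
        .withColumn("ingested_at", F.current_timestamp())
    )
```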
DevOps & Infrastructure as Code (IaC)
- Terraform/Bicep: Experience deploying entire Azure environments (Resource Groups, Storage Accounts, Databricks Workspaces) using IaC to ensure environment parity across Dev, QA, and Production.
- Unit Testing for Spark: Experience using frameworks like pytest or chispa to validate PySpark logic, ensuring a robust CI/CD cycle rather than testing in production.
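For the unit-testing item, a minimal pytest-plus-chispa sketch; `clean_orders` is a hypothetical transformation standing in for real pipeline logic, and the local SparkSession lets the suite run in CI without a cluster.

```python
import pytest
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality

def clean_orders(df):
    """Hypothetical transformation under test: drop null IDs and dedupe."""
    return df.dropna(subset=["order_id"]).dropDuplicates(["order_id"])

@pytest.fixture(scope="session")
def spark():
    # Small local session so tests run in CI without a Databricks cluster.
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_clean_orders_drops_nulls_and_dupes(spark):
    source = spark.createDataFrame(
        [(1, "a"), (1, "a"), (None, "b")], ["order_id", "item"]
    )
    expected = spark.createDataFrame([(1, "a")], ["order_id", "item"])
    assert_df_equality(clean_orders(source), expected, ignore_row_order=True)
```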
Comprehensive Data Governance & Security
- Unity Catalog Migration: Experience leading a migration from a legacy metastore to Unity Catalog, including managing identity federation and cross-workspace sharing.
- Fine-Grained Security: Implementation of Row-Level Security (RLS) and Column-Level Masking directly within Databricks to comply with strict privacy regulations (GDPR/CCPA).
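One way to realize the fine-grained controls above is Unity Catalog row filters and column masks; the sketch below issues the Databricks SQL through `spark.sql`, and the catalog, schema, table, group names, and redaction rule are illustrative assumptions.

```python
# Row filter: members of an admin group see all rows; others only one region.
spark.sql("""
CREATE OR REPLACE FUNCTION main.governance.region_filter(region STRING)
RETURN is_account_group_member('data_admins') OR region = 'IN'
""")
spark.sql("""
ALTER TABLE main.sales.orders
SET ROW FILTER main.governance.region_filter ON (region)
""")

# Column mask: redact emails for non-privileged users (GDPR/CCPA-style control).
spark.sql("""
CREATE OR REPLACE FUNCTION main.governance.mask_email(email STRING)
RETURN CASE WHEN is_account_group_member('pii_readers') THEN email
            ELSE '***redacted***' END
""")
spark.sql("""
ALTER TABLE main.sales.customers
ALTER COLUMN email SET MASK main.governance.mask_email
""")
```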
Real-Time & Hybrid Processing
- Structured Streaming: Building production-grade, low-latency pipelines that process data in real time or near-real time from Azure Event Hubs or IoT Hub (a streaming sketch follows this subsection).
- Change Data Capture (CDC): Implementing efficient CDC patterns (using tools like Debezium or ADF's built-in CDC) to sync on-premises relational databases with the Delta Lake in near-real time.
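For the Structured Streaming item, one common approach reads Azure Event Hubs through its Kafka-compatible endpoint and appends to a Bronze Delta table; the namespace, event hub name, secret scope, and checkpoint path below are placeholder assumptions.

```python
from pyspark.sql import functions as F

# Connection string pulled from a (hypothetical) Databricks secret scope.
connection = dbutils.secrets.get("kv-scope", "eventhubs-connection")

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "my-namespace.servicebus.windows.net:9093")
    .option("subscribe", "orders")  # the event hub name acts as the Kafka topic
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        # Databricks ships a shaded Kafka client, hence the class prefix.
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
        f'required username="$ConnectionString" password="{connection}";',
    )
    .load()
)

(
    raw.select(F.col("value").cast("string").alias("payload"),
               F.col("timestamp").alias("event_time"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/mnt/chk/orders_bronze")  # hypothetical path
    .trigger(availableNow=True)  # or a processingTime trigger for continuous runs
    .toTable("bronze.orders_events")
)
```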
Cost Governance & FinOps
- DBU & Cost Management: A track record of implementing Databricks cluster policies and tagging strategies to monitor and reduce DBU (Databricks Unit) consumption (see the policy sketch below).
- Optimization of ADF Triggers: Knowing when to use Tumbling Window vs. Schedule triggers, and how to tune Integration Runtimes to minimize execution costs.
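To illustrate the DBU-governance item, here is a hedged sketch of a cost-guardrail cluster policy created with the `databricks-sdk` Python package (one option among several; Terraform or the REST API work equally well). The limits and tag value are illustrative assumptions.

```python
import json
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up auth from the environment or a config profile

# Policy definition follows the Databricks cluster-policy format:
# cap idle time and autoscaling, and force a cost-attribution tag.
policy = {
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}

w.cluster_policies.create(
    name="cost-guardrails-standard",
    definition=json.dumps(policy),
)
```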
#India