Synopsis of the role
As a Data Engineer, you will be a key builder within our data ecosystem, responsible for developing and maintaining the scalable data pipelines that power our business. Working closely with Lead Engineers and Architects, you will use Azure Databricks, PySpark, and Azure Data Factory to transform raw data into actionable insights. You will apply software engineering best practices to data processing, ensuring our Medallion Architecture remains performant, reliable, and secure.
What you'll do
As a Data Engineer, you will focus on the development, automation, and optimization of our cloud data platform. Your core responsibilities include:
- Pipeline Development: Build and deploy robust ETL/ELT workflows using Azure Data Factory (ADF) to ingest data from diverse internal and external sources.
- Spark Engineering: Write clean, efficient PySpark code to perform complex data transformations, ensuring optimal resource utilization on Databricks clusters.
- Lakehouse Maintenance: Develop and manage Delta Lake tables across Bronze, Silver, and Gold layers, implementing schema enforcement and data quality checks (see the sketch after this list).
- Data Modeling: Translate business requirements into physical data models, implementing Star Schemas and dimensional modeling to support BI tools like Power BI.
- SQL Optimization: Author and tune sophisticated SQL queries for data validation, ad-hoc analysis, and reporting layer performance.
- Data Governance Support: Work within Unity Catalog to manage data assets, ensuring proper tagging, documentation, and adherence to access control policies.
- Automated Testing & CI/CD: Participate in the full DevOps lifecycle, writing unit tests for Spark logic and using Azure DevOps for continuous integration and deployment.
- Monitoring & Troubleshooting: Proactively monitor pipeline health, identify bottlenecks, and resolve production issues to maintain high data availability.
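To give candidates a feel for this work, here is a minimal PySpark sketch of a Bronze-to-Silver promotion with a basic quality rule, type normalization, and Delta schema enforcement on write. The table and column names (bronze.orders, silver.orders, order_id) are illustrative placeholders, not our actual schemas.

```python
# Bronze -> Silver promotion: quality rule, type normalization, key-level
# dedup, written back with Delta schema enforcement.
# All table/column names are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.table("bronze.orders")

silver = (
    bronze
    .filter(F.col("order_id").isNotNull())                # quality rule: drop rows missing the key
    .withColumn("order_ts", F.to_timestamp("order_ts"))   # normalize string timestamps
    .dropDuplicates(["order_id"])                         # one row per business key
)

(
    silver.write
    .format("delta")
    .mode("append")
    .saveAsTable("silver.orders")                         # Delta enforces the existing schema on write
)
```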
What experience you need
- Total Data Engineering Experience: 3-6 years of hands-on experience in data engineering, ETL development, or backend software engineering with a data focus.
- Azure Foundations: 4+ years of experience working within the Azure cloud environment (Storage Accounts, Key Vault, Resource Groups).
- Databricks & PySpark: 4+ years of experience building data transformation logic specifically using Databricks and Spark (Python preferred).
- Relational Mastery: 3+ years of strong SQL skills, with a deep understanding of joins, window functions, and query execution plans (a short window-function example follows this list).
- Orchestration: Proven experience building multi-stage pipelines in Azure Data Factory or similar tools (e.g., Airflow, Synapse Pipelines).
- Data Modeling Basics: Solid understanding of data warehousing concepts, including slowly changing dimensions (SCD) and Fact/Dimension table design.
- Education/Certifications: Bachelor's degree in CS or a related field. An Azure Data Engineer Associate (DP-203) certification is highly preferred.
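As a flavor of the SQL depth we look for, the snippet below uses a window function in PySpark to keep each customer's most recent order. The table name is hypothetical and an active SparkSession is assumed.

```python
# Window-function example: keep only each customer's most recent order.
# Table name (silver.orders) is hypothetical; assumes an active SparkSession.
from pyspark.sql import Window, functions as F

w = Window.partitionBy("customer_id").orderBy(F.col("order_ts").desc())

latest_orders = (
    spark.read.table("silver.orders")
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)   # rank 1 = most recent order per customer
    .drop("rn")
)
```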
What could set you apart
Modern Data Stack Features
- Delta Live Tables (DLT): Experience using DLT to simplify streaming and batch ETL development (a brief sketch follows below).
- Databricks SQL: Familiarity with configuring SQL Warehouses for analyst self-service.
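For context, a DLT step typically pairs a table definition with a declarative expectation, as in the sketch below. This code runs inside a DLT pipeline; the dataset names are placeholders.

```python
# Delta Live Tables sketch: declare a Silver table with an expectation that
# drops rows failing the rule. Dataset names are placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned events promoted to the Silver layer")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def silver_events():
    return (
        dlt.read("bronze_events")                          # upstream DLT dataset
        .withColumn("processed_at", F.current_timestamp())
    )
```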
Software Engineering Rigor
- Testing Frameworks: Experience with pytest or chispa for validating Spark transformations (see the example after this list).
- Python Proficiency: Strong general-purpose Python skills beyond just Spark (API integrations, automation scripts).
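A representative unit test in this style is sketched below; clean_orders and its module path are hypothetical stand-ins for a transformation under test.

```python
# pytest + chispa sketch: assert a PySpark transformation drops null keys.
# clean_orders / my_pipeline.transforms are hypothetical names.
import pytest
from pyspark.sql import SparkSession
from chispa import assert_df_equality

from my_pipeline.transforms import clean_orders  # hypothetical function under test


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").getOrCreate()


def test_clean_orders_drops_null_keys(spark):
    source = spark.createDataFrame([(1, "A"), (None, "B")], ["order_id", "status"])
    expected = spark.createDataFrame([(1, "A")], ["order_id", "status"])
    assert_df_equality(clean_orders(source), expected)
```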
Performance & Scaling
- Partitioning & Z-Ordering: Deep understanding of how to optimize Delta tables for large-scale query performance (illustrated after this list).
- Streaming: Experience with Structured Streaming for real-time data ingestion from Event Hubs or Kafka.
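To illustrate, a periodic maintenance job might compact and Z-Order a large Delta table as below (delta-spark 2.0+ API). Table and column names are placeholders, and an active SparkSession is assumed.

```python
# Compaction + Z-Ordering via the delta-spark Python API (Delta Lake 2.0+).
# Co-locating rows on a common filter column lets queries skip more files.
# Table/column names are placeholders; assumes an active SparkSession.
from delta.tables import DeltaTable

events = DeltaTable.forName(spark, "gold.events")
events.optimize().executeZOrderBy("customer_id")

# Equivalent SQL form:
# spark.sql("OPTIMIZE gold.events ZORDER BY (customer_id)")
```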
Security & Compliance
- Networking: Understanding of Azure VNet integration, Private Links, and secure data transit.
- Data Privacy: Experience implementing data masking or encryption at rest/in transit.
Advanced Data Governance & Security
- Unity Catalog Implementation: Experience configuring and managing Unity Catalog for fine-grained access control (Row-Level Security and Column-Level Masking) and tracking end-to-end data lineage; a short sketch follows this list.
- Data Quality Frameworks: Expertise in building automated data validation using frameworks like Great Expectations or Databricks Expectations (DLT) to ensure data integrity before it reaches the Gold layer.
- Metadata Management: Ability to maintain a searchable data catalog, ensuring all assets are tagged for PII (Personally Identifiable Information) and remain compliant with GDPR/CCPA regulations.
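As a sketch of what this looks like in practice, the statements below attach a row filter and a column mask in Unity Catalog, issued as SQL from PySpark. The catalog, schema, table, and group names are all illustrative.

```python
# Unity Catalog governance sketch, issued as SQL from PySpark.
# Catalog/schema/table/group names are illustrative only.

# Row-level security: non-admins only see US rows
spark.sql("""
    CREATE OR REPLACE FUNCTION main.governance.us_only(region STRING)
    RETURN IS_ACCOUNT_GROUP_MEMBER('admins') OR region = 'US'
""")
spark.sql("""
    ALTER TABLE main.sales.orders
    SET ROW FILTER main.governance.us_only ON (region)
""")

# Column-level masking: redact email outside the pii_readers group
spark.sql("""
    CREATE OR REPLACE FUNCTION main.governance.mask_email(email STRING)
    RETURN CASE WHEN IS_ACCOUNT_GROUP_MEMBER('pii_readers') THEN email
                ELSE '***REDACTED***' END
""")
spark.sql("""
    ALTER TABLE main.sales.orders
    ALTER COLUMN email SET MASK main.governance.mask_email
""")
```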
Sophisticated CI/CD & DataOps
- Infrastructure as Code (IaC): Proficiency in using Terraform or Bicep to deploy and manage Azure resources (Databricks workspaces, Key Vaults, Storage Accounts) as code.
- Automated Testing Suites: Experience implementing a Test-Driven Development (TDD) approach for data, using pytest or chispa to run unit tests on PySpark transformations within the build pipeline.
- Azure DevOps Integration: Mastery of YAML-based Azure Pipelines for automated deployment, including specialized tasks for Databricks Asset Bundles (DABs) or the Databricks CLI.
- Environment Parity & Promotion: Proven ability to manage complex deployment patterns (Dev > QA > Prod), ensuring seamless promotion of code, ADF triggers, and Databricks job configurations.
- Monitoring & Alerting: Experience setting up proactive monitoring with Azure Monitor and Log Analytics to track pipeline failures and cluster performance in real time (see the sketch after this list).
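For illustration, the sketch below queries Log Analytics for recent ADF pipeline failures using the azure-monitor-query SDK. The workspace ID is a placeholder, and the ADFPipelineRun table assumes resource-specific diagnostic logging is enabled for the Data Factory.

```python
# Query Log Analytics for ADF pipeline failures in the last 24 hours.
# Assumes ADF diagnostics are routed to a resource-specific table
# (ADFPipelineRun); the workspace ID is a placeholder.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

kql = """
ADFPipelineRun
| where Status == 'Failed'
| project TimeGenerated, PipelineName, FailureType
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",   # placeholder
    query=kql,
    timespan=timedelta(hours=24),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```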
#India