

Data Engineer

5-7 Years

Job Description

Role Overview

We are looking for a hands-on, senior Databricks Architect to design, build, and govern our Lakehouse data platform from the ground up. You will own the end-to-end architecture of our data infrastructure, from raw ingestion through the Medallion layers to serving, and establish the engineering standards that will guide the entire data organization.

This is a highly strategic and technical role focused on driving adoption of Databricks, Unity Catalog, and modern Lakehouse patterns across all data products and pipelines.

Key Responsibilities

Lakehouse Architecture & Design

  • Design and implement a production-grade Medallion Architecture (Bronze / Silver / Gold) across all data pipelines (a Bronze-to-Silver sketch follows this list).
  • Establish best practices for Delta Lake table design, partitioning strategies, Z-ordering, and optimization across large-scale datasets.
  • Define data modeling standards and schema evolution policies across the Lakehouse.
  • Architect end-to-end data flows from ingestion (streaming and batch) through transformation and serving layers.
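
As a concrete illustration of the Bronze-to-Silver hop referenced above, here is a minimal PySpark sketch on Delta Lake. All table names, paths, and columns are hypothetical assumptions, not details from this posting.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: land raw JSON as-is, adding load metadata for auditability.
    raw = (spark.read.format("json")
           .load("s3://example-bucket/raw/orders/")
           .withColumn("_ingested_at", F.current_timestamp()))
    raw.write.format("delta").mode("append").saveAsTable("bronze.orders")

    # Silver: deduplicate, conform types, and partition on a low-cardinality date.
    silver = (spark.read.table("bronze.orders")
              .dropDuplicates(["order_id"])
              .withColumn("order_date", F.to_date("order_ts")))
    (silver.write.format("delta")
           .mode("overwrite")
           .partitionBy("order_date")
           .saveAsTable("silver.orders"))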

Unity Catalog & Data Governance

  • Lead the setup, configuration, and rollout of Unity Catalog as the centralized governance layer for all data assets.
  • Design metastore hierarchy, catalog/schema/table organization, and tagging standards.
  • Implement fine-grained access control (row-level, column-level), data masking policies, and audit logging (see the governance sketch after this list).
  • Establish data lineage tracking and ensure end-to-end visibility across all pipelines.
  • Define and enforce data classification and sensitivity frameworks for PII and regulated data assets.
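
As a hedged sketch of what the governance work above can look like in practice, the snippet below runs Unity Catalog grant, row-filter, and column-mask statements from a notebook. The catalog, schema, table, function, and group names are assumptions for illustration only.

    # Assumes a Databricks notebook, where `spark` is provided by the runtime.
    for stmt in [
        # Fine-grained access: grant read on a Silver table to an analyst group.
        "GRANT SELECT ON TABLE main.silver.orders TO `data_analysts`",
        # Row-level security: a filter function gating rows by region and group.
        """CREATE OR REPLACE FUNCTION main.governance.us_rows_only(region STRING)
           RETURN region = 'US' OR is_account_group_member('global_readers')""",
        "ALTER TABLE main.silver.orders SET ROW FILTER main.governance.us_rows_only ON (region)",
        # Column-level masking: redact email for users outside the PII group.
        """CREATE OR REPLACE FUNCTION main.governance.mask_email(email STRING)
           RETURN CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '***' END""",
        "ALTER TABLE main.silver.customers ALTER COLUMN email SET MASK main.governance.mask_email",
    ]:
        spark.sql(stmt)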

Pipeline Development & Orchestration

  • Build and maintain production-grade data pipelines using PySpark, Delta Live Tables (DLT), and Databricks Workflows / Jobs (a DLT sketch follows this list).
  • Design modular, reusable pipeline patterns including incremental ingestion, CDC (Change Data Capture), and full-refresh strategies.
  • Implement robust pipeline observability: logging, alerting, lineage tracking, and SLA monitoring.
  • Leverage Databricks Repos for CI/CD integration, managing code promotion across dev / staging / production environments.
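
For orientation, here is a minimal Delta Live Tables sketch in the declarative style named above; the source path and column names are hypothetical.

    # Runs inside a DLT pipeline, where `spark` is provided by the runtime.
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders landed incrementally with Auto Loader.")
    def bronze_orders():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("s3://example-bucket/raw/orders/"))

    @dlt.table(comment="Cleaned orders guarded by a data quality expectation.")
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
    def silver_orders():
        return (dlt.read_stream("bronze_orders")
                .withColumn("order_date", F.to_date("order_ts")))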

Performance & Compute Optimization

  • Optimize Spark execution plans, identifying and resolving performance bottlenecks across large-scale distributed workloads (a Delta table-maintenance sketch, one routine remedy, follows this list).
  • Right-size cluster configurations: serverless warehouses, auto-scaling job clusters, and Photon-enabled SQL warehouses.
  • Leverage serverless SQL warehouses for BI and ad hoc analytics workloads, minimizing cost and cold-start latency.
  • Manage cost governance for compute, storage, and DBU consumption across workspaces.
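
One routine lever behind several of these bullets is Delta table maintenance. The sketch below shows compaction with Z-ordering plus vacuuming; the table and column names are assumptions.

    # Assumes a notebook session where `spark` is provided by the runtime.
    # OPTIMIZE compacts small files; ZORDER BY co-locates rows on a frequently
    # filtered column so downstream queries scan less data.
    spark.sql("OPTIMIZE main.silver.orders ZORDER BY (customer_id)")
    # VACUUM removes files no longer referenced by the Delta log; the default
    # retention window protects time travel and in-flight readers.
    spark.sql("VACUUM main.silver.orders")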

Developer Experience & Standards

  • Set up and maintain Databricks Repos with standardized project structures and Git integration.
  • Define Python coding standards, notebook best practices, and modular library patterns for the data engineering team.
  • Build reusable Python utility libraries for common patterns: schema validation, data quality checks, Delta operations, and logging.
  • Establish unit testing and integration testing frameworks for Spark pipelines (a minimal pytest sketch follows).
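
Below is a minimal pytest sketch for a Spark transformation, in the spirit of the testing standards above; the function under test is hypothetical.

    import pytest
    from pyspark.sql import SparkSession, functions as F

    def add_order_date(df):
        # Transformation under test: derive a date column from a timestamp string.
        return df.withColumn("order_date", F.to_date("order_ts"))

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    def test_add_order_date(spark):
        df = spark.createDataFrame(
            [("o1", "2024-01-15 10:00:00")], ["order_id", "order_ts"])
        out = add_order_date(df)
        assert str(out.first()["order_date"]) == "2024-01-15"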

Security, Compliance & Networking

  • Configure workspace-level and account-level security: AWS PrivateLink, IP access lists, and secrets management via Databricks Secrets or AWS Secrets Manager.
  • Design and enforce network isolation for sensitive data workloads.
  • Ensure compliance with data residency and access control requirements for customer data.

Collaboration & Enablement

  • Partner with data engineers, data scientists, and analytics engineers to ensure the platform meets diverse workload needs.
  • Mentor the engineering team on Databricks, Spark optimization, and Lakehouse best practices.
  • Produce architectural documentation, runbooks, and internal knowledge bases.
  • Evaluate and recommend new Databricks features and third-party integrations relevant to the organization's data roadmap.

Required Qualifications

Core Databricks & Lakehouse

  • 5+ years of hands-on experience with Databricks, with at least 2 years in an architect or senior lead role.
  • Deep expertise in Unity Catalog: metastore setup, three-level namespace, ACL design, and data governance workflows.
  • Strong mastery of the Medallion Architecture and Delta Lake: ACID transactions, time travel, compaction, and OPTIMIZE/VACUUM strategies.
  • Proven experience designing and deploying production pipelines with Databricks Jobs and Workflows, including multi-task job DAGs, retry logic, and notifications.
  • Hands-on experience with Databricks Repos and CI/CD integration for notebook and Python library deployments.
  • Experience configuring and operating Serverless SQL Warehouses and Serverless compute for Jobs.

Apache Spark

  • Expert-level PySpark development: DataFrames, Spark SQL, window functions, broadcast joins, and UDFs (a window-function sketch follows this list).
  • Strong understanding of Spark internals: DAG execution, shuffle optimization, memory management, and speculative execution.
  • Experience with structured streaming and micro-batch processing patterns.
  • Proven ability to diagnose and resolve Spark performance issues using Spark UI and event logs.
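
As a small illustration of the window-function fluency listed above, this sketch keeps each customer's most recent order; table and column names are assumptions.

    # Assumes a notebook session where `spark` is provided by the runtime.
    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("customer_id").orderBy(F.col("order_ts").desc())
    latest = (spark.read.table("main.silver.orders")
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")   # rank rows per customer, keep only the newest
              .drop("rn"))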

Python & Software Engineering

  • Advanced Python skills with a strong software engineering background: packaging, testing (pytest), virtual environments, and dependency management.
  • Experience building modular Python libraries for data engineering use cases.
  • Familiarity with common data engineering libraries: pandas, pydantic, and great_expectations or similar data quality frameworks.

Cloud & Infrastructure

  • Experience deploying Databricks on AWS, including workspace provisioning, IAM integration, and VPC configuration.
  • Familiarity with cloud-native storage (S3/ADLS), external locations in Unity Catalog, and storage credential management (an external-location sketch follows this list).
  • Exposure to infrastructure-as-code tooling (Terraform, Databricks Asset Bundles, or similar).
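
As a hedged example of the storage plumbing above, this snippet registers a Unity Catalog external location over an S3 prefix and grants read access; the credential, bucket, location, and group names are assumptions.

    # Assumes a storage credential named `example_storage_credential` exists and
    # that `spark` is provided by the notebook runtime.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS lakehouse_raw
        URL 's3://example-bucket/raw/'
        WITH (STORAGE CREDENTIAL example_storage_credential)
    """)
    spark.sql("GRANT READ FILES ON EXTERNAL LOCATION lakehouse_raw TO `data_engineers`")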

Preferred Qualifications

  • Databricks Certified Data Engineer Professional or Databricks Certified Associate Developer for Apache Spark certifications.
  • Experience with Delta Live Tables (DLT) for declarative pipeline authoring.
  • Familiarity with dbt (data build tool) integrated with Databricks SQL.
  • Experience with Databricks Feature Store or MLflow for ML platform use cases.
  • Exposure to Databricks Marketplace and Partner Connect integrations.
  • Experience with Elasticsearch, Apache Kafka, or other streaming/search technologies complementary to the Lakehouse.

Job ID: 144632601
