

Data Engineer

5-7 Years

Job Description

Role Overview

We are looking for a hands-on, senior Databricks Architect to design, build, and govern our Lakehouse data platform from the ground up. You will own the end-to-end architecture of our data infrastructure, from raw ingestion through the Medallion layers to serving, and establish the engineering standards that will guide the entire data organization.

This is a highly strategic and technical role focused on driving adoption of Databricks, Unity Catalog, and modern Lakehouse patterns across all data products and pipelines.

Key Responsibilities

Lakehouse Architecture & Design

  • Design and implement a production-grade Medallion Architecture (Bronze / Silver / Gold) across all data pipelines (a Bronze-to-Silver sketch follows this list).
  • Establish best practices for Delta Lake table design, partitioning strategies, Z-ordering, and optimization across large-scale datasets.
  • Define data modeling standards and schema evolution policies across the Lakehouse.
  • Architect end-to-end data flows from ingestion (streaming and batch) through transformation and serving layers.
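
As a concrete illustration of the Bronze-to-Silver hop referenced above, here is a minimal PySpark sketch on Delta Lake. All table names, paths, and columns are hypothetical assumptions, not details from this posting.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: land raw JSON as-is, adding load metadata for auditability.
    raw = (spark.read.format("json")
           .load("s3://example-bucket/raw/orders/")
           .withColumn("_ingested_at", F.current_timestamp()))
    raw.write.format("delta").mode("append").saveAsTable("bronze.orders")

    # Silver: deduplicate, conform types, and partition on a low-cardinality date.
    silver = (spark.read.table("bronze.orders")
              .dropDuplicates(["order_id"])
              .withColumn("order_date", F.to_date("order_ts")))
    (silver.write.format("delta")
           .mode("overwrite")
           .partitionBy("order_date")
           .saveAsTable("silver.orders"))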

Unity Catalog & Data Governance

  • Lead the setup, configuration, and rollout of Unity Catalog as the centralized governance layer for all data assets.
  • Design metastore hierarchy, catalog/schema/table organization, and tagging standards.
  • Implement fine-grained access control (row-level, column-level), data masking policies, and audit logging (see the governance sketch after this list).
  • Establish data lineage tracking and ensure end-to-end visibility across all pipelines.
  • Define and enforce data classification and sensitivity frameworks for PII and regulated data assets.
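
As a hedged sketch of what the governance work above can look like in practice, the snippet below runs Unity Catalog grant, row-filter, and column-mask statements from a notebook. The catalog, schema, table, function, and group names are assumptions for illustration only.

    # Assumes a Databricks notebook, where `spark` is provided by the runtime.
    for stmt in [
        # Fine-grained access: grant read on a Silver table to an analyst group.
        "GRANT SELECT ON TABLE main.silver.orders TO `data_analysts`",
        # Row-level security: a filter function gating rows by region and group.
        """CREATE OR REPLACE FUNCTION main.governance.us_rows_only(region STRING)
           RETURN region = 'US' OR is_account_group_member('global_readers')""",
        "ALTER TABLE main.silver.orders SET ROW FILTER main.governance.us_rows_only ON (region)",
        # Column-level masking: redact email for users outside the PII group.
        """CREATE OR REPLACE FUNCTION main.governance.mask_email(email STRING)
           RETURN CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '***' END""",
        "ALTER TABLE main.silver.customers ALTER COLUMN email SET MASK main.governance.mask_email",
    ]:
        spark.sql(stmt)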

Pipeline Development & Orchestration

  • Build and maintain production-grade data pipelines using PySpark, Delta Live Tables (DLT), and Databricks Workflows / Jobs (a DLT sketch follows this list).
  • Design modular, reusable pipeline patterns including incremental ingestion, CDC (Change Data Capture), and full-refresh strategies.
  • Implement robust pipeline observability: logging, alerting, lineage tracking, and SLA monitoring.
  • Leverage Databricks Repos for CI/CD integration, managing code promotion across dev / staging / production environments.
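
For orientation, here is a minimal Delta Live Tables sketch in the declarative style named above; the source path and column names are hypothetical.

    # Runs inside a DLT pipeline, where `spark` is provided by the runtime.
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders landed incrementally with Auto Loader.")
    def bronze_orders():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("s3://example-bucket/raw/orders/"))

    @dlt.table(comment="Cleaned orders guarded by a data quality expectation.")
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
    def silver_orders():
        return (dlt.read_stream("bronze_orders")
                .withColumn("order_date", F.to_date("order_ts")))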

Performance & Compute Optimization

  • Optimize Spark execution plans, identifying and resolving performance bottlenecks across large-scale distributed workloads (a Delta table-maintenance sketch, one routine remedy, follows this list).
  • Right-size cluster configurations: serverless warehouses, auto-scaling job clusters, and Photon-enabled SQL warehouses.
  • Leverage serverless SQL warehouses for BI and ad hoc analytics workloads, minimizing cost and cold-start latency.
  • Manage cost governance for compute, storage, and DBU consumption across workspaces.
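
One routine lever behind several of these bullets is Delta table maintenance. The sketch below shows compaction with Z-ordering plus vacuuming; the table and column names are assumptions.

    # Assumes a notebook session where `spark` is provided by the runtime.
    # OPTIMIZE compacts small files; ZORDER BY co-locates rows on a frequently
    # filtered column so downstream queries scan less data.
    spark.sql("OPTIMIZE main.silver.orders ZORDER BY (customer_id)")
    # VACUUM removes files no longer referenced by the Delta log; the default
    # retention window protects time travel and in-flight readers.
    spark.sql("VACUUM main.silver.orders")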

Developer Experience & Standards

  • Set up and maintain Databricks Repos with standardized project structures and Git integration.
  • Define Python coding standards, notebook best practices, and modular library patterns for the data engineering team.
  • Build reusable Python utility libraries for common patterns: schema validation, data quality checks, Delta operations, and logging.
  • Establish unit testing and integration testing frameworks for Spark pipelines (a minimal pytest sketch follows).
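
Below is a minimal pytest sketch for a Spark transformation, in the spirit of the testing standards above; the function under test is hypothetical.

    import pytest
    from pyspark.sql import SparkSession, functions as F

    def add_order_date(df):
        # Transformation under test: derive a date column from a timestamp string.
        return df.withColumn("order_date", F.to_date("order_ts"))

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    def test_add_order_date(spark):
        df = spark.createDataFrame(
            [("o1", "2024-01-15 10:00:00")], ["order_id", "order_ts"])
        out = add_order_date(df)
        assert str(out.first()["order_date"]) == "2024-01-15"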

Security, Compliance & Networking

  • Configure workspace-level and account-level security: AWS PrivateLink, IP access lists, and secrets management via Databricks Secrets or AWS Secrets Manager.
  • Design and enforce network isolation for sensitive data workloads.
  • Ensure compliance with data residency and access control requirements for customer data.

Collaboration & Enablement

  • Partner with data engineers, data scientists, and analytics engineers to ensure the platform meets diverse workload needs.
  • Mentor the engineering team on Databricks, Spark optimization, and Lakehouse best practices.
  • Produce architectural documentation, runbooks, and internal knowledge bases.
  • Evaluate and recommend new Databricks features and third-party integrations relevant to the organization's data roadmap.

Required Qualifications

Core Databricks & Lakehouse

  • 5+ years of hands-on experience with Databricks, with at least 2 years in an architect or senior lead role.
  • Deep expertise in Unity Catalog: metastore setup, three-level namespace, ACL design, and data governance workflows.
  • Strong mastery of the Medallion Architecture and Delta Lake: ACID transactions, time travel, compaction, and OPTIMIZE/VACUUM strategies.
  • Proven experience designing and deploying production pipelines with Databricks Jobs and Workflows, including multi-task job DAGs, retry logic, and notifications.
  • Hands-on experience with Databricks Repos and CI/CD integration for notebook and Python library deployments.
  • Experience configuring and operating Serverless SQL Warehouses and Serverless compute for Jobs.

Apache Spark

  • Expert-level PySpark development: DataFrames, Spark SQL, window functions, broadcast joins, and UDFs (a window-function sketch follows this list).
  • Strong understanding of Spark internals: DAG execution, shuffle optimization, memory management, and speculative execution.
  • Experience with structured streaming and micro-batch processing patterns.
  • Proven ability to diagnose and resolve Spark performance issues using Spark UI and event logs.
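
As a small illustration of the window-function fluency listed above, this sketch keeps each customer's most recent order; table and column names are assumptions.

    # Assumes a notebook session where `spark` is provided by the runtime.
    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("customer_id").orderBy(F.col("order_ts").desc())
    latest = (spark.read.table("main.silver.orders")
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")   # rank rows per customer, keep only the newest
              .drop("rn"))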

Python & Software Engineering

  • Advanced Python skills with a strong software engineering background: packaging, testing (pytest), virtual environments, and dependency management.
  • Experience building modular Python libraries for data engineering use cases.
  • Familiarity with common data engineering libraries: pandas, pydantic, and great_expectations or similar data quality frameworks.

Cloud & Infrastructure

  • Experience deploying Databricks on AWS, including workspace provisioning, IAM integration, and VPC configuration.
  • Familiarity with cloud-native storage (S3/ADLS), external locations in Unity Catalog, and storage credential management (an external-location sketch follows this list).
  • Exposure to infrastructure-as-code tooling (Terraform, Databricks Asset Bundles, or similar).
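
As a hedged example of the storage plumbing above, this snippet registers a Unity Catalog external location over an S3 prefix and grants read access; the credential, bucket, location, and group names are assumptions.

    # Assumes a storage credential named `example_storage_credential` exists and
    # that `spark` is provided by the notebook runtime.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS lakehouse_raw
        URL 's3://example-bucket/raw/'
        WITH (STORAGE CREDENTIAL example_storage_credential)
    """)
    spark.sql("GRANT READ FILES ON EXTERNAL LOCATION lakehouse_raw TO `data_engineers`")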

Preferred Qualifications

  • Databricks Certified Data Engineer Professional or Databricks Certified Associate Developer for Apache Spark certifications.
  • Experience with Delta Live Tables (DLT) for declarative pipeline authoring.
  • Familiarity with dbt (data build tool) integrated with Databricks SQL.
  • Experience with Databricks Feature Store or MLflow for ML platform use cases.
  • Exposure to Databricks Marketplace and Partner Connect integrations.
  • Experience with Elasticsearch, Apache Kafka, or other streaming/search technologies complementary to the Lakehouse.

Job ID: 144632601
