Role overview
We're looking for a Data Engineer who thrives on building robust real-time and batch data products on Microsoft Fabric. You'll design and operate ingestion from streaming sources (Event Hubs, Service Bus, Confluent Kafka), model curated Silver/Gold layers in the Lakehouse, optimize KQL and Spark pipelines, and enable trustworthy, fast Power BI dashboards (including Direct Lake and semantic models).
What you'll do
- Design and implement scalable data pipelines (batch + streaming) from diverse sources (REST, SFTP, RDBMS, Kafka/Event Hubs/Service Bus) into a lakehouse and OneLake.
- Model and curate datasets using medallion architecture; build reusable frameworks for ingestion, schema evolution, and incremental processing.
- Write efficient transformations in Spark (PySpark/SQL) and/or KQL; create materialized views, update policies, and optimization strategies for cost/perf.
- Implement CDC, watermarking, late-arrival handling, and idempotent writes for append/merge scenarios (see the PySpark sketch after this list).
- Enforce data quality, observability, and lineage (DQ rules, expectations, SLAs, alerts, metadata catalogs).
- Apply security & governance best practices (PII hashing/tokenization, access controls, secrets management).
- Productionize workloads with orchestration (Airflow/ADF/Azure Synapse/Step Functions/Glue), CI/CD, testing, and rollout strategies.
- Partner with product/analytics teams to define SLAs, table contracts, and consumption patterns; create reliable semantic layers.
- Troubleshoot performance, skew, and reliability issues; tune storage (Delta/Parquet/Iceberg) and compute configurations.
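To give a flavor of the merge/idempotency work above, here is a minimal PySpark sketch of a deduplicated, idempotent upsert into a Silver Delta table. The table and column names (bronze_orders, silver.orders, event_id, event_ts) are hypothetical placeholders, not a prescribed design.

```python
# Minimal sketch of an idempotent upsert from a Bronze batch into a Silver Delta table.
# Table/column names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Keep only the latest version of each event so replays of the same batch are harmless.
latest = (
    spark.table("bronze_orders")
    .withColumn(
        "rn",
        F.row_number().over(Window.partitionBy("event_id").orderBy(F.col("event_ts").desc())),
    )
    .filter("rn = 1")
    .drop("rn")
)

target = DeltaTable.forName(spark, "silver.orders")

# MERGE keeps the write idempotent: re-running the job updates rows instead of duplicating them.
(
    target.alias("t")
    .merge(latest.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll(condition="s.event_ts >= t.event_ts")  # skip stale, late-arriving versions
    .whenNotMatchedInsertAll()
    .execute()
)
```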
What you'll bring
- 6+ years of data engineering experience (title flexible: Data Engineer / Senior Data Engineer).
- Strong SQL and one of Python/Scala. Deep familiarity with Spark (PySpark/SQL) and distributed data patterns.
- Hands-on with one or more clouds (Azure/AWS/GCP) and a lakehouse stack (e.g., Databricks, Delta Lake, Fabric Lakehouse/Eventhouse, or Synapse; BigQuery/Snowflake a plus).
- Streaming experience: Kafka/Confluent, Azure Event Hubs, or Service Bus; schema registries and exactly-once/at-least-once delivery semantics.
- Solid understanding of medallion architecture, CDC, SCD, upserts/merge, partitioning, Z-ordering, compaction, and vacuum (see the table-maintenance sketch after this list).
- Orchestration & DevOps: Airflow/ADF/Glue/Step Functions; Git-based workflows, unit/integration tests, environments, and IaC (Terraform/ARM/CDK) preferred.
- Data quality & governance: expectations/testing, lineage/metadata, RBAC/ABAC, PII protection (hashing/salting/tokenization).
- Comfortable owning services in production: monitoring, alerting, SLIs/SLOs, on-call rotation.
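As an illustration of the Delta maintenance work listed above (partitioning, Z-ordering, compaction, vacuum), here is a short sketch using the delta-spark Python API; the table and column names (gold.sales_daily, sale_date, customer_id) are placeholders.

```python
# Routine maintenance on a curated Delta table: compact small files, Z-order, vacuum.
# Assumes sale_date is the partition column; all names here are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
tbl = DeltaTable.forName(spark, "gold.sales_daily")

# Compact recent partitions and co-locate rows by a common filter column.
tbl.optimize().where("sale_date >= '2024-01-01'").executeZOrderBy("customer_id")

# Drop data files no longer referenced by the table log (7-day retention here).
tbl.vacuum(retentionHours=168)
```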
What you've done (must-haves)
- 5+ years in data engineering with cloud data platforms (Azure preferred).
- Hands-on with Microsoft Fabric components: Eventhouse (KQL), Lakehouse (Delta on OneLake), Spark notebooks, Data Factory (Fabric pipelines), Power BI (including Direct Lake).
- Solid SQL/KQL/PySpark; comfort with nested JSON, mv-expand, update policies, materialized views, partitioning.
- Built production-grade streaming + batch pipelines; handled late/duplicate events, watermarking, and idempotency (see the streaming sketch after this list).
- Strong grasp of data modeling, performance tuning, and data quality (unit tests, anomaly checks, SLAs).
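For the streaming must-haves, a minimal Structured Streaming sketch of watermarking plus duplicate handling follows; the source and sink tables, checkpoint path, and column names are assumptions for illustration only.

```python
# Sketch of late/duplicate event handling with Structured Streaming watermarks.
# Source (bronze_events), sink (silver.events), checkpoint path, and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.table("bronze_events")

deduped = (
    events
    .withWatermark("event_ts", "30 minutes")       # bound state; events arriving later than this are dropped
    .dropDuplicatesWithinWatermark(["event_id"])   # Spark 3.5+; fall back to dropDuplicates on older runtimes
)

query = (
    deduped.writeStream
    .option("checkpointLocation", "Files/checkpoints/silver_events")  # placeholder path
    .toTable("silver.events")
)
```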
Nice to have
- Confluent Kafka private networking patterns; CDC from operational stores.
- Azure ecosystem: ADLS/OneLake, Key Vault, AAD, Purview, Event Hubs, Service Bus.
- MLOps/feature store basics; Python packaging & testing (pytest).
- Governance & compliance (GDPR/CCPA), PII handling, and secrets management (see the hashing sketch below).
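On the PII-handling side, a brief PySpark sketch of salted hashing; the table/column names are illustrative, and in practice the salt would come from a secret store such as Key Vault rather than source code.

```python
# Illustrative salted hashing of a PII column before publishing to a curated layer.
# Names and the inline salt are placeholders only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
salt = "replace-with-secret-from-key-vault"  # fetch from a secret store in practice

customers = spark.table("bronze_customers")
masked = (
    customers
    .withColumn("email_hash", F.sha2(F.concat(F.col("email"), F.lit(salt)), 256))
    .drop("email")  # keep only the hashed surrogate downstream
)
```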
Tech stack you'll touch
- Microsoft Fabric: Eventhouse/KQL, Lakehouse/Delta, Spark notebooks, Data Factory, Power BI (Direct Lake)
- Azure: Event Hubs, Service Bus, AAD, Key Vault, Purview
- Languages/tools: SQL, KQL, PySpark, Python, Git, CI/CD (Azure DevOps/GitHub)