Job Description: Principal Data Engineer (15+ Years) | Location: Bangalore
Role Summary
We are looking for a Principal Data Engineer, a senior technical leader and hands-on architect, to own the design and evolution of an AWS + Databricks enterprise lakehouse spanning ingestion (batch/streaming), transformation (canonical/Silver), and curated insights (Gold), with strong governance, security, performance, and cost controls. This role sets engineering standards, mentors teams, and drives scalable delivery across multiple domains and source systems.
Key Responsibilities
Architecture & Platform Leadership
- Own the end-to-end lakehouse architecture across Bronze/Silver/Gold and ensure best practices for:
  - Raw landing on S3 + Glue Catalog
  - Canonical modeling and transformations
  - Curated datasets for analytics/AI/consumption
- Define standards for data layout, partitioning, file sizing, compaction, and Iceberg table management (see the sketch after this list).
- Establish platform patterns for batch + streaming ingestion, orchestration, and automated deployment.
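Standards like these are typically codified as DDL plus scheduled maintenance. Below is a minimal sketch assuming a Spark session with the Iceberg runtime and a Glue-backed catalog registered as `glue`; the table, schema, and property values are illustrative, not prescriptive.

```python
# Minimal sketch: partitioning, target file size, and compaction for an
# Iceberg table. Catalog/table names and values are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-standards").getOrCreate()

# Declare partitioning and a ~128 MB target file size up front.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.silver.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_ts    TIMESTAMP,
        amount      DECIMAL(12, 2)
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
    TBLPROPERTIES ('write.target-file-size-bytes' = '134217728')
""")

# Compact small files on a schedule with Iceberg's maintenance procedure.
spark.sql("CALL glue.system.rewrite_data_files(table => 'silver.orders')")
```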
Data Engineering Delivery (Hands-On + Leadership)
- Lead implementation of scalable pipelines in Databricks (PySpark/Spark SQL) and AWS (Glue, Lambda, Step Functions where needed); see the pipeline sketch after this list.
- Design and maintain robust data models (canonical/Silver) and curated marts (Gold) optimized for analytics and downstream use.
- Ensure pipelines meet SLA, reliability, security, and cost objectives.
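As a point of reference, a Bronze-to-Silver step in this stack usually reads raw files, enforces canonical types and keys, and writes to a governed table. The sketch below reuses the hypothetical `glue.silver.orders` Iceberg table; the S3 path and column names are illustrative.

```python
# Minimal Bronze-to-Silver sketch in PySpark; paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.parquet("s3://example-lake/bronze/orders/")  # hypothetical path

silver = (
    bronze
    .dropDuplicates(["order_id"])                          # one row per business key
    .withColumn("order_ts", F.to_timestamp("order_ts"))    # canonical types
    .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
    .filter(F.col("order_id").isNotNull())                 # basic quality gate
)

# Idempotent write: replace only the partitions present in this batch.
silver.writeTo("glue.silver.orders").overwritePartitions()
```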
Streaming and Near Real-Time Enablement
- Define the approach for streaming ingestion and processing (e.g., Kafka/Kinesis patterns where applicable, Structured Streaming/micro-batch).
- Ensure correctness, idempotency, late-arriving data handling, and replay strategies (see the streaming sketch after this list).
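In practice, these guarantees map to watermarks, stateful deduplication, and checkpointed offsets in Spark Structured Streaming. A minimal sketch, assuming a hypothetical Kafka topic and checkpoint path:

```python
# Minimal streaming sketch: watermarking for late data, dedup for idempotency,
# checkpointing for replay. Brokers, topic, and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical
    .option("subscribe", "orders")
    .load()
    .select(
        F.col("key").cast("string").alias("order_id"),
        F.col("timestamp").alias("event_ts"),
    )
)

deduped = (
    events
    .withWatermark("event_ts", "2 hours")        # tolerate late-arriving events
    .dropDuplicates(["order_id", "event_ts"])    # idempotent under re-delivery
)

# The checkpoint makes the query restartable/replayable from known offsets.
query = (
    deduped.writeStream
    .option("checkpointLocation", "s3://example-lake/_checkpoints/orders/")
    .toTable("glue.silver.orders_events")
)
```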
Governance, Security & Compliance
- Implement the security architecture: IAM, least privilege, encryption (KMS), secrets management, and network controls (see the sketch after this list).
- Integrate catalog/lineage/governance (e.g., Atlan) with standardized metadata practices.
- Establish data access patterns including RBAC/ABAC and controlled data sharing with partners.
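For illustration, two of these controls (encryption at rest with KMS, plus a guardrail denying unencrypted writes) might look like the following; the bucket name and key alias are hypothetical, and real policies would normally live in IaC.

```python
# Sketch: SSE-KMS on write plus a bucket policy that denies unencrypted puts.
# Bucket, key alias, and paths are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

# Encrypt objects at rest with a customer-managed KMS key.
s3.put_object(
    Bucket="example-curated-bucket",
    Key="gold/daily_summary/part-0000.parquet",
    Body=b"...",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/example-lake-key",
)

# Guardrail: reject any PutObject that is not SSE-KMS encrypted.
deny_unencrypted = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedPuts",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::example-curated-bucket/*",
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
        },
    }],
}
s3.put_bucket_policy(Bucket="example-curated-bucket", Policy=json.dumps(deny_unencrypted))
```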
Performance, Reliability & FinOps
- Drive optimization across Databricks clusters: job tuning, caching strategies, and query performance.
- Implement observability: pipeline metrics, logs, lineage, and incident runbooks.
- Own the cost optimization strategy: autoscaling, cluster policies, workload isolation, and storage optimization (see the policy sketch after this list).
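Cluster policies are one concrete lever here: they cap autoscaling, force auto-termination, and stamp cost tags at cluster creation time. A minimal sketch of a policy definition, with illustrative values per the Databricks cluster policy schema:

```python
# Sketch: a Databricks cluster policy capping autoscaling, forcing
# auto-termination, and requiring a cost tag. Values are illustrative.
import json

cluster_policy = {
    "autoscale.min_workers": {"type": "range", "minValue": 1, "maxValue": 2},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    "custom_tags.cost_center": {"type": "fixed", "value": "data-platform"},
}

# The serialized form is what the Cluster Policies API accepts.
print(json.dumps(cluster_policy, indent=2))
```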
Engineering Excellence & Team Enablement
- Create reference architectures, coding standards, reusable libraries, and delivery playbooks.
- Mentor data engineers; lead design reviews, code reviews, and production readiness checks.
- Collaborate with stakeholders across Analytics/BI, AI/ML, and product teams.
Must-Have Skills & Experience
- 15+ years in Data Engineering with proven leadership owning enterprise-scale platforms.
- Expert-level Databricks: Spark architecture, PySpark optimization, Spark SQL, workflows, job orchestration, cluster policies.
- Deep AWS expertise: S3, Glue, Lake Formation (if used), IAM, CloudWatch, KMS, VPC/security controls.
- Strong experience with lakehouse table formats: Iceberg (preferred), Delta, or Hudi, plus Parquet optimization.
- Strong architecture skills for data ingestion, canonical modeling, and curated layer design.
- Strong hands-on coding in Python and advanced SQL.
- Experience implementing CI/CD for data (Git branching, deployment automation, environment promotion).
- Experience designing for analytics consumption: semantic layer readiness, BI/Power BI integration patterns.
Nice-to-Have
- Experience with Unity Catalog, multi-workspace governance, data sharing, and fine-grained access controls.
- Exposure to data virtualization patterns (semantic/virtualization layer) and federation strategies.
- AI/ML enablement experience (feature datasets, training data pipelines, governance for GenAI/LLM apps).
- Experience integrating enterprise apps (ERP, ServiceNow, Workday, factory systems like MES).
Qualifications
- Bachelor's/Master's in CS/Engineering or equivalent.
- Strong stakeholder management and ability to drive decisions across teams.