
Logic Hire Solutions Ltd

Senior Data Engineer - Cloud Migration & Platform Architecture (GCP/AWS)

5-7 Years

Job Description

Job Title: Senior Data Engineer - Cloud Migration & Platform Architecture (GCP/AWS)

Location: [Remote / Hybrid / Specific Location]

Experience Level: 5+ Years (Mid-Senior to Senior)

Position Overview

We are undergoing a fundamental shift in our data infrastructure, moving away from legacy on-premises Cloudera (CDH/HDP) environments toward a modern, hybrid-cloud data mesh architecture spanning Google Cloud Platform (GCP) and Amazon Web Services (AWS).

We are looking for a Senior Data Engineer who does not just use these platforms but has built them from the ground up. The ideal candidate has the scars and medals from leading large-scale migration projects—specifically, the re-platforming of Hive/Impala workloads and HDFS datasets to cloud-native storage and compute (Snowflake/Databricks). You will be responsible for writing high-performance Python code, optimizing Spark jobs that process petabytes of data, and ensuring our real-time streaming infrastructure (Kafka/PubSub) is rock-solid.

Detailed Tech Stack & Environment

  • Languages: Python 3.9+ (Advanced: Decorators, Generators, Multiprocessing, Pydantic, Poetry), PySpark, SQL (ANSI & BigQuery dialect), Scala (maintenance only)
  • Compute & Processing: Apache Spark 3.x (DataFrames, Structured Streaming), Databricks (Delta Live Tables, Photon, Unity Catalog), GCP Dataproc (Serverless & Cluster Mode), AWS EMR (on EC2 & EKS)
  • Streaming & Messaging: Apache Kafka (Schema Registry, Avro), GCP Pub/Sub, AWS Kinesis Data Streams, Debezium (CDC)
  • Storage & Warehouse: Snowflake (Snowpipe Streaming, Streams & Tasks, Time Travel), GCP BigQuery (BI Engine, Materialized Views), AWS S3, GCP Cloud Storage, Delta Lake / Apache Iceberg
  • Orchestration & Ops: Apache Airflow 2.x (GCP Cloud Composer, AWS MWAA), dbt Core/Cloud, Terraform (IaC), Docker, GitHub Actions / Jenkins
  • Legacy (Migration Source): Cloudera CDH/HDP, Apache Hive, Apache Impala, Oozie, HDFS

Detailed Must-Have Responsibilities & Technical Expectations

1. Core Software Engineering in Python (Deep Dive)

  • Requirement: 5+ years of professional experience in software engineering.
  • Detailed Expectations:
    • Code Quality: You treat data pipelines as software products. You enforce unit testing (PyTest), integration testing, and CI/CD for all Spark jobs.
    • Optimization: You can debug JVM garbage-collection issues in Spark UDFs and refactor them into native Spark SQL functions or vectorized Pandas UDFs to achieve 10x performance improvements (see the sketch after this list).
    • Modularity: You design reusable Python packages and libraries for data ingestion, validation (Great Expectations), and logging that are shared across GCP and AWS environments.
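
To make the UDF-refactoring expectation concrete, here is a minimal sketch of a vectorized Pandas UDF in PySpark 3.x (requires PyArrow); the DataFrame, column names, and conversion rate are invented for illustration:

```python
# Sketch: replacing a row-at-a-time Python UDF with a vectorized Pandas UDF.
# Assumptions: PySpark 3.x with pyarrow installed; toy data and rate below.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-refactor-sketch").getOrCreate()

@pandas_udf(DoubleType())
def to_usd(amount: pd.Series) -> pd.Series:
    # Whole-Series arithmetic on an Arrow batch: no per-row pickling,
    # which is where naive Python UDFs lose most of their time.
    return amount * 1.08  # assumption: fixed EUR->USD rate for the example

df = spark.createDataFrame([(1, 10.0), (2, 12.5)], ["order_id", "amount_eur"])
df.withColumn("amount_usd", to_usd("amount_eur")).show()
```

Because the function receives whole pandas Series backed by Arrow batches, Spark avoids serializing individual rows through the Python worker, which is typically where the "10x" class of speedups comes from.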


2. Large-Scale Migration Expertise (The Proven Requirement - Expanded)
  • Requirement: Proven, hands-on experience migrating from Cloudera (CDH/HDP) to Snowflake or Databricks.
  • Detailed Scope of Work You Will Own:
    • Legacy Decommissioning: You will analyze existing Hive Metastore schemas and Impala query patterns to design a migration strategy to BigQuery or Snowflake.
    • Data Transfer: You will architect and execute the transfer of hundreds of terabytes from HDFS to GCS/S3. This includes:
      • Utilizing DistCp for initial bulk copy.
      • Implementing incremental sync strategies using HDFS Snapshots and cloud object versioning.
      • Converting Hive table formats (ORC/Parquet) to optimal cloud storage layouts (partitioned Parquet/Delta).
    • Workflow Refactoring: You will reverse-engineer complex Oozie workflows and rebuild them as robust, idempotent DAGs in Apache Airflow (a sketch follows this list).
    • Cloud-Native Feature Adoption: You will replace batch INSERT OVERWRITE jobs with Snowpipe Streaming or Databricks Auto Loader to reduce latency from hours to seconds.
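
To illustrate the workflow-refactoring point above, a minimal idempotent Airflow DAG sketch (Airflow 2.4+ syntax); the DAG id, schedule, and load logic are placeholders rather than a real pipeline:

```python
# Sketch: an Oozie coordinator rebuilt as an idempotent Airflow DAG.
# Assumptions: Airflow 2.4+ (the `schedule` argument); illustrative names.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_partition(ds: str, **_):
    # Idempotent by construction: the task rebuilds exactly one date
    # partition, so retries and backfills for the same `ds` are safe.
    print(f"Rebuilding partition dt={ds} in the target table")

with DAG(
    dag_id="hive_to_cloud_daily_load",  # assumption: illustrative id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="load_partition", python_callable=load_partition)
```

The key design point is that each run is keyed to its logical date (`ds`), so re-running a failed day overwrites the same partition instead of duplicating data.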
3. Greenfield Platform Architecture & Integration
  • Requirement: Deep technical expertise in building a high-performance data platform from scratch.
  • Detailed Architecture Deliverables:
    • Reverse ETL: You will build secure pipelines from Snowflake/BigQuery back into operational systems (Salesforce, HubSpot, Postgres) using Apache Beam (Dataflow) or AWS Lambda with custom retry logic and rate limiting.
    • CDC Implementation: You will design a Change Data Capture pipeline using Kafka Connect (Debezium) -> Pub/Sub -> Dataflow -> BigQuery, ensuring exactly-once semantics and handling schema evolution seamlessly.
    • API Ingestion: You will build a serverless ingestion framework on GCP Cloud Functions (Python) that pulls data from third-party REST APIs, handles pagination and authentication, and lands raw JSON in Cloud Storage partitioned by date (sketched below).
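
As a rough sketch of that serverless ingestion pattern, assuming the standard GCP HTTP function signature plus the requests and google-cloud-storage libraries; the API URL, bucket name, auth token, and pagination scheme are all placeholders:

```python
# Sketch: HTTP-triggered Cloud Function that pages through a REST API and
# lands raw JSON in GCS, partitioned by date. All names are assumptions.
import json
from datetime import datetime, timezone

import requests
from google.cloud import storage

API_URL = "https://api.example.com/v1/orders"  # placeholder endpoint
BUCKET = "raw-landing-zone"                    # placeholder bucket

def ingest(request):
    session = requests.Session()
    session.headers["Authorization"] = "Bearer <token>"  # real auth omitted
    records, page = [], 1
    while True:  # assumed page-number pagination; real APIs vary
        resp = session.get(API_URL, params={"page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        records.extend(batch)
        page += 1
    dt = datetime.now(timezone.utc).strftime("%Y-%m-%d")  # date partition
    blob = storage.Client().bucket(BUCKET).blob(f"orders/dt={dt}/dump.json")
    blob.upload_from_string(json.dumps(records), content_type="application/json")
    return f"landed {len(records)} records", 200
```

A production version would add retry/backoff and secret management; the sketch only shows the paginate-and-land shape.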
4. Data Modeling & Distributed Systems Expertise
  • Requirement: Expertise in structured/unstructured data and distributed systems.
  • Detailed Expectations:
    • Modeling: You can explain the trade-offs between Kimball Star Schema vs. Data Vault 2.0 vs. One Big Table (OBT) and implement the correct approach for specific analytical use cases in dbt.
    • Spark Tuning: You are comfortable reading the Spark UI, diagnosing data skew (salting keys), optimizing shuffle partitions, and managing broadcast joins to prevent executor OOM errors on Dataproc and EMR (see the salting sketch after this list).
    • Streaming Architecture: You understand the implications of Event Sourcing and can tune Kafka retention policies and Pub/Sub subscription backlogs to ensure data durability during consumer downtime.
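
To ground the data-skew expectation, a minimal key-salting sketch in PySpark; the skewed dataset and bucket count are invented for the example:

```python
# Sketch: salting a hot join key so it spreads across shuffle partitions.
# Assumptions: toy data; SALT_BUCKETS would be tuned to observed skew.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()
SALT_BUCKETS = 16

facts = spark.range(0, 1_000_000).withColumn("key", F.lit("hot"))  # skewed
dims = spark.createDataFrame([("hot", "dim-attrs")], ["key", "attr"])

# Fact side: assign each row a random salt bucket.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
# Dimension side: replicate each row once per bucket so every salt matches.
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)
joined = salted_facts.join(salted_dims, on=["key", "salt"]).drop("salt")
print(joined.count())
```

The hot key's rows now land in up to 16 shuffle partitions instead of one, which is what relieves the single straggling executor.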
5. Cloud Platform Proficiency (GCP Focus)
  • Requirement: Experience with GCP (BigQuery, Dataflow, Pub/Sub, Dataproc).
  • Detailed GCP Operations:
    • BigQuery: You will enforce cost governance using Table Partitioning and Clustering to minimize query bytes processed. You will utilize BigQuery Omni if cross-cloud analytics on AWS data is required.
    • Dataflow: You will write Apache Beam pipelines in Python that handle late-arriving data using windowing and watermarks (see the Beam sketch after this list).
    • Networking: You understand VPC Service Controls and Private Service Connect and can ensure data never traverses the public internet.
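
A minimal Apache Beam (Python) sketch of the windowing-and-watermarks expectation; the bounded in-memory source and hand-set timestamps stand in for a real Pub/Sub stream:

```python
# Sketch: fixed event-time windows with a late-firing trigger and allowed
# lateness, so late-arriving elements still update their window's result.
import apache_beam as beam
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark,
)

with beam.Pipeline() as p:
    (
        p
        | "Events" >> beam.Create([
            beam.window.TimestampedValue(("user-1", 1), 10),
            beam.window.TimestampedValue(("user-2", 1), 20),
            beam.window.TimestampedValue(("user-1", 1), 55),
        ])
        | "Window" >> beam.WindowInto(
            beam.window.FixedWindows(60),  # 1-minute event-time windows
            trigger=AfterWatermark(late=AfterProcessingTime(30)),
            allowed_lateness=600,          # keep windows open 10 extra min
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

On a bounded source the late firing never actually triggers; the point of the sketch is the shape of configuration a streaming Dataflow job would carry.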
Detailed Nice-to-Have Qualifications

  • Databricks & Unity Catalog: Experience implementing fine-grained access control and lineage using Unity Catalog in a multi-workspace environment.
  • NoSQL & Graph:
    • Redis: Experience implementing Redis as a distributed cache for lookup tables in Spark streaming jobs to reduce latency on joins against BigQuery (a sketch follows this list).
    • Neo4j: Knowledge of building identity resolution graphs or supply chain dependencies using Cypher queries.
  • Infrastructure as Code (Terraform): Ability to write Terraform modules to provision GCP Service Accounts, BigQuery Datasets, IAM Bindings, and AWS Glue Catalogs in a repeatable manner.
  • Machine Learning Integration: Experience building Feature Stores on Databricks or using BigQuery ML for batch inference directly within the warehouse.
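
As a sketch of the Redis lookup pattern above, assuming redis-py; for brevity it enriches a static DataFrame through mapPartitions rather than a live stream, and the host, hash key, and schema are invented:

```python
# Sketch: per-partition Redis connections used as a distributed lookup
# cache, avoiding a re-join against the warehouse on every micro-batch.
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-lookup-sketch").getOrCreate()

def enrich_partition(rows):
    # One connection per partition, reused for every row it processes.
    r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)
    for row in rows:
        segment = r.hget("customer:segments", row["customer_id"]) or "unknown"
        yield (row["customer_id"], row["amount"], segment)

events = spark.createDataFrame(
    [("c-1", 9.99), ("c-2", 50.0)], ["customer_id", "amount"]
)
enriched = events.rdd.mapPartitions(enrich_partition).toDF(
    ["customer_id", "amount", "segment"]
)
enriched.show()
```

In a real Structured Streaming job the same function would run inside foreachBatch, with the connection ideally cached per executor rather than per partition.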
Skills: GCS, AWS, GCP BigQuery, Dataflow, Docker, Cloud, GCP, Cloudera, Data


Job ID: 146496087
