
Logic Hire Solutions Ltd

Senior Data Engineer - Cloud Migration & Platform Architecture (GCP/AWS)

5-7 Years

Job Description

Job Title: Senior Data Engineer - Cloud Migration & Platform Architecture (GCP/AWS)

Location: [Remote / Hybrid / Specific Location]

Experience Level: 5+ Years (Mid-Senior to Senior)

Position Overview

We are undergoing a fundamental shift in our data infrastructure, moving away from legacy on-premises Cloudera (CDH/HDP) environments toward a modern, hybrid-cloud data mesh architecture spanning Google Cloud Platform (GCP) and Amazon Web Services (AWS).

We are looking for a Senior Data Engineer who does not just use these platforms but has built them from the ground up. The ideal candidate has the scars and medals from leading large-scale migration projects—specifically, the re-platforming of Hive/Impala workloads and HDFS datasets to cloud-native storage and compute (Snowflake/Databricks). You will be responsible for writing high-performance Python code, optimizing Spark jobs that process petabytes of data, and ensuring our real-time streaming infrastructure (Kafka/PubSub) is rock-solid.

Detailed Tech Stack & Environment

  • Languages: Python 3.9+ (Advanced: Decorators, Generators, Multiprocessing, Pydantic, Poetry), PySpark, SQL (ANSI & BigQuery dialect), Scala (maintenance only)
  • Compute & Processing: Apache Spark 3.x (DataFrames, Structured Streaming), Databricks (Delta Live Tables, Photon, Unity Catalog), GCP Dataproc (Serverless & Cluster Mode), AWS EMR (on EC2 & EKS)
  • Streaming & Messaging: Apache Kafka (Schema Registry, Avro), GCP Pub/Sub, AWS Kinesis Data Streams, Debezium (CDC)
  • Storage & Warehouse: Snowflake (Snowpipe Streaming, Streams & Tasks, Time Travel), GCP BigQuery (BI Engine, Materialized Views), AWS S3, GCP Cloud Storage, Delta Lake / Apache Iceberg
  • Orchestration & Ops: Apache Airflow 2.x (GCP Cloud Composer, AWS MWAA), dbt Core/Cloud, Terraform (IaC), Docker, GitHub Actions / Jenkins
  • Legacy (Migration Source): Cloudera CDH/HDP, Apache Hive, Apache Impala, Oozie, HDFS

Detailed Must-Have Responsibilities & Technical Expectations

1. Core Software Engineering in Python (Deep Dive)

  • Requirement: 5+ years of professional experience in software engineering.
  • Detailed Expectations:
    • Code Quality: You treat data pipelines as software products. You enforce unit testing (PyTest), integration testing, and CI/CD for all Spark jobs.
    • Optimization: You can debug JVM garbage-collection issues in Spark UDFs and refactor them into native Spark SQL functions or vectorized Pandas UDFs to achieve 10x performance improvements (see the sketch after this list).
    • Modularity: You design reusable Python packages and libraries for data ingestion, validation (Great Expectations), and logging that are shared across GCP and AWS environments.
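
To make the UDF-refactoring expectation concrete, here is a minimal sketch of a vectorized Pandas UDF in PySpark 3.x (requires PyArrow); the DataFrame, column names, and conversion rate are invented for illustration:

```python
# Sketch: replacing a row-at-a-time Python UDF with a vectorized Pandas UDF.
# Assumptions: PySpark 3.x with pyarrow installed; toy data and rate below.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-refactor-sketch").getOrCreate()

@pandas_udf(DoubleType())
def to_usd(amount: pd.Series) -> pd.Series:
    # Whole-Series arithmetic on an Arrow batch: no per-row pickling,
    # which is where naive Python UDFs lose most of their time.
    return amount * 1.08  # assumption: fixed EUR->USD rate for the example

df = spark.createDataFrame([(1, 10.0), (2, 12.5)], ["order_id", "amount_eur"])
df.withColumn("amount_usd", to_usd("amount_eur")).show()
```

Because the function receives whole pandas Series backed by Arrow batches, Spark avoids serializing individual rows through the Python worker, which is typically where the "10x" class of speedups comes from.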


2. Large-Scale Migration Expertise (The Proven Requirement - Expanded)
  • Requirement: Proven, hands-on experience migrating from Cloudera (CDH/HDP) to Snowflake or Databricks.
  • Detailed Scope of Work You Will Own:
    • Legacy Decommissioning: You will analyze existing Hive Metastore schemas and Impala query patterns to design a migration strategy to BigQuery or Snowflake.
    • Data Transfer: You will architect and execute the transfer of hundreds of terabytes from HDFS to GCS/S3. This includes:
      • Utilizing DistCp for initial bulk copy.
      • Implementing incremental sync strategies using HDFS Snapshots and cloud object versioning.
      • Converting Hive table formats (ORC/Parquet) to optimal cloud storage layouts (partitioned Parquet/Delta).
    • Workflow Refactoring: You will reverse-engineer complex Oozie workflows and rebuild them as robust, idempotent DAGs in Apache Airflow (a sketch follows this list).
    • Cloud-Native Feature Adoption: You will replace batch INSERT OVERWRITE jobs with Snowpipe Streaming or Databricks Auto Loader to reduce latency from hours to seconds.
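
To illustrate the workflow-refactoring point above, a minimal idempotent Airflow DAG sketch (Airflow 2.4+ syntax); the DAG id, schedule, and load logic are placeholders rather than a real pipeline:

```python
# Sketch: an Oozie coordinator rebuilt as an idempotent Airflow DAG.
# Assumptions: Airflow 2.4+ (the `schedule` argument); illustrative names.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_partition(ds: str, **_):
    # Idempotent by construction: the task rebuilds exactly one date
    # partition, so retries and backfills for the same `ds` are safe.
    print(f"Rebuilding partition dt={ds} in the target table")

with DAG(
    dag_id="hive_to_cloud_daily_load",  # assumption: illustrative id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="load_partition", python_callable=load_partition)
```

The key design point is that each run is keyed to its logical date (`ds`), so re-running a failed day overwrites the same partition instead of duplicating data.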
3. Greenfield Platform Architecture & Integration
  • Requirement: Deep technical expertise in building a high-performance data platform from scratch.
  • Detailed Architecture Deliverables:
    • Reverse ETL: You will build secure pipelines from Snowflake/BigQuery back into operational systems (Salesforce, HubSpot, Postgres) using Apache Beam (Dataflow) or AWS Lambda with custom retry logic and rate limiting.
    • CDC Implementation: You will design a Change Data Capture pipeline using Kafka Connect (Debezium) -> Pub/Sub -> Dataflow -> BigQuery, ensuring exactly-once semantics and handling schema evolution seamlessly.
    • API Ingestion: You will build a serverless ingestion framework on GCP Cloud Functions (Python) that pulls data from third-party REST APIs, handles pagination and authentication, and lands raw JSON in Cloud Storage partitioned by date (sketched below).
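
As a rough sketch of that serverless ingestion pattern, assuming the standard GCP HTTP function signature plus the requests and google-cloud-storage libraries; the API URL, bucket name, auth token, and pagination scheme are all placeholders:

```python
# Sketch: HTTP-triggered Cloud Function that pages through a REST API and
# lands raw JSON in GCS, partitioned by date. All names are assumptions.
import json
from datetime import datetime, timezone

import requests
from google.cloud import storage

API_URL = "https://api.example.com/v1/orders"  # placeholder endpoint
BUCKET = "raw-landing-zone"                    # placeholder bucket

def ingest(request):
    session = requests.Session()
    session.headers["Authorization"] = "Bearer <token>"  # real auth omitted
    records, page = [], 1
    while True:  # assumed page-number pagination; real APIs vary
        resp = session.get(API_URL, params={"page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        records.extend(batch)
        page += 1
    dt = datetime.now(timezone.utc).strftime("%Y-%m-%d")  # date partition
    blob = storage.Client().bucket(BUCKET).blob(f"orders/dt={dt}/dump.json")
    blob.upload_from_string(json.dumps(records), content_type="application/json")
    return f"landed {len(records)} records", 200
```

A production version would add retry/backoff and secret management; the sketch only shows the paginate-and-land shape.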
4. Data Modeling & Distributed Systems Expertise
  • Requirement: Expertise in structured/unstructured data and distributed systems.
  • Detailed Expectations:
    • Modeling: You can explain the trade-offs between Kimball Star Schema vs. Data Vault 2.0 vs. One Big Table (OBT) and implement the correct approach for specific analytical use cases in dbt.
    • Spark Tuning: You are comfortable reading the Spark UI, diagnosing data skew (salting keys), optimizing shuffle partitions, and managing broadcast joins to prevent executor OOM errors on Dataproc and EMR (see the salting sketch after this list).
    • Streaming Architecture: You understand the implications of Event Sourcing and can tune Kafka retention policies and Pub/Sub subscription backlogs to ensure data durability during consumer downtime.
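
To ground the data-skew expectation, a minimal key-salting sketch in PySpark; the skewed dataset and bucket count are invented for the example:

```python
# Sketch: salting a hot join key so it spreads across shuffle partitions.
# Assumptions: toy data; SALT_BUCKETS would be tuned to observed skew.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()
SALT_BUCKETS = 16

facts = spark.range(0, 1_000_000).withColumn("key", F.lit("hot"))  # skewed
dims = spark.createDataFrame([("hot", "dim-attrs")], ["key", "attr"])

# Fact side: assign each row a random salt bucket.
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
# Dimension side: replicate each row once per bucket so every salt matches.
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)
joined = salted_facts.join(salted_dims, on=["key", "salt"]).drop("salt")
print(joined.count())
```

The hot key's rows now land in up to 16 shuffle partitions instead of one, which is what relieves the single straggling executor.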
5. Cloud Platform Proficiency (GCP Focus)
  • Requirement: Experience with GCP (BigQuery, Dataflow, Pub/Sub, Dataproc).
  • Detailed GCP Operations:
    • BigQuery: You will enforce cost governance using Table Partitioning and Clustering to minimize query bytes processed. You will utilize BigQuery Omni if cross-cloud analytics on AWS data is required.
    • Dataflow: You will write Apache Beam pipelines in Python that handle late-arriving data using windowing and watermarks (see the Beam sketch after this list).
    • Networking: You understand VPC Service Controls and Private Service Connect and can ensure data never traverses the public internet.
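
A minimal Apache Beam (Python) sketch of the windowing-and-watermarks expectation; the bounded in-memory source and hand-set timestamps stand in for a real Pub/Sub stream:

```python
# Sketch: fixed event-time windows with a late-firing trigger and allowed
# lateness, so late-arriving elements still update their window's result.
import apache_beam as beam
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark,
)

with beam.Pipeline() as p:
    (
        p
        | "Events" >> beam.Create([
            beam.window.TimestampedValue(("user-1", 1), 10),
            beam.window.TimestampedValue(("user-2", 1), 20),
            beam.window.TimestampedValue(("user-1", 1), 55),
        ])
        | "Window" >> beam.WindowInto(
            beam.window.FixedWindows(60),  # 1-minute event-time windows
            trigger=AfterWatermark(late=AfterProcessingTime(30)),
            allowed_lateness=600,          # keep windows open 10 extra min
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

On a bounded source the late firing never actually triggers; the point of the sketch is the shape of configuration a streaming Dataflow job would carry.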
Detailed Nice-to-Have Qualifications

  • Databricks & Unity Catalog: Experience implementing fine-grained access control and lineage using Unity Catalog in a multi-workspace environment.
  • NoSQL & Graph:
    • Redis: Experience implementing Redis as a distributed cache for lookup tables in Spark streaming jobs to reduce latency on joins against BigQuery (a sketch follows this list).
    • Neo4j: Knowledge of building identity resolution graphs or supply chain dependencies using Cypher queries.
  • Infrastructure as Code (Terraform): Ability to write Terraform modules to provision GCP Service Accounts, BigQuery Datasets, IAM Bindings, and AWS Glue Catalogs in a repeatable manner.
  • Machine Learning Integration: Experience building Feature Stores on Databricks or using BigQuery ML for batch inference directly within the warehouse.
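
As a sketch of the Redis lookup pattern above, assuming redis-py; for brevity it enriches a static DataFrame through mapPartitions rather than a live stream, and the host, hash key, and schema are invented:

```python
# Sketch: per-partition Redis connections used as a distributed lookup
# cache, avoiding a re-join against the warehouse on every micro-batch.
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-lookup-sketch").getOrCreate()

def enrich_partition(rows):
    # One connection per partition, reused for every row it processes.
    r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)
    for row in rows:
        segment = r.hget("customer:segments", row["customer_id"]) or "unknown"
        yield (row["customer_id"], row["amount"], segment)

events = spark.createDataFrame(
    [("c-1", 9.99), ("c-2", 50.0)], ["customer_id", "amount"]
)
enriched = events.rdd.mapPartitions(enrich_partition).toDF(
    ["customer_id", "amount", "segment"]
)
enriched.show()
```

In a real Structured Streaming job the same function would run inside foreachBatch, with the connection ideally cached per executor rather than per partition.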
Skills: GCS, AWS, GCP BigQuery, Dataflow, Docker, Cloud, GCP, Cloudera, Data


Job ID: 146496087
