Job Description
We are looking for a skilled Data Engineer with strong experience in Apache Spark to design, build, and optimize large-scale data pipelines in a distributed environment. The ideal candidate has hands-on expertise in modern data engineering practices, cloud platforms, and scalable data processing frameworks.
Key Responsibilities
- Design, develop, and maintain ETL/ELT pipelines using Apache Spark (batch and/or streaming); a representative sketch follows this list.
- Build and optimize distributed data processing workflows on Spark (PySpark/Scala/Java).
- Work with cloud-based data ecosystems (AWS, GCP, or Azure) to develop scalable data solutions.
- Collaborate with data scientists, analysts, and backend engineers to deliver reliable, high-quality data products.
- Implement and maintain data quality checks, monitoring, and alerting for data pipelines.
- Optimize Spark jobs for performance, cost efficiency, and scalability.
- Manage and model data in data lakes, data warehouses, and/or structured storage systems.
- Contribute to data architecture design, including schema modeling, partitioning, and data lifecycle management.
- Automate infrastructure and pipeline deployments using CI/CD and IaC frameworks.
- Ensure compliance with data governance, security, and privacy standards.
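To make the day-to-day work concrete, here is a minimal PySpark sketch of the kind of batch ETL pipeline described above, with a simple data-quality gate and a partitioned write. The bucket paths, column names, and the 5% threshold are illustrative assumptions, not a description of our actual stack.

```python
# Hypothetical batch ETL sketch: paths, columns, and thresholds are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-events-etl").getOrCreate()

# Extract: read raw JSON events from an (illustrative) data-lake path.
raw = spark.read.json("s3a://example-bucket/raw/events/")

# Transform: drop rows missing a key field and derive a partition column.
cleaned = (
    raw.filter(F.col("event_id").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Data-quality gate: abort the job if more than 5% of input rows were dropped.
raw_count, clean_count = raw.count(), cleaned.count()
if raw_count > 0 and clean_count / raw_count < 0.95:
    raise ValueError(f"dropped {raw_count - clean_count} of {raw_count} rows")

# Load: partitioned Parquet lets downstream readers prune by date.
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-bucket/curated/events/"
)

spark.stop()
```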
Required Skills & Qualifications
- Strong hands-on experience with Apache Spark (batch or streaming).
- Proficiency in Python, Scala, or Java for data processing.
- Experience with at least one cloud platform (AWS, GCP, or Azure).
- Solid understanding of distributed systems, data partitioning, and performance tuning (see the sketch after this list).
- Hands-on experience with data lake technologies (e.g., S3, GCS, Azure Data Lake).
- Experience with relational databases and SQL.
- Familiarity with CI/CD workflows and version control (Git).
- Experience with Infrastructure-as-Code tools (Terraform, CloudFormation, etc.) is a plus.
- Knowledge of workflow orchestration tools such as Airflow, Dagster, or Prefect.
- Strong problem-solving skills and the ability to work in cross-functional teams.
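As a small illustration of the partitioning and performance-tuning skills listed above, the sketch below broadcasts a small dimension table to avoid shuffling a large fact table, then coalesces output partitions before writing. The table names, paths, and relative sizes are hypothetical.

```python
# Hypothetical join-tuning sketch: table names, paths, and sizes are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-tuning-sketch").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/curated/events/")  # large fact table
users = spark.read.parquet("s3a://example-bucket/curated/users/")    # small dimension

# Broadcasting the small side replaces a full shuffle join with a map-side join.
enriched = events.join(F.broadcast(users), on="user_id", how="left")

# Coalescing before the write keeps the job from emitting thousands of tiny files.
enriched.coalesce(64).write.mode("overwrite").parquet(
    "s3a://example-bucket/marts/events_enriched/"
)

spark.stop()
```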
Preferred Qualifications
- Experience with Spark on Kubernetes, Databricks, EMR, or Dataproc.
- Knowledge of streaming technologies (Kafka, Pub/Sub, Kinesis).
- Familiarity with Delta Lake, Iceberg, or Hudi.
- Background in data modeling (ETL/ELT design, star/snowflake schemas).
- Experience with real-time and near-real-time data pipelines.