Job Description
We are looking for a skilled Data Engineer with strong experience in Apache Spark to design, build, and optimize large-scale data pipelines in a distributed environment. The ideal candidate has hands-on expertise in modern data engineering practices, cloud platforms, and scalable data processing frameworks.
Key Responsibilities
- Design, develop, and maintain ETL/ELT pipelines using Apache Spark (batch and/or streaming); a representative sketch follows this list.
- Build and optimize distributed data processing workflows on Spark (PySpark/Scala/Java).
- Work with cloud-based data ecosystems (AWS, GCP, or Azure) to develop scalable data solutions.
- Collaborate with data scientists, analysts, and backend engineers to deliver reliable, high-quality data products.
- Implement and maintain data quality checks, monitoring, and alerting for data pipelines.
- Optimize Spark jobs for performance, cost efficiency, and scalability.
- Manage and model data in data lakes, data warehouses, and/or structured storage systems.
- Contribute to data architecture design, including schema modeling, partitioning, and data lifecycle management.
- Automate infrastructure and pipeline deployments using CI/CD and IaC frameworks.
- Ensure compliance with data governance, security, and privacy standards.
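To make the day-to-day work concrete, here is a minimal PySpark sketch of the kind of batch ETL pipeline described above, with a simple data-quality gate and a partitioned write. The bucket paths, column names, and the 5% threshold are illustrative assumptions, not a description of our actual stack.

```python
# Hypothetical batch ETL sketch: paths, columns, and thresholds are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-events-etl").getOrCreate()

# Extract: read raw JSON events from an (illustrative) data-lake path.
raw = spark.read.json("s3a://example-bucket/raw/events/")

# Transform: drop rows missing a key field and derive a partition column.
cleaned = (
    raw.filter(F.col("event_id").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Data-quality gate: abort the job if more than 5% of input rows were dropped.
raw_count, clean_count = raw.count(), cleaned.count()
if raw_count > 0 and clean_count / raw_count < 0.95:
    raise ValueError(f"dropped {raw_count - clean_count} of {raw_count} rows")

# Load: partitioned Parquet lets downstream readers prune by date.
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-bucket/curated/events/"
)

spark.stop()
```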
Required Skills & Qualifications
- Strong hands-on experience with Apache Spark (batch or streaming).
- Proficiency in Python, Scala, or Java for data processing.
- Experience with at least one cloud platform (AWS, GCP, or Azure).
- Solid understanding of distributed systems, data partitioning, and performance tuning (see the sketch after this list).
- Hands-on experience with data lake technologies (e.g., S3, GCS, Azure Data Lake).
- Experience with relational databases and SQL.
- Familiarity with CI/CD workflows and version control (Git).
- Experience with Infrastructure-as-Code tools (Terraform, CloudFormation, etc.) is a plus.
- Knowledge of workflow orchestration tools such as Airflow, Dagster, or Prefect.
- Strong problem-solving skills and the ability to work in cross-functional teams.
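As a small illustration of the partitioning and performance-tuning skills listed above, the sketch below broadcasts a small dimension table to avoid shuffling a large fact table, then coalesces output partitions before writing. The table names, paths, and relative sizes are hypothetical.

```python
# Hypothetical join-tuning sketch: table names, paths, and sizes are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-tuning-sketch").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/curated/events/")  # large fact table
users = spark.read.parquet("s3a://example-bucket/curated/users/")    # small dimension

# Broadcasting the small side replaces a full shuffle join with a map-side join.
enriched = events.join(F.broadcast(users), on="user_id", how="left")

# Coalescing before the write keeps the job from emitting thousands of tiny files.
enriched.coalesce(64).write.mode("overwrite").parquet(
    "s3a://example-bucket/marts/events_enriched/"
)

spark.stop()
```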
Preferred Qualifications
- Experience with Spark on Kubernetes, Databricks, EMR, or Dataproc.
- Knowledge of streaming technologies (Kafka, Pub/Sub, Kinesis).
- Familiarity with Delta Lake, Iceberg, or Hudi.
- Background in data modeling (ETL/ELT design, star/snowflake schemas).
- Experience with real-time and near-real-time data pipelines.