Role Overview:
We're looking for a Junior Data Engineer to join our Data Platform team. You'll design and maintain scalable data pipelines and architectures using AWS services, enabling reliable data movement, transformation, and analytics.
You'll collaborate with analytics, product, and engineering teams to support reporting, dashboards, and insights for millions of students and schools.
Key Responsibilities:
- Design, build, and maintain ETL/ELT pipelines for large-scale data ingestion, transformation, and loading.
- Develop and optimize Spark and PySpark jobs for batch and real-time data processing.
- Work with AWS services such as S3, Glue, Lambda, Redshift, Athena, and EMR to manage the data ecosystem.
- Support the design and implementation of data lake and data warehouse architectures.
- Implement data validation, partitioning, and schema management for efficient query performance.
- Collaborate with data analysts and BI teams to ensure data availability and consistency.
- Maintain data lineage and metadata, and ensure data quality and governance.
- Implement monitoring and alerting for data ingestion and transformation pipelines.
- Use Git and CI/CD tools to manage code and automate deployment of data workflows.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, Data Engineering, or a related field.
- 1-3 years of hands-on experience in data engineering, data pipeline development, or cloud-based data systems.
- Strong knowledge of SQL and experience with Python or PySpark.
- Practical experience with the AWS data stack (S3, Glue, Lambda, Redshift, Athena, EMR, Step Functions, etc.).
- Understanding of data lake architecture, ETL/ELT frameworks, and data warehousing concepts.
- Familiarity with Delta Lake, Spark SQL, or similar big data frameworks.
- Good understanding of data modeling, partitioning, and performance tuning.
- Excellent analytical, troubleshooting, and collaboration skills.
Good to Have / Plus:
- Exposure to GCP (BigQuery, Dataflow, Cloud Storage) or Azure (Data Factory, Synapse, ADLS, Databricks).
- Experience with Databricks for scalable data processing and Delta Lake management.
- Knowledge of PostgreSQL, MySQL, or NoSQL databases.
- Familiarity with Airflow, Step Functions, or other orchestration tools.
- Understanding of DevOps practices, CI/CD pipelines, and infrastructure automation.
- Experience working in an EdTech or public data ecosystem is an advantage.