Job Description
Project Role : Custom Software Engineer
Project Role Description : Develop custom software solutions to design, code, and enhance components across systems or applications. Use modern frameworks and agile practices to deliver scalable, high-performing solutions tailored to specific business needs.
Must have skills : PySpark
Good to have skills : NA
Minimum 3 year(s) of experience is required
Educational Qualification : 15 years of full-time education
We are seeking a skilled Data Engineer to design, build, and optimize scalable data pipelines on the Enterprise Data Lake (EDL) running on Cloudera (CDP) on AWS. The role involves working with the Hadoop ecosystem, building PySpark-based data processing pipelines, orchestrating workflows using Oozie and Control-M, integrating with AWS services (S3, IAM, EC2), and delivering secure, reliable, cloud-ready data solutions.
Key Responsibilities
Data Engineering & Platform Development
Build scalable data ingestion and processing pipelines using Cloudera CDP on AWS, Hadoop (HDFS, Hive, YARN), PySpark/Spark SQL, and AWS S3.
Design data flows between HDFS and S3 using DistCp, Spark read/write, file compaction, archival, and lifecycle policies.
Develop and optimize PySpark jobs for performance, partitioning, caching, and YARN resource allocation.
Build FastAPI microservices for data access, metadata, and operational endpoints.
Design scalable data models across Raw, Presentation, and Data Provisioning layers.
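As a flavor of the file-compaction work mentioned above, here is a minimal sketch of planning compaction batches for small files before a Spark rewrite job. The greedy packing strategy, the byte sizes, and the function name are illustrative assumptions, not a prescribed implementation.

```python
# Sketch: grouping small files into compaction batches close to a target
# output size (e.g. ~128 MB to match a typical HDFS block size).
# The greedy largest-first packing below is one simple heuristic.
from typing import Dict, List

def plan_compaction_batches(file_sizes: Dict[str, int], target_size: int) -> List[List[str]]:
    """Greedily group files into batches whose total size stays near target_size."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Largest-first packing tends to keep the batch count low.
    for path, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be read and rewritten as a single larger file by a Spark job.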
Workflow Orchestration & Automation
Develop and manage Apache Oozie workflows and coordinators including triggers, SLAs, HDFS/S3 path management, and kill/recovery actions.
Implement enterprise scheduling using Control-M with dependencies, calendars, alerts, SLAs, and automated retries.
Automate operational tasks using Shell/Bash scripting for monitoring, file operations, HDFS maintenance, and backfill processes.
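For the backfill processes mentioned above, a common building block is enumerating the partition dates a re-run must cover. A minimal sketch, assuming daily ISO-formatted (YYYY-MM-DD) partitions, which is an assumption about the layout rather than a stated requirement:

```python
# Sketch: generating the inclusive list of daily partition dates for a
# backfill window, as an Oozie coordinator or Control-M backfill job
# might iterate over missed run dates.
from datetime import date, timedelta
from typing import List

def backfill_dates(start: date, end: date) -> List[str]:
    """Inclusive ISO partition dates from start to end."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]
```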
Cloud, Storage & Platform Operations
Work with Cloudera on AWS including Cloudera Manager, Hive on Tez, Spark on YARN, cluster scaling, queues, and capacity planning.
Use AWS services: S3 (storage, versioning, lifecycle, encryption) and optionally IAM, EMR, EC2, Lambda.
Implement data security and governance using Ranger, Kerberos, TLS, audit logs, and data masking/tokenization (nice-to-have).
API Engineering (FastAPI on YARN + Hive)
Build FastAPI REST services interacting with Hive tables (HiveServer2 / Impala / LLAP / JDBC/ODBC) and Spark jobs on YARN.
Develop APIs to submit Spark/PySpark jobs, track job status/logs/YARN application IDs, execute Hive queries, and return dataset results.
Expose metadata, lineage, health checks, and data insights via APIs.
Implement asynchronous APIs for long-running Spark jobs.
Develop FastAPI middleware for authentication, logging, monitoring, retries, and circuit breaking.
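The asynchronous pattern for long-running Spark jobs described above can be sketched in plain asyncio: submission returns a job ID immediately, and a separate status call is polled until completion. The in-memory registry and the sleep standing in for a Spark job are illustrative stand-ins; a real FastAPI service would wrap the same logic in route handlers and track YARN application IDs.

```python
# Sketch: async submit/poll pattern for long-running jobs.
import asyncio
import uuid

JOBS: dict = {}  # illustrative in-memory job registry

async def _run_job(job_id: str, duration: float) -> None:
    JOBS[job_id] = "RUNNING"
    await asyncio.sleep(duration)   # placeholder for the real Spark job
    JOBS[job_id] = "SUCCEEDED"

async def submit_job(duration: float = 0.01) -> str:
    """Return immediately with a job ID; the job keeps running in the background."""
    job_id = uuid.uuid4().hex
    asyncio.create_task(_run_job(job_id, duration))
    await asyncio.sleep(0)          # let the task start and register itself
    return job_id

async def poll_until_done(job_id: str) -> str:
    """Poll the registry until the job reaches a terminal state."""
    while JOBS.get(job_id) not in ("SUCCEEDED", "FAILED"):
        await asyncio.sleep(0.005)
    return JOBS[job_id]
```

The same shape maps onto a POST endpoint that returns the job ID and a GET endpoint that reports status.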
Quality, Monitoring & Reliability
Implement data quality checks including schema validation, null checks, and reconciliation.
Monitor using Spark UI, YARN RM metrics, and Cloudera Manager alerts.
Optimize cluster and application performance, reduce cost, and improve pipeline efficiency.
Perform production issue triage, RCA, and preventive automation.
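The data quality checks listed above (schema validation, null checks, reconciliation) can be sketched on rows represented as plain dicts. The column names and expected schema are illustrative assumptions; in the actual pipeline the same checks would run against a Spark DataFrame schema.

```python
# Sketch: lightweight data-quality checks on dict-shaped rows.
from typing import Any, Dict, List

EXPECTED_SCHEMA = {"id": int, "amount": float}  # illustrative schema

def schema_violations(row: Dict[str, Any]) -> List[str]:
    """Columns that are missing or have the wrong type."""
    return [col for col, typ in EXPECTED_SCHEMA.items()
            if not isinstance(row.get(col), typ)]

def null_check(rows: List[Dict[str, Any]], col: str) -> int:
    """Count rows where col is missing or None."""
    return sum(1 for r in rows if r.get(col) is None)

def reconcile(source_count: int, target_count: int) -> bool:
    """Simple row-count reconciliation between source and target."""
    return source_count == target_count
```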
Required Skills & Experience
3-10+ years of experience in Data Engineering.
Hands-on experience with Cloudera (CDH/CDP), Hadoop, HDFS, Hive/Impala, PySpark/Spark SQL, Oozie workflows/coordinators/SLA, Control-M scheduling, AWS S3 architecture.
Strong Python development for ETL frameworks, exception handling, and testing.
Strong Linux/Shell scripting skills.
Understanding of distributed systems concepts including shuffle, skew handling, broadcast joins, and spill tuning.
Experience with CI/CD tools such as Git, Jenkins, Azure DevOps, or GitHub Actions.
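One of the skew-handling concepts listed above, key salting, can be illustrated in a few lines: rows carrying a single hot join key are spread across N salted variants so they land in multiple partitions. The salt count, key names, and hash choice are illustrative assumptions.

```python
# Sketch: key salting for skew handling - a hot key is split into
# num_salts deterministic buckets so its rows spread across partitions.
import hashlib

def salt_key(key: str, row_id: int, num_salts: int = 8) -> str:
    """Deterministically assign a row of a hot key to one of num_salts buckets."""
    bucket = int(hashlib.md5(f"{key}:{row_id}".encode()).hexdigest(), 16) % num_salts
    return f"{key}_{bucket}"
```

The dimension side of the join would be exploded with all num_salts variants of each key so the salted join still matches.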
Nice-to-Have
Cloudera Machine Learning (CML) or Cloudera Data Science Workbench (CDSW)
Delta Lake / Iceberg / Hudi
Cloudera Manager administration
Data catalog & lineage (Atlas)
Exposure to Kafka, NiFi, Informatica
HBase
Core Competencies
Ownership and accountability
Strong analytical and performance tuning mindset
Collaboration with cross-functional teams
Excellent documentation and communication skills