Job Description
Project Role : Custom Software Engineer
Project Role Description : Develop custom software solutions to design, code, and enhance components across systems or applications. Use modern frameworks and agile practices to deliver scalable, high-performing solutions tailored to specific business needs.
Must have skills : PySpark
Good to have skills : NA
Minimum 3 year(s) of experience is required
Educational Qualification : 15 years of full-time education
We are seeking a skilled Data Engineer to design, build, and optimize scalable data pipelines on the Enterprise Data Lake (EDL) running on Cloudera (CDP) on AWS. The role involves working with the Hadoop ecosystem, building PySpark-based data processing pipelines, orchestrating workflows using Oozie and Control-M, integrating with AWS services (S3, IAM, EC2), and delivering secure, reliable, cloud-ready data solutions.
Key Responsibilities
Data Engineering & Platform Development
Build scalable data ingestion and processing pipelines using Cloudera CDP on AWS, Hadoop (HDFS, Hive, YARN), PySpark/Spark SQL, and AWS S3.
Design data flows between HDFS and S3 using DistCp, Spark read/write, file compaction, archival, and lifecycle policies.
Develop and optimize PySpark jobs for performance, partitioning, caching, and YARN resource allocation.
Build FastAPI microservices for data access, metadata, and operational endpoints.
Design scalable data models across Raw, Presentation, and Data Provisioning layers.
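As a flavor of the file-compaction work mentioned above, here is a minimal sketch of planning compaction batches for small files before a Spark rewrite job. The greedy packing strategy, the byte sizes, and the function name are illustrative assumptions, not a prescribed implementation.

```python
# Sketch: grouping small files into compaction batches close to a target
# output size (e.g. ~128 MB to match a typical HDFS block size).
# The greedy largest-first packing below is one simple heuristic.
from typing import Dict, List

def plan_compaction_batches(file_sizes: Dict[str, int], target_size: int) -> List[List[str]]:
    """Greedily group files into batches whose total size stays near target_size."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Largest-first packing tends to keep the batch count low.
    for path, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be read and rewritten as a single larger file by a Spark job.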
Workflow Orchestration & Automation
Develop and manage Apache Oozie workflows and coordinators including triggers, SLAs, HDFS/S3 path management, and kill/recovery actions.
Implement enterprise scheduling using Control-M with dependencies, calendars, alerts, SLAs, and automated retries.
Automate operational tasks using Shell/Bash scripting for monitoring, file operations, HDFS maintenance, and backfill processes.
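For the backfill processes mentioned above, a common building block is enumerating the partition dates a re-run must cover. A minimal sketch, assuming daily ISO-formatted (YYYY-MM-DD) partitions, which is an assumption about the layout rather than a stated requirement:

```python
# Sketch: generating the inclusive list of daily partition dates for a
# backfill window, as an Oozie coordinator or Control-M backfill job
# might iterate over missed run dates.
from datetime import date, timedelta
from typing import List

def backfill_dates(start: date, end: date) -> List[str]:
    """Inclusive ISO partition dates from start to end."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]
```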
Cloud, Storage & Platform Operations
Work with Cloudera on AWS including Cloudera Manager, Hive on Tez, Spark on YARN, cluster scaling, queues, and capacity planning.
Use AWS services: S3 (storage, versioning, lifecycle, encryption) and optionally IAM, EMR, EC2, Lambda.
Implement data security and governance using Ranger, Kerberos, TLS, audit logs, and data masking/tokenization (nice-to-have).
API Engineering (FastAPI on YARN + Hive)
Build FastAPI REST services interacting with Hive tables (HiveServer2 / Impala / LLAP / JDBC/ODBC) and Spark jobs on YARN.
Develop APIs to submit Spark/PySpark jobs, track job status/logs/YARN application IDs, execute Hive queries, and return dataset results.
Expose metadata, lineage, health checks, and data insights via APIs.
Implement asynchronous APIs for long-running Spark jobs.
Develop FastAPI middleware for authentication, logging, monitoring, retries, and circuit breaking.
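The asynchronous pattern for long-running Spark jobs described above can be sketched in plain asyncio: submission returns a job ID immediately, and a separate status call is polled until completion. The in-memory registry and the sleep standing in for a Spark job are illustrative stand-ins; a real FastAPI service would wrap the same logic in route handlers and track YARN application IDs.

```python
# Sketch: async submit/poll pattern for long-running jobs.
import asyncio
import uuid

JOBS: dict = {}  # illustrative in-memory job registry

async def _run_job(job_id: str, duration: float) -> None:
    JOBS[job_id] = "RUNNING"
    await asyncio.sleep(duration)   # placeholder for the real Spark job
    JOBS[job_id] = "SUCCEEDED"

async def submit_job(duration: float = 0.01) -> str:
    """Return immediately with a job ID; the job keeps running in the background."""
    job_id = uuid.uuid4().hex
    asyncio.create_task(_run_job(job_id, duration))
    await asyncio.sleep(0)          # let the task start and register itself
    return job_id

async def poll_until_done(job_id: str) -> str:
    """Poll the registry until the job reaches a terminal state."""
    while JOBS.get(job_id) not in ("SUCCEEDED", "FAILED"):
        await asyncio.sleep(0.005)
    return JOBS[job_id]
```

The same shape maps onto a POST endpoint that returns the job ID and a GET endpoint that reports status.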
Quality, Monitoring & Reliability
Implement data quality checks including schema validation, null checks, and reconciliation.
Monitor using Spark UI, YARN RM metrics, and Cloudera Manager alerts.
Optimize cluster and application performance, reduce cost, and improve pipeline efficiency.
Perform production issue triage, RCA, and preventive automation.
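The data quality checks listed above (schema validation, null checks, reconciliation) can be sketched on rows represented as plain dicts. The column names and expected schema are illustrative assumptions; in the actual pipeline the same checks would run against a Spark DataFrame schema.

```python
# Sketch: lightweight data-quality checks on dict-shaped rows.
from typing import Any, Dict, List

EXPECTED_SCHEMA = {"id": int, "amount": float}  # illustrative schema

def schema_violations(row: Dict[str, Any]) -> List[str]:
    """Columns that are missing or have the wrong type."""
    return [col for col, typ in EXPECTED_SCHEMA.items()
            if not isinstance(row.get(col), typ)]

def null_check(rows: List[Dict[str, Any]], col: str) -> int:
    """Count rows where col is missing or None."""
    return sum(1 for r in rows if r.get(col) is None)

def reconcile(source_count: int, target_count: int) -> bool:
    """Simple row-count reconciliation between source and target."""
    return source_count == target_count
```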
Required Skills & Experience
3-10+ years of experience in Data Engineering.
Hands-on experience with Cloudera (CDH/CDP), Hadoop, HDFS, Hive/Impala, PySpark/Spark SQL, Oozie workflows/coordinators/SLA, Control-M scheduling, AWS S3 architecture.
Strong Python development for ETL frameworks, exception handling, and testing.
Strong Linux/Shell scripting skills.
Understanding of distributed systems concepts including shuffle, skew handling, broadcast joins, and spill tuning.
Experience with CI/CD tools such as Git, Jenkins, Azure DevOps, or GitHub Actions.
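One of the skew-handling concepts listed above, key salting, can be illustrated in a few lines: rows carrying a single hot join key are spread across N salted variants so they land in multiple partitions. The salt count, key names, and hash choice are illustrative assumptions.

```python
# Sketch: key salting for skew handling - a hot key is split into
# num_salts deterministic buckets so its rows spread across partitions.
import hashlib

def salt_key(key: str, row_id: int, num_salts: int = 8) -> str:
    """Deterministically assign a row of a hot key to one of num_salts buckets."""
    bucket = int(hashlib.md5(f"{key}:{row_id}".encode()).hexdigest(), 16) % num_salts
    return f"{key}_{bucket}"
```

The dimension side of the join would be exploded with all num_salts variants of each key so the salted join still matches.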
Nice-to-Have
Cloudera Machine Learning (CML) or Cloudera Data Science Workbench (CDSW)
Delta Lake / Iceberg / Hudi
Cloudera Manager administration
Data catalog & lineage (Atlas)
Exposure to Kafka, NiFi, Informatica
HBase
Core Competencies
Ownership and accountability
Strong analytical and performance tuning mindset
Collaboration with cross-functional teams
Excellent documentation and communication skills