Role: PySpark/Scala Developer
Experience: 5+ years
Location: Pan India
Functional Skills: Experience in Credit Risk/Regulatory risk domain
Technical Skills: Spark, PySpark, Python, Hive, Scala, MapReduce, Unix shell scripting
Good to Have Skills: Exposure to Machine Learning Techniques
Job Description:
5+ years of experience developing, fine-tuning, and implementing programs/applications using Python/PySpark/Scala on a Big Data/Hadoop platform.
Roles and Responsibilities:
- Work with a leading bank's Risk Management team on specific projects/requirements pertaining to risk models in consumer and wholesale banking
- Enhance Machine Learning Models using PySpark or Scala
- Work with Data Scientists to build ML models based on business requirements and follow the ML lifecycle to deploy them all the way to the production environment
- Participate in feature engineering, model training, scoring, and retraining
- Architect data pipelines and automate data ingestion and model jobs
Skills and competencies:
Required:
- Strong analytical skills in conducting sophisticated statistical analysis using bureau/vendor data, customer performance data, and macro-economic data to solve business problems
- Working experience in PySpark and Scala, developing code to validate and implement models in Credit Risk/Banking
- Experience with distributed systems such as Hadoop/MapReduce and Spark, streaming data processing, and cloud architecture
- Familiarity with machine learning frameworks and libraries (such as scikit-learn, SparkML, TensorFlow, PyTorch, etc.)
- Experience in systems integration, web services, and batch processing
- Experience in migrating code to PySpark/Scala is a big plus
- Ability to act as a liaison, conveying the information needs of the business to IT and data constraints to the business; the same applies to conveying business strategy and IT strategy, business processes, and workflow
- Flexibility in approach and thought process
- Willingness to learn and keep up with periodic changes in regulatory requirements as per the Fed
Education Qualification: Master's degree with a specialization in Statistics, Mathematics, or Finance, or an Engineering degree
Must-Have:
- 5+ years of experience in data engineering, with a strong focus on PySpark/Python for big data processing.
- Expertise in building data pipelines and ingestion frameworks from relational, semi-structured (JSON, XML), and unstructured sources (logs, PDFs).
- Proficiency in Python with strong knowledge of data processing libraries.
- Strong SQL skills for querying and validating data in platforms like Amazon Redshift, PostgreSQL, or similar.
- Experience with distributed computing frameworks (e.g., Spark on EMR, Databricks).
- Familiarity with workflow orchestration tools (e.g., AWS Step Functions, or similar).
- Solid understanding of data lake / data warehouse architectures and data modeling basics.
Good-to-Have:
- Familiarity with Delta Lake or similar for large-scale data storage.
- Exposure to real-time streaming frameworks (e.g., Spark Structured Streaming, Kafka).
- Knowledge of data governance, lineage, and cataloging tools (e.g., AWS Glue Catalog, Apache Atlas).
- Understanding of DevOps/CI-CD pipelines for data projects using Git, Jenkins, or similar tools.
Responsibilities of / Expectations from the Role:
- Design and build robust, scalable ETL/ELT pipelines using PySpark to ingest data from diverse sources (databases, logs, APIs, files).
- Transform and curate raw transactional and log data into analysis-ready datasets in the Data Hub and analytical data marts.
- Develop reusable and parameterized Spark jobs for batch and micro-batch processing.
- Optimize performance and scalability of PySpark jobs across large data volumes.
- Ensure data quality, consistency, lineage, and proper documentation across ingestion flows.
- Collaborate with Data Architects, Modelers, and Data Scientists to implement ingestion logic aligned with business needs.
- Work with cloud-based data platforms (e.g., AWS S3, Glue, EMR, Redshift) for data movement and storage.
- Support version control, CI/CD, and infrastructure-as-code where applicable.