Key Responsibilities:
- Data Modeling and ETL: Design, build, optimize, and maintain data models and ETL processes to support business requirements.
- Data Pipeline Development: Develop end-to-end data pipelines and workflows from source to target systems.
- Data Infrastructure Management: Build, deploy, and manage scalable data infrastructure to handle large volumes of data efficiently.
- Data Access and Security: Manage data access and security so that analysts and data scientists can readily reach the data they need.
- Programming and Data Transformation: Develop Python and PySpark programs for data analysis, custom frameworks for rules generation, and business transformation logic using Spark DataFrames/RDDs (see the transformation sketch after this list).
- Integration with Big Data Ecosystem: Work with HBase, Hive, and other big data tools to design and implement efficient data solutions.
- DevOps and Deployment: Implement DevOps best practices for data pipelines and related infrastructure deployment.
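The business-transformation work described in the list above generally takes the form of Spark DataFrame operations. The following is a minimal, hypothetical sketch rather than part of the role description: the dataset, column names, and the high-value-order rule are illustrative assumptions.

```python
# Minimal sketch of a PySpark business-transformation step.
# The "orders" data, column names, and the 100-unit threshold are hypothetical;
# in practice the source would be HBase, Hive, or S3 rather than an in-memory list.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("order-transformations").getOrCreate()

orders = spark.createDataFrame(
    [("o1", "cust_a", 120.0), ("o2", "cust_b", 80.0), ("o3", "cust_a", 200.0)],
    ["order_id", "customer_id", "amount"],
)

# Example rule: flag high-value orders, then aggregate spend per customer.
transformed = (
    orders
    .withColumn("high_value", F.col("amount") > 100)
    .groupBy("customer_id")
    .agg(
        F.sum("amount").alias("total_spend"),
        F.sum(F.col("high_value").cast("int")).alias("high_value_orders"),
    )
)

transformed.show()
```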
Required Education:
- Bachelor's Degree in Computer Science, Information Technology, or a related field
Preferred Education:
- Master's Degree in a relevant field
Required Technical and Professional Expertise:
- 5+ years of experience with big data technologies, including Hadoop and Spark, using Scala and Python
- Experience with HBase and Hive
- Experience developing Python and PySpark programs for data ingestion, analysis, and transformation
- Experience building custom Python frameworks for rules generation
- Experience reading and writing HBase data with PySpark and applying business transformations using Spark DataFrames/RDDs (see the sketch after this list)
- Familiarity with AWS services such as S3, Athena, DynamoDB, and Lambda, as well as CI/CD tooling such as Jenkins and Git
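For context, HBase reads and writes from PySpark normally go through a Spark connector data source. The sketch below assumes the Apache hbase-spark connector jars are on the classpath; the table name, column family, and column mapping are hypothetical, and option names can differ between connector versions and vendor distributions.

```python
# Hedged sketch: read an HBase table into a Spark DataFrame via the Apache
# hbase-spark connector (assumed to be available). Table name, column
# family ("cf"), and column mapping are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-read").getOrCreate()

profiles = (
    spark.read
    .format("org.apache.hadoop.hbase.spark")
    .option("hbase.table", "customer_profiles")
    .option(
        "hbase.columns.mapping",
        "customer_id STRING :key, segment STRING cf:segment, "
        "lifetime_value DOUBLE cf:ltv",
    )
    .option("hbase.spark.use.hbasecontext", "false")
    .load()
)

# Apply a simple business transformation; writing back to HBase would use the
# same connector through profiles.write.format(...).
high_value = profiles.filter(profiles.lifetime_value > 1000)
high_value.write.mode("overwrite").parquet("/tmp/high_value_customers")
```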
Preferred Technical and Professional Expertise:
- Understanding of DevOps principles and practices
- Experience in building scalable, end-to-end data ingestion and processing solutions
- Strong knowledge of object-oriented and functional programming with Python, Java, or Scala