Key Responsibilities:
- Data Pipeline Development: Design, build, and maintain scalable ETL (Extract, Transform, Load) pipelines using Azure Databricks, Apache Spark, and Python.
- Spark Optimization: Develop and optimize Spark jobs for large-scale data processing on Databricks. Ensure jobs run efficiently, leveraging distributed computing for optimal performance (a brief tuning sketch appears after this list).
- Data Integration: Integrate data from various sources, including structured and unstructured data, into the Azure cloud environment using Databricks and related tools.
- Collaboration with Data Scientists & Analysts: Collaborate with data scientists, analysts, and business stakeholders to understand data requirements and deliver robust data solutions that enable advanced analytics, machine learning, and reporting.
- Azure Integration: Work closely with Azure services such as Azure Data Lake, Azure SQL Database, Azure Blob Storage, Azure Synapse Analytics, and Azure Data Factory for comprehensive data processing solutions.
- Data Transformation: Use Spark SQL, PySpark, and Databricks notebooks to perform data transformations that turn raw data into actionable insights (a minimal example appears after this list).
- Automation & Scheduling: Implement automated job scheduling and orchestration for regular data processing tasks, ensuring data is consistently processed and available for downstream consumption.
- Performance Tuning & Troubleshooting: Optimize the performance of data workflows and Spark applications on Databricks. Troubleshoot and resolve data-related issues and bottlenecks.
- Cloud Security: Ensure that data security and compliance standards are followed for cloud-based solutions, including managing data access, encryption, and auditing within the Azure Databricks environment.
- Monitoring & Logging: Implement logging and monitoring for the Azure Databricks environment to track job performance and failures and to support troubleshooting.
- Documentation & Best Practices: Maintain proper documentation for data pipelines, processes, and technical workflows. Follow best practices for coding, version control, and deployment.
- Stay Updated with Technology Trends: Keep up to date with the latest developments in Azure Databricks, Apache Spark, and related technologies. Apply new techniques to improve performance and scalability.
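To illustrate the kind of transformation work described under Data Transformation, here is a minimal PySpark sketch; the storage path, column names, and table name are hypothetical placeholders, not specifics of this role.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession already exists as `spark`; getOrCreate() reuses it
spark = SparkSession.builder.getOrCreate()

# Hypothetical raw landing path in Azure Data Lake Storage (placeholder)
raw_path = "abfss://raw@examplelake.dfs.core.windows.net/sales/2024/"

# Read raw CSV files; an explicit schema would normally be preferred in production
raw_df = spark.read.option("header", "true").csv(raw_path)

# Clean and transform: drop malformed rows, normalize types, derive a date column
clean_df = (
    raw_df
    .dropna(subset=["order_id", "amount"])
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_ts"))
)

# Aggregate into a reporting-friendly shape
daily_sales = (
    clean_df
    .groupBy("order_date", "region")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("order_id").alias("order_count"),
    )
)

# Persist as a Delta table for downstream analytics (table name is a placeholder)
daily_sales.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_sales")
```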
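And a brief, hedged illustration of the tuning mentioned under Spark Optimization: broadcasting a small dimension table to avoid shuffling the large side of a join, and repartitioning by the write key before persisting. The table and column names are again hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical fact and dimension tables (placeholders)
fact_df = spark.table("analytics.daily_sales")
dim_df = spark.table("analytics.region_dim")

# Broadcast the small dimension table so the join does not shuffle the large fact table
joined_df = fact_df.join(F.broadcast(dim_df), on="region", how="left")

# Repartition by the write key to reduce small files and skew before persisting
(
    joined_df
    .repartition("order_date")
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("analytics.daily_sales_enriched")
)
```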
Required Qualifications & Skills:
- 3-5 years of hands-on experience in data engineering and working with Azure Databricks.
- Strong proficiency in Apache Spark, particularly in Databricks for building large-scale data pipelines and distributed data processing applications.
- Solid experience with Azure cloud services, including Azure Data Lake, Azure SQL Database, Azure Blob Storage, Azure Synapse Analytics, and Azure Data Factory.
- Proficiency in Python, Scala, or SQL for data engineering tasks, with a focus on PySpark for data processing.
- Experience working with structured and unstructured data from a variety of sources, including relational databases, APIs, and flat files.
- Familiarity with Databricks notebooks for developing, testing, and collaborating on data workflows.
- In-depth understanding of ETL processes, data pipelines, and data transformation techniques.
- Hands-on experience with cloud-based data storage solutions (e.g., Azure Data Lake, Blob Storage) and data warehousing concepts.
- Knowledge of data security best practices in a cloud environment (e.g., data encryption, access controls, Azure Active Directory).
- Experience with CI/CD pipelines and version control systems like Git.
- Familiarity with containerization and deployment practices using Docker and Kubernetes is a plus.
- Strong debugging, performance tuning, and problem-solving skills.
- Excellent written and verbal communication skills, with the ability to collaborate effectively across teams.
- Bachelor's degree in Computer Science, Information Technology, or a related field.