Siemens

Senior Data Engineer

5-10 Years

This job is no longer accepting applications

  • Posted 4 months ago
  • Over 100 applicants

Job Description

Key Responsibilities

  • Design & Architect Scalable Data Pipelines: Architect, build, and optimize high-throughput ETL pipelines using AWS Glue, Lambda, and EMR to handle large datasets and complex data workflows. Ensure pipelines scale efficiently and handle both real-time and batch processing (see the Glue/PySpark sketch after this list).
  • Cloud Data Infrastructure Management: Implement, monitor, and maintain a cloud-native data infrastructure using AWS services like S3 for data storage, Redshift for data warehousing, and EMR for big data processing. Build robust, cost-effective solutions for storing, processing, and querying large datasets efficiently.
  • Data Transformation & Processing: Develop highly performant data transformation processes using Apache Spark on EMR for distributed data processing and parallel computation. Write optimized Spark jobs in Python (PySpark) for efficient data transformation.
  • Real-time Data Streaming Solutions: Design and implement real-time data ingestion and streaming systems using AWS Kinesis or Apache Kafka to handle event-driven architectures, process continuous data streams, and support real-time analytics (see the Kinesis producer sketch after this list).
  • Orchestration & Automation: Use Apache Airflow to schedule and orchestrate complex ETL workflows. Automate data pipeline processes, ensuring reliability, data integrity, and ease of monitoring, and implement self-healing workflows that recover from failures automatically (see the Airflow DAG sketch after this list).
  • Data Warehouse Optimization & Management: Develop and optimize data models, schemas, and queries in Amazon Redshift to ensure low-latency querying and scalable analytics. Apply best practices for data partitioning, distribution and sort keys, and query optimization to increase performance and minimize costs (see the Redshift DDL sketch after this list).
  • Containerization & Orchestration: Leverage Docker to containerize data engineering applications for better portability and consistent runtime environments. Use AWS Fargate for running containerized applications in a serverless environment, ensuring easy scaling and reduced operational overhead.
  • Monitoring & Debugging: Build automated monitoring and alerting systems to proactively detect and troubleshoot pipeline issues, ensuring data quality and operational efficiency. Use tools like CloudWatch, Prometheus, or other logging frameworks to ensure end-to-end visibility of data pipelines (see the CloudWatch sketch after this list).
  • Collaboration with Cross-functional Teams: Work closely with data scientists, analysts, and application developers to design data models and ensure proper data availability. Collaborate in the development of solutions that meet the business's data needs, from experimentation to production.
  • Security & Compliance: Implement data governance policies, security protocols, and compliance measures for handling sensitive data, including encryption, auditing, and IAM role-based access control in AWS.
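
Below are a few illustrative sketches of what these responsibilities can look like in practice. First, a minimal AWS Glue/PySpark ETL job for the pipeline and transformation items above; the catalog database, table, and S3 path are hypothetical placeholders, not details from this posting.

```python
# Minimal AWS Glue ETL job sketch (PySpark). Database, table, and S3 path
# names are hypothetical placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw data registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_raw",      # hypothetical database
    table_name="raw_events",       # hypothetical table
).toDF()

# Example transformation: deduplicate and stamp a processing-date partition.
cleaned = (
    raw.dropDuplicates(["event_id"])
       .withColumn("process_date", F.current_date())
)

# Write partitioned Parquet back to S3 for downstream Redshift/Athena use.
(cleaned.write
        .mode("overwrite")
        .partitionBy("process_date")
        .parquet("s3://example-data-lake/curated/events/"))

job.commit()
```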
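
Next, a minimal producer-side sketch for real-time ingestion with AWS Kinesis via boto3, for the streaming item above; the stream name, region, and event shape are hypothetical.

```python
# Minimal Kinesis producer sketch using boto3. Stream name, region, and
# event payload are hypothetical.
import json
import time
import uuid

import boto3

kinesis = boto3.client("kinesis", region_name="eu-central-1")

def publish_event(event: dict, stream_name: str = "example-events-stream") -> None:
    """Send one JSON event to a Kinesis stream, keyed for even shard spread."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(uuid.uuid4()),
    )

if __name__ == "__main__":
    publish_event({
        "event_id": str(uuid.uuid4()),
        "type": "sensor_reading",
        "value": 42.0,
        "ts": time.time(),
    })
```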
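
A minimal Airflow DAG sketch for the orchestration item above, with retries as a simple self-healing measure; the DAG id, schedule, and Glue job name are hypothetical, and the operator import assumes the Amazon provider package for Airflow 2.x.

```python
# Minimal Airflow DAG sketch. DAG id, schedule, and Glue job name are
# hypothetical; retries give basic self-healing on transient failures.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

default_args = {
    "retries": 3,                           # rerun failed tasks automatically
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_events_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",                   # nightly at 02:00 UTC
    catchup=False,
    default_args=default_args,
) as dag:
    transform_events = GlueJobOperator(
        task_id="transform_events",
        job_name="example-events-glue-job",  # hypothetical Glue job
    )
```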
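
For the Redshift item above, a sketch of distribution- and sort-key choices issued through the boto3 Redshift Data API; the cluster, database, user, and table are hypothetical.

```python
# Sketch: issue Redshift DDL through the boto3 Redshift Data API.
# Cluster, database, user, and table names are hypothetical.
import boto3

client = boto3.client("redshift-data", region_name="eu-central-1")

# DISTKEY co-locates joins on customer_id; SORTKEY speeds range scans by date.
ddl = """
CREATE TABLE IF NOT EXISTS analytics.fact_orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
"""

client.execute_statement(
    ClusterIdentifier="example-redshift-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=ddl,
)
```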
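
Finally, for the monitoring item above, a sketch of publishing a custom pipeline metric and alarming on it with CloudWatch via boto3; the namespace, metric, threshold, and SNS topic ARN are hypothetical.

```python
# Sketch: publish a custom pipeline metric and alarm on low volume.
# Namespace, metric, alarm, and SNS topic ARN are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")

# Emit the number of rows processed by a pipeline run.
cloudwatch.put_metric_data(
    Namespace="DataPipelines/EventsETL",
    MetricData=[{
        "MetricName": "RowsProcessed",
        "Value": 125000,
        "Unit": "Count",
    }],
)

# Alarm if a run processes suspiciously few rows (possible upstream failure).
cloudwatch.put_metric_alarm(
    AlarmName="events-etl-low-volume",
    Namespace="DataPipelines/EventsETL",
    MetricName="RowsProcessed",
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=1000,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:eu-central-1:123456789012:data-alerts"],
)
```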

Required Skills and Experience

  • 5+ years of hands-on experience in building, maintaining, and optimizing data pipelines, ideally in a cloud-native environment.
  • ETL Expertise: Solid understanding of ETL/ELT processes and experience with tools like AWS Glue for building serverless ETL pipelines. Expertise in designing data transformation logic to move and process data efficiently across systems.
  • AWS Services: Deep experience working with core AWS cloud services:
      • S3: Designing data lakes and ensuring scalability and performance.
      • AWS Glue: Writing custom jobs for transforming data.
      • Lambda: Writing event-driven functions to process and transform data on demand (see the handler sketch after this list).
      • Redshift: Optimizing data warehousing operations for efficient query performance.
      • EMR (Elastic MapReduce): Running distributed processing frameworks like Apache Spark or Hadoop to process large datasets.
  • Big Data Technologies: Expertise in using Apache Spark for distributed data processing at scale. Experience with real-time data processing using Apache Kafka and AWS Kinesis for building streaming data pipelines.
  • Data Orchestration: Strong experience with Apache Airflow or similar workflow orchestration tools for scheduling, monitoring, and managing ETL jobs and data workflows.
  • Programming & Scripting: Proficiency in Python for building custom data pipelines and Spark jobs, and knowledge of coding best practices for performance, maintainability, and reliability.
  • SQL & Query Optimization: Advanced knowledge of SQL and experience in query optimization, partitioning, and indexing for working with large datasets in Redshift and other data platforms.
  • CI/CD & DevOps Tools: Experience with version control systems like Git, CI/CD pipelines, and infrastructure-as-code tools such as Terraform or AWS CloudFormation to automate deployment and infrastructure management.
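
For the Lambda item above, a minimal sketch of an event-driven handler reacting to S3 object creation and forwarding work to a queue; the queue URL and routing logic are hypothetical placeholders.

```python
# Minimal S3-triggered Lambda handler sketch. The queue URL and routing
# logic are hypothetical placeholders.
import json
import urllib.parse

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/example-ingest-queue"

def handler(event, context):
    """Forward metadata for each newly created S3 object to an ingest queue."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"processed": len(records)}
```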

Preferred Qualifications

  • Data Streaming: Experience in designing and building real-time data streaming solutions using Kafka or Kinesis for real-time analytics and event processing.
  • Data Governance & Security: Familiarity with data governance practices, data cataloging, and data lineage tools to ensure the quality and security of data.
  • Advanced Data Analytics Support: Knowledge of supporting machine learning pipelines and building data systems that can scale to meet the requirements of AI/ML workloads.
  • Certifications: An AWS certification is preferred.

More Info

Open to candidates from: India

Job ID: 110013127
