Design and implement data pipelines, ETL processes, schemas, and data models to ingest, process, and prepare multi-petabyte scale datasets for downstream analytics and machine learning.
Build and optimize data processing systems on modern platforms such as Spark, Delta Lake, and Kafka.
Implement data quality, validation, and monitoring measures leveraging tools such as Great Expectations.
Ensure compliance with security, access control, and regulatory requirements related to PHI and other sensitive data types.
Support adoption of emerging standards like FHIR for healthcare data exchange.
Collaborate with data scientists, analysts, and engineers to understand data needs and deliver performant, reliable data products.
Keep track of emerging technologies and trends in data engineering, incorporating modern tooling and best practices.
Qualifications
Experience building and operating production big data platforms and pipelines.
Strong experience with SQL, Python, Spark, Presto, Delta Lake, workflow orchestrators, distributed message buses, the Apache big data tool suite, Docker, Kubernetes, and MPP systems.
Hands-on experience designing and implementing cloud-based data solutions on platforms like Azure, AWS, or GCP, optimizing for scalability, cost-efficiency, and performance.
Implement and maintain data lakes, warehouses, and lakehouses, including data modeling, ETL processes, and data quality assurance, to empower data-driven decision-making.
Develop real-time data pipelines using streaming technologies like Apache Kafka or Azure Event Hubs, enabling timely insights and actions from incoming data streams.
Manage and enhance distributed data systems (e.g., Hadoop, Spark) to efficiently process large-scale datasets, ensuring data availability and reliability.
Previous experience working with health data and the Azure cloud is a strong plus.
Experience with Databricks or Microsoft Fabric.
Strong track record of designing and implementing scalable data models, schemas, and ETL logic.
Experience with data governance, master data management, data pseudonymization and anonymization, and data catalog solutions.
A strong interest in learning new things and a team-player ethic.
Strong analytical skills and a good understanding of data structures and algorithms.
Some exposure to Nextflow and/or Nextflow Tower.
Nice To Have
Experience building data pipelines for machine learning.
Knowledge of genomics, medical imaging, and/or EHR data domains.
Knowledge of HIPAA, HL7, and other healthcare data privacy and interoperability requirements.
Hands-on experience with fully managed data warehousing solutions such as Azure Synapse, AWS Redshift, BigQuery, and Snowflake.