

About Humyn Labs
Humyn Labs builds the intelligence layer for physical-world AI — systems that perceive, reason, and act in real environments. Our work sits at the intersection of egocentric video understanding, embodied AI, robotics perception, and voice-driven interaction. We move fast, obsess over data quality, and ship at scale.
Humyn Labs converts human action across sound, sight, movement, and touch into high-quality multimodal data signals for physical AI. We operate in 20+ countries across India, Southeast Asia, Latin America, and the Middle East: the real-world environments where physical AI deploys, not the labs where it is built.
Our data isn't just collected; it's evaluated, defended, and production-ready. Because before AI can be trusted, its training data must be.
Role Overview
We are looking for a Senior Data Engineer to own, extend, and harden the multimodal data infrastructure that powers our AI training data supply chain. You will work directly with the Head of Data, building and maintaining production-grade pipelines that move real-world audio, video, and egocentric capture data from ingestion through AWS-scale storage, into GPU-backed technical validation, and finally to delivery-ready datasets for frontier AI lab customers.
This role sits at the intersection of data engineering and ML infrastructure — you will own the full pipeline: ingestion, lake management, quality validation, and compute handoff.
What You Will Work On
Multimodal Ingestion & Preprocessing
Build and maintain pipelines that ingest real-world audio, egocentric video, and image data at scale — covering format normalization, chunking, metadata extraction, and landing into S3-based storage with consistent schema and partitioning.
AWS Data Lake
Manage and extend the KGen data lake: Athena query optimization, Glue crawlers and cataloguing, Apache Hudi table management, Lake Formation column-level permissions, and S3 lifecycle policies at TB–PB scale.
GPU Validation Pipeline Handoff
Design and maintain the data layer that feeds GPU-backed technical validation workers: own sharding strategies, manifest generation, throughput optimization, and I/O design so that data loading never becomes the bottleneck for validation compute. Understand how data format choices (Parquet, WebDataset, sharded archives) affect GPU-side loading performance.
Airflow DAG Management
Author, debug, and monitor Airflow DAGs for scheduled processing, GPU job orchestration, and pipeline coordination across ingestion, validation, and delivery stages.
QC and Annotation Tooling
Support the FastAPI-backed audio QC portal used by annotation workers; extend data validation and quality-check scripts across egocentric video and audio datasets.
Universal Data Schema (UDS)
Contribute to and enforce the Universal Data Schema for audio, image, and code modalities in the Humyn Labs dataset marketplace — covering schema evolution, versioning, and partition strategies.
Infrastructure and Access Management
Maintain AWS IAM, Lake Formation, and S3 bucket policies; manage data engineer access controls; handle cross-region data movement and vendor data sharing infrastructure.
ETL and Third-Party Integrations
Build and maintain ingestion pipelines from APIs (Twitch, gaming analytics, Google Forms) into DynamoDB and PostgreSQL where required by business context.
You Must Have
Nice to Have
Job ID: 149885695