Senior Data Engineer

humyn labs

Bengaluru, India

4-6 Years

Job Description

About Humyn Labs

Humyn Labs builds the intelligence layer for physical-world AI — systems that perceive, reason, and act in real environments. Our work sits at the intersection of egocentric video understanding, embodied AI, robotics perception, and voice-driven interaction. We move fast, obsess over data quality, and ship at scale.

Humyn Labs converts human action across sound, sight, movement, and touch into high-quality multimodal data signals for physical AI. We operate across 20+ countries spanning India, Southeast Asia, Latin America, and the Middle East: the real-world environments where physical AI is deployed, not the labs where it is built.

Our data isn't just collected; it's evaluated, defended, and production-ready. Because before AI can be trusted, its training data must be.

Role Overview

We are looking for a Senior Data Engineer to own, extend, and harden the multimodal data infrastructure that powers our AI training data supply chain. You will work directly with the Head of Data, building and maintaining production-grade pipelines that move real-world audio, video, and egocentric capture data from ingestion through AWS-scale storage, into GPU-backed technical validation, and finally to delivery-ready datasets for frontier AI lab customers.

This role sits at the intersection of data engineering and ML infrastructure — you will own the full pipeline: ingestion, lake management, quality validation, and compute handoff.

What You Will Work On

Multimodal Ingestion & Preprocessing

Build and maintain pipelines that ingest real-world audio, egocentric video, and image data at scale — covering format normalization, chunking, metadata extraction, and landing into S3-based storage with consistent schema and partitioning.

AWS Data Lake

Manage and extend the KGen data lake: Athena query optimization, Glue crawlers and cataloguing, Apache Hudi table management, Lake Formation column-level permissions, and S3 lifecycle policies at TB–PB scale.

GPU Validation Pipeline Handoff

Design and maintain the data layer that feeds GPU-backed technical validation workers — own sharding strategies, manifest generation, throughput optimization, and I/O design so validation compute is never the bottleneck. Understand how data format choices (Parquet, WebDataset, sharded archives) affect GPU-side loading performance.

Airflow DAG Management

Author, debug, and monitor Airflow DAGs for scheduled processing, GPU job orchestration, and pipeline coordination across ingestion, validation, and delivery stages.

QC and Annotation Tooling

Support the FastAPI-backed audio QC portal used by annotation workers; extend data validation and quality-check scripts across egocentric video and audio datasets.

Universal Data Schema (UDS)

Contribute to and enforce the Universal Data Schema for audio, image, and code modalities in the Humyn Labs dataset marketplace — covering schema evolution, versioning, and partition strategies.

Infrastructure and Access Management

Maintain AWS IAM, Lake Formation, and S3 bucket policies; manage data engineer access controls; handle cross-region data movement and vendor data sharing infrastructure.

ETL and Third-Party Integrations

Build and maintain ingestion pipelines from APIs (Twitch, gaming analytics, Google Forms) into DynamoDB and PostgreSQL where required by business context.

You Must Have

4 to 6 years in a data engineering role with end-to-end pipeline ownership
Strong Python — async patterns, subprocess management, API clients, data processing at scale
Hands-on AWS — Athena, Glue, S3, DynamoDB, Lake Formation; production-grade, not just familiarity
Apache Hudi or Delta Lake — schema evolution, partition strategies, upsert patterns
SQL proficiency — able to write and optimise complex analytical queries
Experience with Airflow or an equivalent workflow orchestrator
Demonstrated experience with large-scale media data pipelines — audio/video format conversion, metadata extraction, chunking, egocentric or multimodal datasets (hard requirement, not a plus)
Understanding of how data format and I/O design affect downstream GPU compute workloads (WebDataset, sharded Parquet, TFRecord, or equivalent)

Nice to Have

Direct experience designing data handoff for GPU clusters (AWS Batch GPU instances, Ray, or SLURM)
Familiarity with ML training data formats and dataset standards used by AI labs (Hugging Face datasets, WebDataset, dataset cards)
Experience with rclone, large-scale file transfer, or cloud-to-cloud sync pipelines
Exposure to data lineage or provenance tooling (OpenLineage, DataHub, or custom metadata schemas)

About Company

humyn labs

Job Source: www.linkedin.com

Job ID: 149885695
