Research Data Engineer (India) — Smallest.ai
About The Role
This is not a typical data engineering role. You won't be building dashboards. You won't be maintaining pipelines no one touches.
You will take messy, noisy, real-world data and turn it into something models can learn from. Think of it as running a gold mine: dust goes in, gold comes out.
We work on speech, language, and real-time systems across 50+ languages.
The difference between a good model and a great one almost always comes down to data quality and data systems. That's where you come in.
What You'll Work On
- Data Pipelines (Real-time + Batch)
  - Build high-throughput pipelines for audio, text, and multimodal data
  - Run streaming and offline processing at scale
- Data Quality & Curation
  - Clean, filter, deduplicate, and normalize data (numbers, emails, code-mixed text, etc.)
  - Design heuristic and ML-based data filtering systems
- Multilingual Data Systems
  - Handle 50+ languages, accents, and code-mixed inputs
  - Build language-aware normalization and segmentation
- Training Data Engine
  - Build pipelines that continuously generate better training data from production
  - Develop active learning loops, data selection, and sampling strategies
- Evaluation & Benchmarking Pipelines
  - Create scalable eval datasets across languages and domains
  - Automate quality tracking for ASR, TTS, and conversational systems
- Data Infra for Research
  - Work closely with the research team to unblock experiments fast
  - Build systems that cut iteration time from weeks to hours
What This Role Is NOT
- Not a dashboard/reporting role
- Not a "move data from A to B" role
- Not a maintenance-heavy legacy pipeline role
What We're Looking For
- Strong fundamentals in data structures, systems, and pipelines
- Experience with large-scale data processing (audio/text preferred)
- Comfortable with messy, unstructured, real-world data
- Strong coding skills (Python required; systems experience is a plus)
- Understanding of ML/data pipelines (training, eval, data curation)
Bonus (Not Mandatory)
- Experience with speech/audio data (ASR/TTS)
- Familiarity with multilingual datasets
- Experience with streaming systems (Kafka, etc.)
- Exposure to data-centric AI / data quality frameworks
How We Work
- Speed over perfection
- Production over papers
- Systems that scale, not scripts that barely work
- Tight loop between data → model → eval → improvement
Who This Is For
- You enjoy working with raw, chaotic data
- You care about data quality more than tooling hype
- You like building systems that directly impact model performance
- You get excited by turning unusable data into a competitive advantage
Why Join Us
We're building real-time, multilingual voice AI systems.
At this level, models are only as good as the data behind them.
If you want to work on the layer that actually moves the needle, this is it.