smallest.ai

Research Data Engineer

  • Posted 12 hours ago

Job Description

Research Data Engineer (India) — Smallest.ai

About The Role

This is not a typical data engineering role. You won't be building dashboards. You won't be maintaining pipelines no one touches.

You will take messy, noisy, real-world data and turn it into something models can learn from. Think of it as running a gold mine: you take dust and turn it into gold.

We work on speech, language, and real-time systems across 50+ languages.

The difference between a good model and a great one is almost always data quality + data systems. That's where you come in.

What You'll Work On

  • Data Pipelines (Real-time + Batch)
    • Build high-throughput pipelines for audio, text, and multimodal data
    • Streaming + offline processing at scale
  • Data Quality & Curation
    • Cleaning, filtering, deduplication, normalization (numbers, emails, code-mix, etc.)
    • Designing heuristics + ML-based data filtering systems
  • Multilingual Data Systems
    • Handling 50+ languages, accents, and code-mixed inputs
    • Language-aware normalization and segmentation
  • Training Data Engine
    • Build pipelines that continuously generate better training data from production
    • Active learning loops, data selection, sampling strategies
  • Evaluation & Benchmarking Pipelines
    • Create scalable eval datasets across languages and domains
    • Automate quality tracking for ASR, TTS, and conversational systems
  • Data Infra for Research
    • Work closely with the research team to unblock experiments fast
    • Build systems that reduce iteration time from weeks → hours

What This Role Is NOT

  • Not a dashboard/reporting role
  • Not a "move data from A to B" role
  • Not a maintenance-heavy legacy pipeline role

What We're Looking For

  • Strong fundamentals in data structures, systems, and pipelines
  • Experience with large-scale data processing (audio/text preferred)
  • Comfortable with messy, unstructured, real-world data
  • Strong coding skills (Python required; systems experience is a plus)
  • Understanding of ML/data pipelines (training, eval, data curation)

Bonus (Not Mandatory)

  • Experience with speech/audio data (ASR/TTS)
  • Familiarity with multilingual datasets
  • Experience with streaming systems (Kafka, etc.)
  • Exposure to data-centric AI / data quality frameworks

How We Work

  • Speed over perfection
  • Production over papers
  • Systems that scale, not scripts that barely work
  • Tight loop between data → model → eval → improvement

Who This Is For

  • You enjoy working with raw, chaotic data
  • You care about data quality more than tooling hype
  • You like building systems that directly impact model performance
  • You get excited by turning unusable data into competitive advantage

Why Join Us

We're building real-time, multilingual voice AI systems.

At this level, models are only as good as the data behind them.

If you want to work on the layer that actually moves the needle, this is it.

Job ID: 145601335