About the Role
We are looking for a senior, hands-on expert who can take speech systems from raw audio to reliable production features. You will build and improve core speech capabilities such as ASR, TTS, voice conversion, and speech-to-speech workflows, and you will own the engineering that makes them fast, scalable, and measurable in the real world.
This role is a strong fit if you enjoy the full stack of speech AI: signal-processing intuition, modern deep learning, decoding and streaming constraints, and practical deployment trade-offs.
Responsibilities
- Speech modeling that ships
  - Build, train, and iterate on ASR models for real-world conditions such as conversational speech, accents, noise, and far-field audio, with strong offline and online evaluation discipline.
  - Develop and improve TTS systems that are natural, low-latency, and stable in speaker identity and prosody, within production-quality inference constraints.
  - Work on voice conversion and accent conversion as needed, preserving intelligibility, naturalness, and speaker identity in streaming settings.
- Decoder and streaming engineering
  - Design and implement decoding stacks using proven libraries and patterns, including Kaldi and OpenFST, with features such as custom vocabulary injection, language-model rescoring, and beam-search tuning.
  - Build streaming inference systems with strict latency budgets and predictable behavior at scale, including monitoring and continuous-improvement loops.
- Speech analysis and speech intelligence
  - Deliver speech analytics building blocks such as VAD, diarization, speaker recognition, and quality analytics that improve end-to-end product outcomes.
  - Design robust evaluation harnesses and datasets for real user scenarios, including domain adaptation and behavior tuning across use cases.
- GenAI and LLM integration for voice experiences
  - Integrate speech components into LLM-based systems, including cascaded ASR-LLM-TTS pipelines, and drive joint optimization where it materially improves product quality.
  - Build or extend speech generation capabilities, including voice cloning, controllable prosody, and modern generative architectures, where relevant to the roadmap.
- Production deployment and operational excellence
  - Own end-to-end delivery: prototyping, ablations, training, evaluation, optimization, deployment, and post-launch monitoring.
  - Partner closely with product and platform teams to integrate models into real-time systems and to maintain reliability, uptime, and quality under production traffic.
Qualifications
- 6+ years building production-grade speech or audio ML systems, or equivalent depth through research plus shipped production impact.
- Strong programming ability in Python, plus comfort in C or C++ for performance-critical components.
- Proven expertise in deep learning for speech (PyTorch or TensorFlow) and practical model training and serving.
- Solid fundamentals in speech and audio, including signal processing concepts and real-world acoustic variability.
- Experience deploying models into real-time or high-throughput systems, including evaluation, scalability, and production reliability.
Required Skills
- Hands-on experience with decoding toolchains and speech customization, including WFST concepts, beam search, and LM rescoring.
- Experience with conversational or telephony speech systems, where latency, robustness, and product polish matter more than leaderboard wins.
- Experience with generative speech systems such as voice cloning, flow matching, diffusion or autoregressive Transformers, and model optimization for real-time inference.
- Familiarity with modern speech stacks and frameworks such as NVIDIA NeMo (or comparable) for ASR and TTS workflows.
- Publications or strong open-source contributions in speech and audio AI.
Pay range and compensation package
This is a senior position; freshers (entry-level candidates) should not apply.
Equal Opportunity Statement
We are an equal opportunity employer, committed to diversity and inclusion.