Speech Language Specialist

HCLTech

Noida, India

6-8 Years

Posted 6 hours ago

Job Description

Job Title: Speech AI Engineer

Location: Noida

Years of experience: 6+ years

Role summary

We are looking for a senior hands-on expert who can take speech systems from raw audio to reliable production features. You will build and improve core speech capabilities such as ASR, TTS, voice conversion, and speech-to-speech workflows, and you will also own the engineering work that makes them fast, scalable, and measurable in the real world.

This role is a strong fit if you enjoy the full stack of speech AI: signal processing intuition, modern deep learning, decoding and streaming constraints, and practical deployment trade-offs.

What you will own

1) Speech modeling that ships

Build, train, and iterate on ASR models for real-world conditions such as conversational speech, accents, noise, and far-field audio, with strong offline and online evaluation discipline.
Develop and improve TTS systems that are natural, low-latency, and stable in speaker identity and prosody, while meeting production inference constraints.
Work on voice conversion and accent conversion when needed, preserving intelligibility, naturalness, and speaker identity in streaming settings.

2) Decoder and streaming engineering

Design and implement decoding stacks using proven libraries and patterns, including Kaldi and OpenFST, and features like custom vocabulary injection, language model rescoring, and beam search tuning.
Build streaming inference systems with strict latency budgets and predictable behavior at scale, including monitoring and continuous improvement loops.

3) Speech analysis and speech intelligence

Deliver speech analytics building blocks such as VAD, diarization, speaker recognition, and quality analytics that improve end-to-end product outcomes.
Design robust evaluation harnesses and datasets for real user scenarios, including domain adaptation and behavior tuning across use cases.

4) GenAI and LLM integration for voice experiences

Integrate speech components into LLM-based systems, including cascaded ASR plus LLM plus TTS pipelines, and drive joint optimization where it materially improves product quality.
Build or extend speech generation capabilities including voice cloning, controllable prosody, and modern generative architectures where relevant to the roadmap.

5) Production deployment and operational excellence

Own end-to-end delivery: prototyping, ablations, training, evaluation, optimization, deployment, and post-launch monitoring.
Partner closely with product and platform teams to integrate models into real-time systems and maintain reliability, uptime, and quality under production traffic.

Required qualifications

6+ years building production-grade speech or audio ML systems, or equivalent depth through research plus shipped production impact.
Strong programming ability in Python, plus comfort in C or C++ for performance-critical components.
Proven expertise in deep learning for speech (PyTorch or TensorFlow) and practical model training and serving.
Solid fundamentals in speech and audio, including signal processing concepts and real-world acoustic variability.
Experience deploying models into real-time or high-throughput systems, including evaluation, scalability, and production reliability.

Strongly preferred

Hands-on experience with decoding toolchains and speech customization, including WFST concepts, beam search, and LM rescoring.
Experience with conversational or telephony speech systems, where latency, robustness, and product polish matter more than leaderboard wins.
Experience with generative speech systems and techniques such as voice cloning, flow matching, diffusion, or autoregressive Transformers, and with model optimization for real-time inference.
Familiarity with modern speech stacks and frameworks such as NVIDIA NeMo (or comparable) for ASR and TTS workflows.
Publications or strong open-source contributions in speech and audio AI.

Freshers, kindly do not apply for this job.