About the role
You own the core speech pipeline for an on-premise legal transcription system: everything from raw courtroom audio to speaker-labeled segments, plus the speech recognition that transcribes them. This is a hybrid role spanning two domains, ASR fine-tuning and speaker diarization.

On the ASR side, you own the core recognition engine: fine-tuning Qwen3-ASR-1.7B on legal-domain audio, building the LoRA training pipeline, implementing pseudo-labeling for semi-supervised learning, integrating the Cohere Transcribe fallback, and developing the confidence scoring system. This is the highest-impact IC role; your model's accuracy directly determines whether the system meets its ≥95% draft-accuracy target. You report to the Director of AI, who sets architecture and quality gates; you own implementation and accuracy.

On the audio side, you own everything between raw courtroom audio and speaker-labeled segments: VAD, denoising, speaker diarization, speaker enrollment, role classification, and audio data processing. Your output is the foundation every downstream component builds on; if speakers are wrong or segments are misaligned, the transcript is wrong regardless of ASR accuracy. You are the team's audio domain expert.
What you'll do
ASR (50%):
- ASR fine-tuning: LoRA fine-tune Qwen3-ASR-1.7B on 50-200 hours of labeled legal audio. Design the data-mixing strategy (legal 70%, public legal 20%, general 10%). Implement the semi-supervised noisy-student pipeline: pseudo-label unlabeled audio, filter by quality heuristics, retrain iteratively.
- Forced alignment: Integrate Qwen3-ForcedAligner-0.6B to produce word-level timestamps on every transcribed segment.
- Ensemble/fallback: Build confidence-based routing — run low-confidence segments through Cohere Transcribe, select the hypothesis with higher domain-LM score.
- Confidence scoring: Implement the ConfidenceAnalyzer (speech-rate anomaly detection, repetition detection, LM perplexity flagging) and the triage logic (HIGH/MED/LOW → auto-accept/flag/must review).
- LM integration: Integrate shallow fusion (KenLM 4-gram from the NLP Engineer) and neural LM rescoring into the beam search pipeline. Collaborate with the NLP Engineer on hot-word biasing via the HotWordLogitsProcessor.
- Accuracy ownership: Track WER, legal-term WER, and proper-noun WER across training iterations (targets: ≤5% WER overall, ≤8% on legal terms). Present accuracy reports to the Director for quality-gate decisions.
- Model versioning: Maintain LoRA adapter checkpoints with metadata (training data hash, eval metrics, date). Coordinate with the Platform Engineer on the model registry.
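To make the triage responsibility concrete, here is a minimal sketch of HIGH/MED/LOW routing. The field names, signal set, and thresholds are illustrative assumptions for this sketch, not the project's actual ConfidenceAnalyzer interface:

```python
from dataclasses import dataclass

@dataclass
class SegmentConfidence:
    # Illustrative per-segment signals a confidence analyzer might emit.
    avg_logprob: float         # mean token log-probability from the ASR model
    speech_rate_anomaly: bool  # unusually fast/slow speech for this segment
    has_repetition: bool       # n-gram loop detected in the hypothesis
    lm_perplexity: float       # domain-LM perplexity of the hypothesis

def triage(seg: SegmentConfidence,
           logprob_floor: float = -0.35,
           ppl_ceiling: float = 150.0) -> str:
    """Map analyzer signals to HIGH/MED/LOW -> auto-accept/flag/must-review."""
    red_flags = sum([
        seg.avg_logprob < logprob_floor,
        seg.speech_rate_anomaly,
        seg.has_repetition,
        seg.lm_perplexity > ppl_ceiling,
    ])
    if red_flags == 0:
        return "HIGH"   # auto-accept
    if red_flags == 1:
        return "MED"    # flag for reviewer attention
    return "LOW"        # must review; candidate for the fallback ASR path

clean = SegmentConfidence(-0.1, False, False, 60.0)
bad = SegmentConfidence(-0.8, False, True, 300.0)
print(triage(clean), triage(bad))  # HIGH LOW
```

In practice the thresholds would be tuned on a held-out set so that HIGH segments actually meet the accuracy target when auto-accepted.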
Speaker diarization & audio processing (40%):
- Audio preprocessing: Build the intake pipeline: Silero VAD, DeepFilterNet/RNNoise for denoising, channel separation for multi-mic courtroom setups, resampling to 16kHz mono.
- Speaker diarization: Configure and tune pyannote.audio 3.x for courtroom audio: clustering threshold, minimum segment duration, overlap detection. Target: ≤10% DER.
- Speaker enrollment: Implement enrollment-based speaker identification: capture 30-60s of enrollment audio per courtroom role, then match diarization clusters to enrolled speakers via cosine similarity on ECAPA-TDNN/wespeaker embeddings.
- Role classification: Build the turn-pattern-based classifier (utterance duration, question ratio, authority markers → MLP/XGBoost) for sessions without enrollment. Collaborate with the NLP Engineer on the DeBERTa-based classifier when text features are available.
- Speaker consistency: Solve cross-session and cross-window speaker merging for multi-hour hearings via embedding-similarity matching with running centroid updates.
- Overlap handling: Tune pyannote's powerset-based overlap detection for courtroom cross-talk (objections during testimony). Flag overlap regions for human review.
- Audio data pipeline: Process sourced audio (SCOTUS, C-SPAN, client recordings) into training-ready format: segmentation, speaker labeling, format normalization. Work with the Data Engineer on audio annotation and training data curation.
- Edge-case robustness: Handle variable courtroom acoustics (reverb, HVAC noise, phone/video testimony, masked speech), speaker drift over long sessions, and very short utterances (clerk/bailiff).
- Evaluation: Measure DER, speaker confusion rate, and role assignment accuracy. Report to the Director for quality-gate reviews.
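The enrollment-matching step above, assigning diarization clusters to enrolled speakers by cosine similarity, could be sketched as follows. This assumes embeddings (e.g., ECAPA-TDNN vectors) are already extracted; the 0.6 threshold is an assumption to tune on held-out courtroom audio:

```python
import numpy as np

def match_clusters_to_enrolled(
    cluster_embeddings: dict[str, np.ndarray],
    enrolled: dict[str, np.ndarray],
    threshold: float = 0.6,  # assumed value; tune per deployment
) -> dict[str, str]:
    """Assign each diarization cluster to the enrolled speaker with the
    highest cosine similarity, or 'unknown' if no score clears the threshold."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    assignments = {}
    for cluster_id, emb in cluster_embeddings.items():
        best_name, best_sim = "unknown", threshold
        for name, ref in enrolled.items():
            sim = cosine(emb, ref)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        assignments[cluster_id] = best_name
    return assignments

# Toy 2-D embeddings standing in for 192-dim speaker vectors:
enrolled = {"judge": np.array([1.0, 0.0]), "witness": np.array([0.0, 1.0])}
clusters = {"spk0": np.array([0.9, 0.1]), "spk1": np.array([-1.0, 0.3])}
assignments = match_clusters_to_enrolled(clusters, enrolled)
# spk0 matches the judge; spk1 clears no threshold and stays "unknown"
```

Clusters left "unknown" would fall through to the turn-pattern role classifier described above.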
Integration (10%):
- Collaborate with NLP Engineer on hot-word boosting and LM fusion
- Support production deployment of speech models
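As a flavor of the LM-fusion collaboration, here is a simplified n-best rescoring sketch: interpolate the ASR log-probability with a domain-LM score and pick the best hypothesis. The weights and the toy LM are assumptions; in the real pipeline the LM score would come from the NLP Engineer's KenLM 4-gram:

```python
def rescore_nbest(hypotheses, lm_score_fn, lm_weight=0.3, length_bonus=0.5):
    """Pick the best hypothesis from an n-best list.

    hypotheses: list of (text, asr_logprob) pairs
    lm_score_fn: text -> LM log-probability
    lm_weight, length_bonus: assumed values; tune on a dev set
    """
    def combined(text: str, asr_lp: float) -> float:
        n_words = len(text.split())
        return asr_lp + lm_weight * lm_score_fn(text) + length_bonus * n_words
    return max(hypotheses, key=lambda h: combined(h[0], h[1]))

def toy_legal_lm(text: str) -> float:
    # Toy stand-in for a domain LM: rewards a known legal phrase.
    return -1.0 * len(text.split()) + (2.0 if "voir dire" in text else 0.0)

nbest = [
    ("the void higher process", -4.0),  # acoustically best, wrong phrase
    ("the voir dire process", -4.5),    # slightly worse acoustically
]
best_text, _ = rescore_nbest(nbest, toy_legal_lm)
# The domain LM flips the ranking toward "the voir dire process"
```

Shallow fusion proper applies the same interpolation per token inside beam search rather than over whole hypotheses; the scoring idea is the same.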
What you bring
- 5+ years in ML engineering with both ASR and speaker diarization experience
- Hands-on fine-tuning of large speech models (Whisper, Qwen, Conformer, or similar)
- Hands-on diarization experience (pyannote.audio, NeMo, or custom clustering)
- PyTorch, HuggingFace, PEFT/LoRA proficiency
- Experience with beam search decoding, language model fusion, and N-best rescoring
- Familiarity with ASR evaluation (WER, CER, word-level analysis with jiwer or similar)
- Comfortable training on multi-GPU setups (A100/A10G)
- Strong software engineering: testable code, version control discipline, reproducible experiments
- Strong with speaker embeddings: x-vectors, ECAPA-TDNN, wespeaker
- Audio signal processing fundamentals: resampling, spectral analysis, noise reduction, VAD
- Experience working with imperfect real-world audio (background noise, reverberation, variable mic quality)
- Strong Python engineering and experiment tracking
- Self-directed — you are the sole speech expert on the team
Nice to have
- Experience with Qwen3-ASR or the qwen-asr Python package
- Prior domain adaptation for ASR (legal, medical, financial)
- Experience with CTC forced alignment or confidence estimation for generative models
- Familiarity with vLLM or model serving frameworks
- pyannote.audio 3.x (powerset training, overlap detection)
- Legal or courtroom audio experience