
Innodata India Private Limited

Senior LLM Engineer – RLHF & Alignment


Job Description

Role: Senior LLM Engineer – RLHF & Alignment

Experience: 5–8 years

Job mode: Hybrid (Noida)

Responsibilities:

  • Own and drive the full RLHF pipeline: data collection, reward model training, and preference/RL fine-tuning using PPO, DPO, GRPO, and RLAIF
  • Design and run Supervised Fine-Tuning (SFT) pipelines on open-weight models (LLaMA, Mistral, Qwen) as the foundation for RLHF
  • Build and train reward models that accurately capture human preferences from annotation data (a minimal loss sketch follows this list)
  • Design human feedback collection pipelines: labeling rubrics, annotator calibration, and preference dataset curation
  • Implement Constitutional AI and RLAIF techniques to reduce reliance on costly human annotation
  • Red-team models post-training, probing for jailbreaks, regressions, unsafe outputs, and alignment failures
  • Design and maintain evaluation benchmarks to measure alignment, safety, and capability before and after RL training
  • Optimize inference pipelines and runtimes (llama.cpp, vLLM, TensorRT) to serve aligned models efficiently at scale
  • Implement quantization and parameter-efficient fine-tuning strategies (INT4/INT8/FP8, LoRA, QLoRA) to deploy fine-tuned models on target hardware
  • Write and tune low-level C/C++ and Rust code for inference performance beyond what Python can deliver
  • Diagnose and resolve training instabilities, reward hacking, and production inference bugs under pressure
  • Stay at the frontier: read alignment and RL papers weekly and translate findings into working experiments
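
For the reward-modeling item above, here is a minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train reward models on preference annotations. It assumes scalar scores already produced by a reward model with a scalar head; the function name and input shapes are illustrative assumptions, not part of this role's stack.

    import torch
    import torch.nn.functional as F

    def pairwise_rm_loss(chosen_scores, rejected_scores):
        # Bradley-Terry objective: the annotator-preferred completion
        # should out-score the rejected one; -log sigmoid of the margin
        # penalizes inversions. Inputs are 1-D tensors of scalar scores.
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

In practice this loss is computed over batches of (chosen, rejected) pairs drawn from the curated preference dataset.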

Core Requirements and Technical Skills

  • Hands-on experience implementing RLHF end-to-end — not just using libraries, but understanding the mechanics
  • Deep familiarity with policy gradient methods: PPO stability, KL divergence constraints, reward shaping
  • Experience with Direct Preference Optimization (DPO) and its variants as an RLHF alternative (a minimal loss sketch appears after this list)
  • Understanding of reward hacking, Goodhart's Law, and mitigation strategies in RL training
  • Familiarity with RLAIF (RL from AI Feedback) and Constitutional AI approaches
  • Ability to design preference datasets and annotation rubrics that produce a high-quality reward signal
  • Experience diagnosing training instabilities: reward collapse, mode collapse, KL divergence blowup
  • Python as the primary language for all training, fine-tuning, and evaluation pipelines
  • Strong mathematical foundation: RL theory, probability, linear algebra, optimization — deep enough to derive loss functions and debug training dynamics
  • C and C++ for systems-level inference work, runtime contributions, and performance-critical paths
  • Experience with Rust for ML tooling
  • Familiarity with transformer architecture, attention, tokenization, and how post-training interacts with pretraining
  • Experience with distributed training frameworks for large-scale fine-tuning
  • Experience with vector databases such as FAISS or Milvus
  • Familiarity with retrieval-augmented generation (RAG) pipelines
  • Experience integrating LLMs with external tools, APIs, and agent-based systems
  • Exposure to Rapid Application Development (RAD) approaches for building and iterating AI solutions efficiently
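
For the DPO item above, here is a minimal sketch of the DPO loss (Rafailov et al., 2023), assuming per-sequence log-probabilities of the chosen and rejected completions have already been gathered under both the policy and a frozen reference model; the function name and beta value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards are beta-scaled log-ratios of policy to
        # reference; the KL constraint of classic RLHF is folded in here.
        chosen = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected = beta * (policy_rejected_logps - ref_rejected_logps)
        # Logistic loss on the margin pushes the policy to prefer the
        # chosen completion more strongly than the reference does.
        return -F.logsigmoid(chosen - rejected).mean()

Beta controls how far the policy may drift from the reference: small values keep updates conservative, mirroring the KL penalty in PPO-based RLHF.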

Job ID: 146438607
