
Innodata India Private Limited

Senior LLM Engineer – RLHF & Alignment


Job Description

Role: Senior LLM Engineer – RLHF & Alignment

Experience: 5–8 years

Job mode: Hybrid (Noida)

Responsibilities:

  • Own and drive the full RLHF pipeline: data collection, reward model training, and preference/RL fine-tuning using PPO, DPO, GRPO, and RLAIF
  • Design and run Supervised Fine-Tuning (SFT) pipelines on open-weight models (LLaMA, Mistral, Qwen) as the foundation for RLHF
  • Build and train reward models that accurately capture human preferences from annotation data (a minimal loss sketch follows this list)
  • Design human feedback collection pipelines: labeling rubrics, annotator calibration, and preference dataset curation
  • Implement Constitutional AI and RLAIF techniques to reduce reliance on costly human annotation
  • Red-team models post-training, probing for jailbreaks, regressions, unsafe outputs, and alignment failures
  • Design and maintain evaluation benchmarks to measure alignment, safety, and capability before and after RL training
  • Optimize inference pipelines and runtimes (llama.cpp, vLLM, TensorRT) to serve aligned models efficiently at scale
  • Implement quantization and parameter-efficient fine-tuning strategies (INT4/INT8/FP8, LoRA, QLoRA) to deploy fine-tuned models on target hardware
  • Write and tune low-level C/C++ and Rust code for inference performance beyond what Python can deliver
  • Diagnose and resolve training instabilities, reward hacking, and production inference bugs under pressure
  • Stay at the frontier: read alignment and RL papers weekly and translate findings into working experiments
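
For the reward-modeling item above, here is a minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train reward models on preference annotations. It assumes scalar scores already produced by a reward model with a scalar head; the function name and input shapes are illustrative assumptions, not part of this role's stack.

    import torch
    import torch.nn.functional as F

    def pairwise_rm_loss(chosen_scores, rejected_scores):
        # Bradley-Terry objective: the annotator-preferred completion
        # should out-score the rejected one; -log sigmoid of the margin
        # penalizes inversions. Inputs are 1-D tensors of scalar scores.
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

In practice this loss is computed over batches of (chosen, rejected) pairs drawn from the curated preference dataset.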

Core Requirements and Technical Skills

  • Hands-on experience implementing RLHF end-to-end — not just using libraries, but understanding the mechanics
  • Deep familiarity with policy gradient methods: PPO stability, KL divergence constraints, reward shaping
  • Experience with Direct Preference Optimization (DPO) and its variants as an RLHF alternative (a minimal loss sketch appears after this list)
  • Understanding of reward hacking, Goodhart's Law, and mitigation strategies in RL training
  • Familiarity with RLAIF (RL from AI Feedback) and Constitutional AI approaches
  • Ability to design preference datasets and annotation rubrics that produce a high-quality reward signal
  • Experience diagnosing training instabilities: reward collapse, mode collapse, KL divergence blowup
  • Python as the primary language for all training, fine-tuning, and evaluation pipelines
  • Strong mathematical foundation: RL theory, probability, linear algebra, optimization — deep enough to derive loss functions and debug training dynamics
  • C and C++ for systems-level inference work, runtime contributions, and performance-critical paths
  • Experience with Rust for ML tooling
  • Familiarity with transformer architecture, attention, tokenization, and how post-training interacts with pretraining
  • Experience with distributed training frameworks for large-scale fine-tuning
  • Experience with vector databases such as FAISS or Milvus
  • Familiarity with retrieval-augmented generation (RAG) pipelines
  • Experience integrating LLMs with external tools, APIs, and agent-based systems
  • Exposure to Rapid Application Development (RAD) approaches for building and iterating AI solutions efficiently
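
For the DPO item above, here is a minimal sketch of the DPO loss (Rafailov et al., 2023), assuming per-sequence log-probabilities of the chosen and rejected completions have already been gathered under both the policy and a frozen reference model; the function name and beta value are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards are beta-scaled log-ratios of policy to
        # reference; the KL constraint of classic RLHF is folded in here.
        chosen = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected = beta * (policy_rejected_logps - ref_rejected_logps)
        # Logistic loss on the margin pushes the policy to prefer the
        # chosen completion more strongly than the reference does.
        return -F.logsigmoid(chosen - rejected).mean()

Beta controls how far the policy may drift from the reference: small values keep updates conservative, mirroring the KL penalty in PPO-based RLHF.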

Job ID: 146438607
