RLHF — Training LLMs to Follow Instructions
Pretrained LLMs complete text — they don't follow instructions. RLHF (Reinforcement Learning from Human Feedback) turns a text completer into an assistant. Three stages: supervised fine-tuning on demonstrations, reward model training from human preference pairs, and PPO reinforcement learning to optimize for the reward model.
RLHF — Three-Stage Process
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# RLHF (Reinforcement Learning from Human Feedback)
# Used by: ChatGPT, Claude, Gemini, LLaMA-2-Chat
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# ── STAGE 1: SUPERVISED FINE-TUNING (SFT) ─────────────
# Take base pretrained model (GPT-4 base, LLaMA base)
# Fine-tune on high-quality demonstration data:
# Input: User instruction
# Output: Expert-written ideal response
# Result: Model learns to RESPOND like an assistant, not just COMPLETE text
sft_examples = [
    {
        "instruction": "Explain quantum computing in simple terms.",
        "response": "Quantum computing uses quantum mechanical phenomena like superposition..."
    },
    {
        "instruction": "Write a Python function to reverse a string.",
        "response": "def reverse_string(s: str) -> str:\n    return s[::-1]"
    },
]
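# ── A minimal SFT sketch with TRL's SFTTrainer ────────────────
# Hedged illustration: the base checkpoint, prompt template, and config values
# below are assumptions, not a prescription; adjust to your setup / TRL version.
from datasets import Dataset
from trl import SFTTrainer, SFTConfig

# Flatten each demonstration into a single "text" field (prompt + ideal response)
sft_dataset = Dataset.from_list([
    {"text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}"}
    for ex in sft_examples
])

sft_config = SFTConfig(output_dir="sft_model", num_train_epochs=3)  # illustrative values
# sft_trainer = SFTTrainer(
#     model="meta-llama/Llama-2-7b-hf",  # assumption: any causal-LM checkpoint
#     args=sft_config,
#     train_dataset=sft_dataset,
# )
# sft_trainer.train()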
# ── STAGE 2: REWARD MODEL TRAINING ────────────────────
# Humans compare pairs of responses for the SAME prompt
# Annotator picks: "Response A is better than Response B"
# Train a reward model on these pairs with a pairwise (Bradley–Terry) ranking loss:
# the chosen response should outscore the rejected one
# Output: scalar reward score R ∈ ℝ (higher = more preferred by humans)
preference_pairs = [
    {
        "prompt": "What is 2+2?",
        "chosen": "2+2 equals 4.",  # humans preferred
        "rejected": "The answer, after careful consideration of arithmetic first principles, is indeed 4.",  # verbose, less preferred
    },
    {
        "prompt": "How do I pick a lock?",
        "chosen": "I can't help with that as it may facilitate illegal activity.",  # safe refusal, preferred
        "rejected": "Step 1: Insert tension wrench...",  # unsafe, rejected by annotators
    },
]
# IMPLEMENTATION with TRL (Hugging Face's Transformer Reinforcement Learning library)
# pip install trl
from trl import RewardTrainer, RewardConfig
# reward_trainer = RewardTrainer(model=model, train_dataset=preference_data)  # trainer, not the reward model itself
# reward_trainer.train()  # fits the reward model on the preference pairs
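# ── What RewardTrainer optimizes, written out by hand ────────
# Sketch of the pairwise (Bradley–Terry) loss behind reward-model training:
# the reward of the chosen response should exceed the reward of the rejected one.
# The score tensors below are made-up numbers for illustration.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # loss = -log sigmoid(r_chosen - r_rejected), averaged over preference pairs
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

chosen_scores = torch.tensor([1.3, 0.7])    # reward model scores for "chosen"
rejected_scores = torch.tensor([0.2, 0.9])  # reward model scores for "rejected"
print(pairwise_reward_loss(chosen_scores, rejected_scores))  # lower when chosen outscores rejected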
# ── STAGE 3: RL with PPO (Proximal Policy Optimization) ──
# Use the trained Reward Model as the "reward function"
# Fine-tune the SFT model to maximize expected reward
# KL divergence penalty prevents the model from drifting too far from SFT baseline
# PPO loop (conceptual):
# 1. Sample prompt from dataset
# 2. Generate response with current policy model
# 3. Score response with reward model
# 4. Compute KL(policy || reference_SFT)
# 5. reward_adj = reward - beta * KL_penalty (prevents reward hacking)
# 6. Update policy using PPO gradient update
from trl import PPOTrainer, PPOConfig
ppo_config = PPOConfig(
    model_name="sft_model",
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=16,
    gradient_accumulation_steps=4,
    optimize_cuda_cache=True,
    ppo_epochs=4,
    kl_penalty="kl",      # penalize large policy deviations
    init_kl_coef=0.2,     # initial KL coefficient
    adap_kl_ctrl=True,    # adaptive KL control
)
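# ── Step 5 of the PPO loop above, as a function ───────────────
# Sketch for intuition (not TRL internals): how the reward-model score is
# combined with a per-token KL penalty against the frozen SFT reference model.
import torch

def kl_shaped_reward(reward: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     beta: float = 0.2) -> torch.Tensor:
    # Per-token KL estimate: log pi_theta(token) - log pi_ref(token)
    kl = policy_logprobs - ref_logprobs
    # Sequence-level adjusted reward: reward_adj = reward - beta * sum(KL)
    return reward - beta * kl.sum(dim=-1)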
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# DPO — Direct Preference Optimization (simpler alternative)
# Skips explicit reward model training and RL loop
# Directly optimizes on preference pairs using a clever loss
# Used by: LLaMA-3-Instruct, Zephyr, many open-source models
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
from trl import DPOTrainer, DPOConfig
dpo_config = DPOConfig(
    beta=0.1,           # temperature for DPO loss
    learning_rate=5e-4,
    num_train_epochs=3,
)
# dpo_trainer = DPOTrainer(model, ref_model, args=dpo_config, train_dataset=preference_data)
# dpo_trainer.train()
# DPO loss:
# L = -E[log σ(β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)))]
# Maximizes likelihood of chosen (y_w) relative to rejected (y_l), measured against the reference model
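# ── The DPO loss above, in plain PyTorch ──────────────────────
# Sketch for intuition, not TRL's internal implementation. Inputs are assumed to
# be summed log-probabilities of the chosen (y_w) and rejected (y_l) responses
# under the current policy (pi_theta) and the frozen reference model (pi_ref).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1) -> torch.Tensor:
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log(pi_theta / pi_ref) for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log(pi_theta / pi_ref) for y_l
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()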
Tip
Practice RLHF (training LLMs to follow instructions) in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of an RLHF-style pipeline (SFT, reward model, preference optimization) from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake when working through RLHF (training LLMs to follow instructions) is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.