RLHF explained

A base language model trained on internet text predicts plausible continuations — not necessarily helpful, honest, or safe ones. It may complete harmful instructions, hallucinate citations, or ramble past the point a human would stop. Reinforcement learning from human feedback (RLHF) is the dominant post-training pipeline that steers models toward behaviors people actually want. Popularized by InstructGPT and ChatGPT, RLHF combines supervised demonstrations, pairwise human preferences, a learned reward model, and policy optimization (classically PPO) to align outputs with human judgment. Understanding RLHF explains why chatbots feel different from raw completions, why alignment is expensive, and why newer methods like DPO exist.

Why pretraining is not enough

Pretraining optimizes next-token likelihood on massive corpora. That objective rewards statistical mimicry: the model learns grammar, facts, coding patterns, and toxic rhetoric with equal indifference. A prompt like "write a phishing email" is a valid continuation task from the model's perspective.

Alignment layers a different objective on top: behave like a useful assistant under a system prompt. That requires teaching refusals, concise answers, instruction following, tone control, and calibrated uncertainty — properties weakly correlated with raw perplexity. RLHF (and related preference-learning methods) encode these goals through human judgments rather than hand-written loss functions for every edge case.

The result is not perfect safety — aligned models still hallucinate and remain vulnerable to prompt injection — but RLHF dramatically shifts the default behavior distribution toward assistant-like responses.

The classic three-stage RLHF pipeline

Production RLHF at frontier labs typically runs three sequential stages. Each builds on the last; skipping steps usually degrades final quality.

Stage 1: Supervised fine-tuning (SFT)

Human contractors write ideal responses to diverse prompts — summaries, coding tasks, refusals, multi-turn dialogues. The pretrained model is fine-tuned on these (prompt, response) pairs with standard cross-entropy loss, the same mechanism described in our LLM fine-tuning guide. SFT teaches the model the format of being an assistant: follow instructions, use markdown, decline harmful requests politely.
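As a rough illustration of the SFT objective, the sketch below computes cross-entropy over response tokens only. Masking out prompt tokens is a common convention that is assumed here, not something the pipeline requires, and the function and argument names are hypothetical.

```python
import math


def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens.

    token_log_probs: log-probability the model assigned to each target token.
    is_response_token: True where the token belongs to the demonstration
    response, False where it belongs to the prompt (assumed to be masked out).
    """
    kept = [-lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not kept:
        raise ValueError("no response tokens to train on")
    return sum(kept) / len(kept)


# Toy sequence: one prompt token followed by two response tokens.
example_log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
example_mask = [False, True, True]
print(sft_loss(example_log_probs, example_mask))
```

In a real training loop the log-probabilities would come from the model's forward pass and the loss would be averaged over a batch.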

SFT alone plateaus quickly. Contractors cannot cover every prompt distribution, and imitation learning averages over demonstrator quality — mediocre examples pull the policy toward mediocrity. That is where preferences enter.

Stage 2: Reward model training

For the same prompts, annotators compare two or more model-generated responses and pick the better one. These pairwise preferences train a separate reward model (often initialized from the SFT checkpoint) to assign a scalar score to any (prompt, response) pair.

A common formulation is the Bradley-Terry model: the probability response A beats B is modeled as a logistic function of the score difference. Training maximizes the likelihood of observed human rankings. The reward model becomes a differentiable proxy for "would a human prefer this?" — cheaper to query at scale than live annotators, though it inherits annotator biases and blind spots.
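The Bradley-Terry formulation above can be sketched in a few lines. This is a minimal illustration that takes reward scores as plain floats; in practice the scores come from a neural reward model and the loss is backpropagated through it. Function names are hypothetical.

```python
import math


def preference_probability(score_chosen, score_rejected):
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    diff = score_chosen - score_rejected
    # Numerically stable logistic function of the score difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    return math.exp(diff) / (1.0 + math.exp(diff))


def reward_model_loss(score_pairs):
    """Mean negative log-likelihood of the observed human preferences.

    score_pairs: list of (score_chosen, score_rejected) tuples.
    """
    losses = []
    for score_chosen, score_rejected in score_pairs:
        diff = score_chosen - score_rejected
        # -log(sigmoid(diff)) written in a stable softplus form.
        losses.append(max(-diff, 0.0) + math.log1p(math.exp(-abs(diff))))
    return sum(losses) / len(losses)


print(preference_probability(2.0, 0.0))
print(reward_model_loss([(2.0, 0.0), (0.5, 1.0)]))
```

The loss shrinks as the model scores the human-preferred response further above the rejected one, which is what "maximizing the likelihood of observed rankings" means here.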

Stage 3: Reinforcement learning with PPO

The policy is initialized from the SFT model. For each prompt, the policy samples a response, the reward model scores it, and the policy is updated to increase expected reward. In practice this uses proximal policy optimization (PPO), a policy-gradient algorithm that limits how far each update can move the policy, which keeps training stable.

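A critical detail: a KL divergence penalty anchors the policy to the SFT checkpoint. Without it, the model "games" the reward model, producing high-scoring but incoherent text (reward hacking). The penalty coefficient sets the tradeoff: a larger value keeps outputs fluent and close to the SFT model but limits reward gains, while a smaller value allows more reward at the risk of drifting away from fluent language.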
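One common way to apply the KL penalty is to subtract it from the reward model's score before the policy update. The sketch below assumes a simple sampled estimate of the KL term (the summed log-probability ratio over the response tokens) and a hypothetical coefficient name `kl_coef`; real implementations often apply the penalty per token.

```python
def kl_penalized_reward(reward_model_score, policy_log_probs,
                        reference_log_probs, kl_coef=0.1):
    """Reward model score minus a KL penalty toward the SFT reference.

    policy_log_probs / reference_log_probs: per-token log-probabilities of the
    sampled response under the current policy and the frozen SFT model.
    kl_coef: assumed penalty strength; larger values keep the policy closer
    to the reference.
    """
    if len(policy_log_probs) != len(reference_log_probs):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))
    return reward_model_score - kl_coef * kl_estimate


# The policy has become more confident than the reference on both tokens,
# so part of the reward is taken back.
print(kl_penalized_reward(2.0, [-1.0, -0.5], [-1.5, -1.0], kl_coef=0.1))
```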

PPO on billions of parameters is engineering-heavy: value networks, advantage estimation, distributed rollout workers, and careful hyperparameter tuning. That complexity motivated simpler preference-optimization alternatives.
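To make two of those moving parts concrete, here is a minimal sketch of generalized advantage estimation and the PPO clipped objective for a single action. It uses plain floats, and the default values for `gamma`, `lam`, and `clip_range` are illustrative assumptions; production systems run this over batched tensors with a learned value network.

```python
import math


def generalized_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one sampled response.

    rewards: per-token rewards (often zero until the final token).
    values: value-network predictions for each token position.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the final token is taken to be zero
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clipped_objective(new_log_prob, old_log_prob, advantage, clip_range=0.2):
    """PPO surrogate objective for one action (to be maximized)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to push the ratio past the clip.
    return min(ratio * advantage, clipped_ratio * advantage)


print(generalized_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))
```

The clip is what makes PPO "proximal": once the new policy's probability ratio leaves the allowed band in the direction that would raise the objective, the gradient for that sample vanishes.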

Preference data: what humans actually label

RLHF quality is bounded by preference data quality. Typical annotation guidelines rank responses on:

Helpfulness — does it solve the user's task?
Honesty — does it admit uncertainty instead of inventing facts?
Harmlessness — does it refuse dangerous instructions without being preachy?

Annotators are not uniform. Cultural context, expertise, and task difficulty shift rankings. Labs aggregate many labelers, use majority vote or learned annotator models, and continuously refresh datasets as failure modes emerge (jailbreaks, sycophancy, over-refusal).
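Majority-vote aggregation can be sketched as follows. The `min_agreement` threshold, which drops comparisons where labelers are too divided, is an assumption added for illustration rather than a standard setting.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Return the majority label for one comparison, or None if agreement is low.

    votes: labels such as "A" or "B", one per annotator.
    min_agreement: assumed minimum share of votes the winner must receive.
    """
    if not votes:
        return None
    winner, top_count = Counter(votes).most_common(1)[0]
    if top_count / len(votes) < min_agreement:
        return None
    return winner


print(aggregate_votes(["A", "A", "B"]))
print(aggregate_votes(["A", "B"]))
```

Discarding low-agreement comparisons trades dataset size for cleaner labels; learned annotator models are the heavier alternative mentioned above.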

Pairwise comparison is easier for humans than absolute 1–10 scoring — people are reliable at "which is better?" but poor at consistent cardinal ratings. The reward model learns from these relative judgments.

Beyond PPO: DPO, RLAIF, and constitutional AI

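RLHF's PPO stage is expensive and can be unstable. Research and production teams increasingly adopt methods that simplify or replace parts of the pipeline: DPO removes the separate reward model and RL loop, RLAIF swaps human labels for model-generated ones, and process supervision changes what gets rewarded.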

Direct Preference Optimization (DPO)

DPO reparameterizes the RLHF objective so the optimal policy can be learned with a simple classification loss on preference pairs, with no separately trained reward model and no PPO rollouts. You fine-tune the language model so that preferred responses gain likelihood relative to rejected ones, measured against a frozen reference model (usually the SFT checkpoint). The KL constraint is implicit in the loss, with a coefficient (commonly written β) controlling how far the policy may move from the reference. DPO is cheaper, more reproducible, and often matches PPO quality on mid-scale experiments.
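For a single preference pair, the DPO loss can be sketched as below. Inputs are summed log-probabilities of each full response under the trainable policy and under a frozen reference model; the default `beta` is an illustrative assumption, and the function name is hypothetical.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a whole response.
    beta scales the implicit KL constraint toward the reference model.
    """
    chosen_log_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_log_ratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(margin)) in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss starts at log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The structure mirrors the Bradley-Terry loss used for reward models, with the log-probability ratio against the reference playing the role of the reward.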

RLAIF (AI feedback)

When human labeling bottlenecks, a stronger model can rank outputs from a weaker one — reinforcement learning from AI feedback. Constitutional AI (Anthropic's approach) goes further: models critique and revise their own drafts against written principles before preference learning. AI feedback scales faster but risks amplifying the teacher model's blind spots unless grounded with human spot-checks.
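A rough sketch of AI-feedback labeling with human spot-checks might look like this. The `judge` callable stands in for a call to a stronger model and is entirely hypothetical, as are the `audit_fraction` parameter and the output field names.

```python
import random


def build_ai_preferences(examples, judge, audit_fraction=0.1, seed=0):
    """Label response pairs with an AI judge and flag a sample for human review.

    examples: iterable of (prompt, response_a, response_b).
    judge: hypothetical callable returning True when response_a is preferred.
    audit_fraction: assumed share of labels routed to human spot-checks.
    """
    rng = random.Random(seed)
    labeled = []
    for prompt, response_a, response_b in examples:
        a_preferred = judge(prompt, response_a, response_b)
        chosen, rejected = (
            (response_a, response_b) if a_preferred else (response_b, response_a)
        )
        labeled.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "needs_human_audit": rng.random() < audit_fraction,
        })
    return labeled


def toy_judge(prompt, response_a, response_b):
    """Placeholder judge that prefers the shorter response."""
    return len(response_a) <= len(response_b)


pairs = [("Define KL divergence.", "A short answer.", "A much longer, rambling answer.")]
print(build_ai_preferences(pairs, toy_judge, audit_fraction=1.0))
```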

Process supervision and outcome supervision

For reasoning tasks (math, code), rewarding only the final answer encourages lucky guesses. Process supervision labels each reasoning step, training models to produce verifiable chains. Outcome-only RLHF works for chat tone; process labels matter when intermediate errors compound.
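The difference shows up clearly on a solution that reaches the right answer through a wrong step. The toy sketch below assumes per-step correctness labels are available and uses an arbitrary +1/-1 scheme for process rewards; real process reward models produce learned scores instead.

```python
def outcome_rewards(step_labels, final_answer_correct):
    """Outcome supervision: only the final step receives any signal."""
    rewards = [0.0] * len(step_labels)
    if rewards:
        rewards[-1] = 1.0 if final_answer_correct else 0.0
    return rewards


def process_rewards(step_labels):
    """Process supervision: every step is rewarded or penalized on its own.

    The +1 / -1 values are an illustrative assumption.
    """
    return [1.0 if step_ok else -1.0 for step_ok in step_labels]


# A lucky guess: step 2 is wrong but the final answer happens to be right.
steps = [True, False, True]
print(outcome_rewards(steps, final_answer_correct=True))
print(process_rewards(steps))
```

Under outcome supervision the flawed chain is indistinguishable from a sound one; process labels expose the bad step.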

Failure modes alignment does not fix

RLHF shifts distributions; it does not grant understanding or ground truth. Common post-alignment failures:

Sycophancy — the model agrees with mistaken user premises because agreeable responses scored well in training.
Reward hacking — verbose, confident, or formatted answers beat concise correct ones if annotators conflate length with quality.
Over-refusal — safety tuning makes the model decline benign requests that resemble blocked categories.
Shallow preferences — the reward model scores surface politeness but misses factual errors or subtle harms.
Distribution shift — behaviors at deployment (tool use, long agent loops) differ from static chat prompts in training data.
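One inexpensive check for the length-driven form of reward hacking is to measure how often the preferred response in a dataset is simply the longer one. The sketch below is a hypothetical diagnostic that counts whitespace-separated words; a rate far above 0.5 suggests, but does not prove, a length bias.

```python
def longer_chosen_rate(pairs):
    """Share of preference pairs in which the chosen response is longer.

    pairs: list of (chosen_text, rejected_text).
    Pairs of equal word count are ignored. Returns None if nothing is comparable.
    """
    longer = 0
    comparable = 0
    for chosen, rejected in pairs:
        chosen_len = len(chosen.split())
        rejected_len = len(rejected.split())
        if chosen_len == rejected_len:
            continue
        comparable += 1
        if chosen_len > rejected_len:
            longer += 1
    if comparable == 0:
        return None
    return longer / comparable


sample = [("a b c", "a"), ("a", "a b"), ("x y", "z")]
print(longer_chosen_rate(sample))
```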

Mitigations layer on top of RLHF: red-teaming, adversarial training, retrieval grounding, tool-verified answers, runtime guardrails, and continuous eval suites — not a single alignment pass.

RLHF vs other post-training choices

Teams building on frontier APIs rarely run full RLHF themselves. The decision tree for custom models:

Prompting only — cheapest; sufficient when the model you start from is already instruction-tuned.
SFT on domain data — teaches vocabulary and task format; best ROI for specialized assistants (legal, medical summaries with disclaimers).
RLHF / DPO — when you need preference shaping beyond demonstrations (tone, safety tradeoffs, ranking multiple styles).
RAG — grounds answers in your documents; orthogonal to RLHF, often combined.

Open-weight models (Llama, Mistral, Qwen) ship with varying degrees of SFT and preference tuning. Fine-tuning an already-aligned checkpoint with DPO on your users' thumbs-up/down data is the most accessible "mini-RLHF" for product teams.
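Thumbs-up/down events are not preference pairs by themselves; one simple way to get pairs is to match a liked and a disliked response to the same prompt. The sketch below assumes a hypothetical log format with `prompt`, `response`, and `thumbs_up` fields.

```python
from collections import defaultdict


def pairs_from_feedback(events):
    """Build (prompt, chosen, rejected) records from thumbs-up/down logs.

    events: dicts with "prompt", "response", and boolean "thumbs_up" keys
    (an assumed schema). Prompts that lack either a liked or a disliked
    response produce no pairs.
    """
    by_prompt = defaultdict(lambda: {"up": [], "down": []})
    for event in events:
        bucket = "up" if event["thumbs_up"] else "down"
        by_prompt[event["prompt"]][bucket].append(event["response"])

    pairs = []
    for prompt, group in by_prompt.items():
        for chosen in group["up"]:
            for rejected in group["down"]:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


log = [
    {"prompt": "p", "response": "r1", "thumbs_up": True},
    {"prompt": "p", "response": "r2", "thumbs_up": False},
    {"prompt": "q", "response": "r3", "thumbs_up": True},
]
print(pairs_from_feedback(log))
```

Because exact prompt matches can be rare in real traffic, many teams rely on regenerations of the same prompt to obtain both sides of a pair.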

Production checklist for builders

Know what your base model already absorbed — repeating alignment work the vendor has already done wastes budget; read the model card and safety documentation first.
Collect real preference signals — implicit feedback (edits, regenerations, copy clicks) beats synthetic rankings when it is cleaned and interpreted carefully.
Evaluate on task outcomes — not just "sounds good"; use held-out prompts with verifiable answers where possible.
Monitor sycophancy and over-refusal — track refusal rates by category; sample conversations where users correct the model.
Pair alignment with grounding — RLHF does not replace RAG or citation requirements for factual products.
Assume jailbreaks will evolve — alignment raises the bar; runtime filters and tool sandboxing still matter for agentic apps.
Version your datasets — preference drift (new policies, new failure modes) requires retraining, not one-time alignment.
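The refusal-rate monitoring item in the checklist can be sketched as a small aggregation over logged conversations. The record schema (`category`, `refused`) is an assumption; how refusals are detected in the first place is left out.

```python
from collections import defaultdict


def refusal_rates(records):
    """Refusal rate per request category.

    records: dicts with a "category" string and a boolean "refused" flag
    (an assumed logging schema).
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        if record["refused"]:
            refusals[record["category"]] += 1
    return {category: refusals[category] / totals[category] for category in totals}


logged = [
    {"category": "medical", "refused": True},
    {"category": "medical", "refused": False},
    {"category": "cooking", "refused": False},
]
print(refusal_rates(logged))
```

A jump in the rate for a benign category after a model update is a quick signal of over-refusal worth sampling by hand.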

Key takeaways

RLHF aligns pretrained LLMs with human preferences through SFT, reward modeling, and policy optimization.
Pairwise comparisons train reward models that proxy human judgments at scale.
The KL penalty in the PPO stage discourages the policy from gaming the reward model into gibberish.
DPO and RLAIF simplify the pipeline: DPO learns from preferences without a reward model or RL rollouts, and RLAIF replaces scarce human labels with AI feedback.
Alignment reduces but does not eliminate hallucinations, injection attacks, and sycophancy.
Most product teams consume alignment via API vendors or open weights; custom DPO on user feedback is the practical on-ramp.