Outcome reward models explained
Harbor Support's post-SFT chatbot still produced two failure modes that demonstrations alone could not fix: answers that were technically correct but rude, and answers that were polite but evasive on refund policy. Pairwise annotators ranked 24,000 prompt–completion pairs; engineers trained an outcome reward model (ORM) — a scalar scorer over the full assistant reply — using Bradley-Terry loss on those preferences. The ORM powered a light PPO pass (KL-penalized against the SFT checkpoint) and a production best-of-8 reranker on high-stakes billing prompts. Human win-rate on a frozen eval set rose from 54% to 71%; median response length dropped 18% without a separate brevity penalty. The ORM did not grade reasoning steps — only the final message — which is exactly what outcome supervision means.
An outcome reward model maps a prompt plus complete assistant response to a single real-valued score reflecting human preference. It is the workhorse of classic RLHF: train the scorer on ranked pairs, then optimize the policy with PPO (or use the scorer at inference for reranking). Newer pipelines often skip the explicit ORM via DPO, but ORMs remain essential for best-of-N search, rejection sampling, offline evaluation, and hybrid stacks where a separate judge is cheaper than full policy RL. This guide covers ORM vs process reward models (PRMs), Bradley-Terry training math, architecture and data choices, calibration, overoptimization and reward hacking, PPO integration, inference-time reranking, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
ORM vs PRM: what gets scored
Both are reward models, but the supervision unit differs:
- Outcome reward model (ORM) — one scalar for the entire completion given the prompt. Labels come from pairwise comparisons of full answers (“A is better than B”) or pointwise ratings (1–5 stars mapped to scores). Used in RLHF PPO, best-of-N reranking, and offline ranking benchmarks.
- Process reward model (PRM) — a score at each reasoning step (chain-of-thought line, tool call, or proof step). Catches errors mid-trajectory and powers tree search during inference. Better for math and code where the path matters.
ORMs are simpler to label at scale: annotators read two finished replies and pick a winner. PRMs need step-level correctness labels, which are expensive but reduce “lucky wrong reasoning, right answer” failures. Many production stacks use an ORM for general chat alignment and a PRM only on reasoning-heavy routes.
Bradley-Terry pairwise training
Given a prompt x and two completions y_w (winner) and y_l (loser), the reward model r_θ is trained so that the preferred completion scores higher. The standard loss is the negative log-sigmoid of the score margin:
L = -log σ( r_θ(x, y_w) - r_θ(x, y_l) )
This is the Bradley-Terry model: the probability that y_w beats y_l is modeled as a logistic function of their reward difference.
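To make the loss concrete, here is a minimal sketch in plain Python. It operates on two already-computed scalar scores rather than on a real model, and the function names are illustrative, not taken from any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(r_winner: float, r_loser: float) -> float:
    """Negative log-likelihood that the winner beats the loser."""
    return -log_sigmoid(r_winner - r_loser)


def preference_probability(r_winner: float, r_loser: float) -> float:
    """P(winner beats loser) under the Bradley-Terry model."""
    return math.exp(log_sigmoid(r_winner - r_loser))
```

Only the score difference matters: adding the same constant to both rewards leaves the loss unchanged, which is one reason raw ORM scores are hard to compare across prompts.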
Extensions include:
- Margin loss — require a margin r_w - r_l > m, which guards against ties and near-ties in noisy data.
- Bradley-Terry with ties — three-outcome labels when annotators mark equivalent quality.
- Pointwise regression — MSE on 1–5 ratings; simpler but ignores comparison structure; often worse than pairwise on the same budget.
DPO reparameterizes the same preference structure without an explicit reward head — the ORM and DPO objective are dual views of pairwise alignment. Teams that need a standalone scorer for reranking still train an ORM even when policy updates use DPO.
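The duality can be written out explicitly. The sketch below uses the standard DPO reparameterization, with β as the KL coefficient and Z(x) as a per-prompt normalizer; both the policy π_θ and the reference π_ref are assumed to assign nonzero probability to each completion.

```latex
% Reward implied by the KL-regularized optimum (Z(x) depends only on the prompt):
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into L = -\log \sigma( r(x, y_w) - r(x, y_l) );
% the \beta \log Z(x) terms cancel in the difference:
L_{\text{DPO}} = -\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\right)
```

Because Z(x) cancels, the loss never needs an explicit reward head, which is why a DPO-trained policy does not come with a standalone scorer.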
Architecture: scalar head on a language backbone
Typical ORM design:
- Start from the same family as the policy (e.g. 7B instruct checkpoint) or a smaller encoder for cost.
- Concatenate prompt and completion with the chat template used at inference.
- Run a forward pass; take the hidden state at the last completion token (or mean-pool completion tokens only — ablate both).
- Project through a linear layer to a scalar reward; no softmax — rewards are unbounded reals, though some teams clip at inference.
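The pooling and projection steps above can be sketched without any deep learning framework. In this toy version the per-token hidden states are plain lists of floats standing in for the backbone's output, and `completion_mask` marks which tokens belong to the completion; both names and the data layout are assumptions for illustration.

```python
from typing import List, Sequence


def pool_completion(hidden: Sequence[Sequence[float]],
                    completion_mask: Sequence[int],
                    mode: str = "last") -> List[float]:
    """Reduce per-token hidden states to one vector over completion tokens only."""
    idx = [i for i, m in enumerate(completion_mask) if m]
    if not idx:
        raise ValueError("no completion tokens to pool")
    if mode == "last":
        # Hidden state at the final completion token.
        return list(hidden[idx[-1]])
    if mode == "mean":
        dim = len(hidden[0])
        return [sum(hidden[i][d] for i in idx) / len(idx) for d in range(dim)]
    raise ValueError(f"unknown pooling mode: {mode}")


def reward_head(pooled: Sequence[float],
                weight: Sequence[float],
                bias: float = 0.0) -> float:
    """Linear projection to an unbounded scalar (no softmax or sigmoid)."""
    return sum(p * w for p, w in zip(pooled, weight)) + bias
```

Keeping the pooling mode as a parameter makes the last-token versus mean-pool ablation a one-line change.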
Initialization matters: initialize the ORM backbone from the SFT checkpoint so token representations already encode instruction-following semantics; the scalar head itself is a new, randomly initialized layer. Training only the head for one epoch before unfreezing upper layers reduces early instability. Keep ORM and policy weights separate: sharing weights and training end-to-end with PPO invites reward hacking through representation collapse.
For long contexts, truncate consistently between train and serve. A reward model that sees 8k tokens at train time but 4k at serve time will mis-rank tail-heavy answers.
Preference data curation
ORM quality is almost entirely data quality. High-signal practices:
- Diverse prompt buckets — safety refusals, factual QA, creative writing, tool JSON, multilingual, edge-case policy. Stratify sampling so rare buckets are not drowned out.
- Controlled pair generation — sample losers from the current policy, an older checkpoint, and a stronger teacher model so pairs span difficulty. All-easy pairs teach little; all-hard pairs add label noise.
- Annotator agreement — discard pairs where two labelers disagree; track per-annotator drift. Inter-rater agreement below 70% usually means ambiguous guidelines, not a model problem.
- Position bias debiasing — show duplicated pairs in both A/B orders; drop pairs where the winner is always whichever answer appeared first.
- Length and sycophancy controls — without explicit debiasing, ORMs learn “longer is better” and “agree with user premise.” Include pairs where concise or corrective answers win.
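The agreement and position-bias filters in the list above can be sketched as a small classifier over the votes collected for one pair. The `Vote` record and the three outcome labels are hypothetical, chosen only for illustration; a real pipeline would carry more metadata.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Vote:
    annotator: str
    shown_first: str  # which completion was displayed first: "a" or "b"
    chosen: str       # which completion the annotator preferred: "a" or "b"


def classify_pair(votes: Sequence[Vote]) -> str:
    """Return 'keep', 'position_bias', or 'disagreement' for one pair."""
    if len({v.chosen for v in votes}) == 1:
        return "keep"
    # The label flips with display order: every vote picked the first-shown answer.
    if all(v.chosen == v.shown_first for v in votes):
        return "position_bias"
    return "disagreement"


def first_position_rate(votes: Sequence[Vote]) -> float:
    """Fraction of votes that chose the first-shown completion."""
    return sum(v.chosen == v.shown_first for v in votes) / len(votes)
```

A dataset-level `first_position_rate` far from 0.5 on order-balanced data is a hint that annotators are favoring one position, even when individual pairs pass the filter.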
Harbor Support's refactor added a “policy clarity” bucket: pairs where the winning answer cited the exact refund rule paragraph and the loser used vague corporate language. That bucket alone moved billing-route win-rate 9 points.
Calibration, overoptimization, and reward hacking
A reward model is a proxy for human judgment, not a perfect objective. Known failure modes:
- Overoptimization (Goodhart's law) — PPO pushes the policy past the point where rising ORM scores still track human preference; beyond it, reward keeps climbing while human preference falls. Plot validation human win-rate against training reward and stop when the human win-rate plateaus while reward keeps climbing.
- Reward hacking — the policy exploits ORM blind spots: markdown headers, fake citations, excessive bullet lists, or repeating the user's question. Red-team the ORM with adversarial completions before PPO.
- Distribution shift — ORM trained on 7B completions mis-scores 70B policy outputs. Refresh preference data from the current policy each RL round (iterative RLHF).
- Calibration drift — raw scores are not probabilities. Platt scaling or isotonic regression on a held-out human-labeled set helps compare scores across prompts for best-of-N thresholds.
Mitigations: KL penalty to the SFT reference during PPO (standard), early stopping on human eval, ensemble two ORMs trained on different data shards, and mixing a fraction of SFT loss (“PPO + SFT”) to anchor fluency.
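Early stopping on human eval can be sketched as a simple check over per-checkpoint curves. The function below is one possible rule, not a standard one: `patience` and `min_gain` are assumed tuning knobs, and the inputs are lists with one entry per evaluated checkpoint.

```python
from typing import Optional, Sequence


def pick_stop_checkpoint(rewards: Sequence[float],
                         win_rates: Sequence[float],
                         patience: int = 2,
                         min_gain: float = 0.005) -> Optional[int]:
    """Return the index of the checkpoint to keep, or None to continue training.

    Stops once human win-rate has not improved by min_gain for `patience`
    evaluations while the ORM reward is still higher than at the best checkpoint.
    """
    if len(rewards) != len(win_rates):
        raise ValueError("curves must have the same length")
    best_idx = 0
    for i in range(1, len(win_rates)):
        if win_rates[i] > win_rates[best_idx] + min_gain:
            best_idx = i
        elif i - best_idx >= patience and rewards[i] > rewards[best_idx]:
            # Reward keeps rising but humans stopped preferring the outputs.
            return best_idx
    return None
```

Returning the best checkpoint rather than the latest one matters here, because by the time the plateau is detected the policy may already be past the useful optimum.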
PPO integration and inference-time reranking
RLHF PPO loop
After ORM training, PPO samples completions from the policy, scores them with the frozen ORM, and updates policy weights to maximize expected reward minus β · D_KL(π_θ ‖ π_ref). Practical details:
- Clip reward at a high percentile to limit outlier gradients.
- Normalize rewards per batch (whitening) for stable advantage estimates.
- Use a critic (value head) on the same backbone or a smaller model.
- Keep learning rate 10–100x smaller than SFT; one to three PPO epochs per data batch is typical.
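The clipping, whitening, and KL-penalty steps can be combined in a small reward-shaping sketch. This is a sequence-level simplification under stated assumptions: production PPO implementations typically apply the KL term per token and often whiten advantages rather than raw rewards, and the default `clip_q` here is illustrative.

```python
import math
import statistics
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def shape_rewards(orm_scores: Sequence[float],
                  kl_estimates: Sequence[float],
                  beta: float = 0.04,
                  clip_q: float = 95.0,
                  eps: float = 1e-8) -> List[float]:
    """Clip high outliers, whiten per batch, then subtract the KL penalty."""
    if len(orm_scores) != len(kl_estimates):
        raise ValueError("need one KL estimate per score")
    cap = percentile(orm_scores, clip_q)
    clipped = [min(s, cap) for s in orm_scores]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)
    whitened = [(c - mean) / (std + eps) for c in clipped]
    return [w - beta * kl for w, kl in zip(whitened, kl_estimates)]
```

Clipping before whitening keeps a single extreme score from inflating the batch standard deviation and flattening every other reward.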
Best-of-N without policy RL
A cheaper alignment lever: sample N completions at temperature 0.7–1.0, score each with the ORM, and return the top-1 (or the top-k for human review). Harbor Support uses best-of-8 only on billing and fraud prompts where a wrong answer has regulatory cost; general chit-chat uses greedy decode from the DPO-tuned policy.
Cost scales linearly with N, so pair it with speculative decoding or a smaller draft model when latency matters.
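A best-of-N reranker reduces to a few lines once sampling and scoring are abstracted away. In this sketch `sample` and `score` are hypothetical callables standing in for the policy's generation call and the ORM's forward pass; neither corresponds to a specific library API.

```python
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8,
              top_k: int = 1) -> List[Tuple[float, str]]:
    """Sample n completions, score each, and return the top_k as (score, text)."""
    if n < 1 or top_k < 1:
        raise ValueError("n and top_k must be positive")
    candidates = [sample(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    # Highest score first; sorted() is stable, so ties keep sampling order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The loop makes the cost structure visible: n policy generations plus n ORM forward passes per request, which is why the technique is usually reserved for a high-stakes slice of traffic.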
Harbor Support ORM refactor (worked example)
Starting point: Mistral-7B SFT checkpoint (see instruction-tuning guide), 24k pairwise labels from three contractor teams plus 2k gold pairs from internal QA.
- ORM training — 7B backbone, reward head on last token, 2 epochs, lr 1e-5, batch 128 pairs, early stop on validation pairwise accuracy (held-out 2k pairs). Final accuracy 72.4%.
- PPO — 40k prompts, 4 rollouts each, KL coef 0.04, one epoch, stopped when human win-rate plateaued at 71% (reward still climbing at 78th percentile clip).
- Production reranker — best-of-8 on 12% of traffic (billing/fraud); ORM served on a dedicated GPU with 120ms p95 latency budget.
- Monitoring — weekly 500-prompt human audit; alert if ORM-human agreement drops below 65%.
They later added a light DPO pass on fresh preferences, using the PPO checkpoint as reference. The ORM remained in production for reranking because DPO does not expose a per-completion score at inference.
Technique decision table
| Approach | Best when | Trade-off |
|---|---|---|
| Outcome RM + PPO | Large preference set, need iterative policy improvement, classic RLHF stack | Complex infra; overoptimization risk; two models to maintain |
| Outcome RM + best-of-N | High-stakes subset of traffic, policy frozen, latency budget for N samples | Inference cost × N; no weight update |
| DPO / ORPO | Preference data available, want single-stage policy update, no reranker needed | No standalone scalar scorer; harder to combine with search |
| Process RM (PRM) | Math, code, multi-step tool chains; errors in reasoning path matter | Expensive step labels; overkill for short chat replies |
| LLM-as-judge | Rapid prototyping, rubric-based eval, no training budget | Judge bias, cost per call, weaker than trained ORM on domain |
| Constitutional / RLAIF | Scale labels with AI critic; principle-driven refusals | Principle drift; still often ends with ORM or DPO downstream |
Evaluation metrics
- Pairwise accuracy on held-out human labels (target 70%+ for production ORMs on in-domain data).
- Correlation with pointwise ratings (Spearman) on a separate rating set.
- Human win-rate of policy optimized or reranked by the ORM vs baseline — the metric that matters.
- Length-controlled win-rate — regress out length before comparing to catch verbosity hacking.
- Agreement with LLM-as-judge — useful for smoke tests, not a substitute for human eval.
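Pairwise accuracy, the first metric above, is easy to compute and worth reporting per bucket rather than only as a global number. The sketch below assumes each held-out record carries a bucket name plus the ORM scores of the human-preferred and human-rejected completions; that record layout is an illustration, not a fixed format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pairwise_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (winner_score, loser_score) pairs the ORM ranks correctly.

    Exact score ties count as incorrect.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    return sum(1 for w, l in pairs if w > l) / len(pairs)


def accuracy_by_bucket(
    records: Iterable[Tuple[str, float, float]]
) -> Dict[str, float]:
    """Pairwise accuracy per bucket from (bucket, winner_score, loser_score)."""
    grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for bucket, w, l in records:
        grouped[bucket].append((w, l))
    return {bucket: pairwise_accuracy(p) for bucket, p in grouped.items()}
```

A per-bucket breakdown can surface a weak safety or billing slice that a healthy global average would hide.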
Common pitfalls
- Template mismatch — ORM sees a different chat format than the policy; rankings become random.
- Training only on policy-generated pairs — the ORM then cannot reliably score human-written gold answers; include teacher and human references.
- Ignoring ties and noise — forcing a winner on ambiguous pairs injects label noise; use tie outcomes or soft labels.
- One global ORM for all locales — politeness norms differ; stratify or fine-tune per locale.
- No red-team before PPO — adversarial completions that game the ORM will dominate after RL.
- Confusing ORM accuracy with alignment — 75% pairwise accuracy can still produce harmful winners on safety prompts; bucket eval by risk.
- Skipping iterative data collection — one-shot ORM on static pairs goes stale as the policy moves.
Production checklist
- Define preference dimensions (helpfulness, honesty, harmlessness, brevity) in annotator guidelines.
- Collect stratified pairwise data with position-bias controls and tie handling.
- Match chat template and tokenizer exactly between ORM train and policy serve.
- Init ORM from SFT checkpoint; train reward head then optionally unfreeze layers.
- Track pairwise accuracy and calibration on held-out human labels.
- Red-team ORM with adversarial completions before any PPO.
- PPO with KL to reference; clip rewards; early-stop on human win-rate.
- Deploy best-of-N reranker only where ROI justifies N × inference cost.
- Monitor ORM-human agreement weekly; refresh preference data each RL round.
- Keep DPO or SFT anchor in the stack to limit overoptimization.
Key takeaways
- Outcome reward models assign one scalar score to a full completion, trained mainly via Bradley-Terry pairwise loss.
- They power classic RLHF PPO and inference-time best-of-N reranking when you need an explicit scorer.
- DPO optimizes preferences without a separate ORM, but many stacks keep an ORM for reranking and eval.
- Overoptimization and reward hacking are the main risks — KL penalties, human early-stop, and iterative data refresh are mandatory.
- Harbor Support combined a 72% accurate ORM with light PPO and selective best-of-8 to lift human win-rate from 54% to 71%.
Related reading
- RLHF explained — full alignment pipeline where ORMs sit between SFT and PPO
- DPO explained — preference learning without an explicit reward model
- Process reward models — step-level scoring when the reasoning path matters
- Instruction tuning and SFT — the checkpoint ORMs and PPO build on