
Outcome reward models explained

Harbor Support's post-SFT chatbot still produced two failure modes that demonstrations alone could not fix: answers that were technically correct but rude, and answers that were polite but evasive on refund policy. Pairwise annotators ranked 24,000 prompt–completion pairs; engineers trained an outcome reward model (ORM) — a scalar scorer over the full assistant reply — using Bradley-Terry loss on those preferences. The ORM powered a light PPO pass (KL-penalized against the SFT checkpoint) and a production best-of-8 reranker on high-stakes billing prompts. Human win-rate on a frozen eval set rose from 54% to 71%; median response length dropped 18% without a separate brevity penalty. The ORM did not grade reasoning steps — only the final message — which is exactly what outcome supervision means.

An outcome reward model maps a prompt plus complete assistant response to a single real-valued score reflecting human preference. It is the workhorse of classic RLHF: train the scorer on ranked pairs, then optimize the policy with PPO (or use the scorer at inference for reranking). Newer pipelines often skip the explicit ORM via DPO, but ORMs remain essential for best-of-N search, rejection sampling, offline evaluation, and hybrid stacks where a separate judge is cheaper than full policy RL. This guide covers ORM vs process reward models (PRMs), Bradley-Terry training math, architecture and data choices, calibration, overoptimization and reward hacking, PPO integration, inference-time reranking, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

ORM vs PRM: what gets scored

Both are reward models, but the supervision unit differs:

Outcome reward model (ORM) — one scalar for the entire completion given the prompt. Labels come from pairwise comparisons of full answers (“A is better than B”) or pointwise ratings (1–5 stars mapped to scores). Used in RLHF PPO, best-of-N reranking, and offline ranking benchmarks.
Process reward model (PRM) — a score at each reasoning step (chain-of-thought line, tool call, or proof step). Catches errors mid-trajectory and powers tree search during inference. Better for math and code where the path matters.

ORMs are simpler to label at scale: annotators read two finished replies and pick a winner. PRMs need step-level correctness labels, which are expensive but reduce “lucky wrong reasoning, right answer” failures. Many production stacks use an ORM for general chat alignment and a PRM only on reasoning-heavy routes.

Bradley-Terry pairwise training

Given prompt x and two completions y_w (winner) and y_l (loser), the reward model r_θ is trained so the preferred completion scores higher. The standard loss is negative log-sigmoid of the score margin:

L = -log σ( r_θ(x, y_w) - r_θ(x, y_l) )

This is the Bradley-Terry model: the probability that y_w beats y_l is modeled as a logistic function of their reward difference. Extensions include:

Margin loss — require r_w - r_l > m for ties and near-ties in noisy data.
Bradley-Terry with ties — three-outcome labels when annotators mark equivalent quality.
Pointwise regression — MSE on 1–5 ratings; simpler but ignores comparison structure; often worse than pairwise on the same budget.
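
As a minimal sketch of the loss above, the following computes the Bradley-Terry loss for a single pair of scalar scores. The function names are illustrative, and the `margin` argument shows one common way to write a margin variant, -log σ(r_w - r_l - m); the exact margin formulation is an assumption, since the text does not spell it out.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # z >= 0: -log(1 + exp(-z));  z < 0: z - log(1 + exp(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(r_w: float, r_l: float, margin: float = 0.0) -> float:
    """-log sigma(r_w - r_l - margin); margin=0 is the standard pairwise loss."""
    return -log_sigmoid(r_w - r_l - margin)


def preference_probability(r_a: float, r_b: float) -> float:
    """Modeled probability that completion a beats completion b."""
    return math.exp(log_sigmoid(r_a - r_b))


# Equal scores mean the model is undecided, so the loss equals log(2).
undecided = bradley_terry_loss(0.0, 0.0)
# A wider score gap in the right direction lowers the loss.
confident = bradley_terry_loss(3.0, 0.0)
```

Only the score difference enters the loss, so adding a constant to every reward leaves training unchanged. That is one reason raw ORM scores are comparable within a prompt but not automatically across prompts.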

DPO reparameterizes the same preference structure without an explicit reward head — the ORM and DPO objective are dual views of pairwise alignment. Teams that need a standalone scorer for reranking still train an ORM even when policy updates use DPO.
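
The duality can be made concrete with a small sketch. DPO applies the same negative log-sigmoid loss to an implicit reward, β times the log-probability ratio between the policy and the reference. The inputs here are assumed to be sequence-level log-probabilities already summed over completion tokens, and the default `beta` is illustrative rather than a recommendation.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO's implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, logp_ref_w: float,
             logp_l: float, logp_ref_l: float, beta: float = 0.1) -> float:
    """Bradley-Terry loss applied to implicit rewards of winner and loser."""
    margin = (implicit_reward(logp_w, logp_ref_w, beta)
              - implicit_reward(logp_l, logp_ref_l, beta))
    # -log sigmoid(margin), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, both implicit rewards are zero
# and the loss is log(2), the same "undecided" value as for an ORM.
start = dpo_loss(-12.0, -12.0, -15.0, -15.0)
```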

Architecture: scalar head on a language backbone

Typical ORM design:

Start from the same family as the policy (e.g. 7B instruct checkpoint) or a smaller encoder for cost.
Concatenate prompt and completion with the chat template used at inference.
Run a forward pass; take the hidden state at the last completion token (or mean-pool completion tokens only — ablate both).
Project through a linear layer to a scalar reward; no softmax — rewards are unbounded reals, though some teams clip at inference.
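
The two pooling options and the scalar projection can be sketched without any framework. In this illustration hidden states are plain lists of floats and `completion_mask` marks completion tokens with 1; all names are hypothetical, and a real ORM would do the same operations on backbone tensors.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden: List[Vector], completion_mask: List[int]) -> Vector:
    """Hidden state at the last completion token (raises if there is none)."""
    last = max(i for i, m in enumerate(completion_mask) if m == 1)
    return hidden[last]


def mean_pool(hidden: List[Vector], completion_mask: List[int]) -> Vector:
    """Average hidden state over completion tokens only (prompt excluded)."""
    rows = [h for h, m in zip(hidden, completion_mask) if m == 1]
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


def reward_head(pooled: Vector, weight: Vector, bias: float = 0.0) -> float:
    """Linear projection to an unbounded scalar; no softmax or sigmoid."""
    return sum(p * w for p, w in zip(pooled, weight)) + bias


# Two prompt tokens followed by two completion tokens, hidden size 2.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
mask = [0, 0, 1, 1]
weight = [0.5, -1.0]

score_last = reward_head(last_token_pool(hidden, mask), weight)
score_mean = reward_head(mean_pool(hidden, mask), weight)
```

The two pooled scores differ on the same input, which is why the text suggests ablating both before committing to one.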

Initialization matters: initialize the ORM backbone from the SFT checkpoint so token representations already encode instruction-following semantics; the scalar reward head is a new layer and starts from a fresh initialization. Training only the head for one epoch before unfreezing upper layers reduces early instability. Keep ORM and policy weights separate: sharing weights and training end-to-end with PPO invites reward hacking through representation collapse.

For long contexts, truncate consistently between train and serve. A reward model that sees 8k tokens at train time but 4k at serve time will mis-rank tail-heavy answers.

Preference data curation

ORM quality is almost entirely data quality. High-signal practices:

Diverse prompt buckets — safety refusals, factual QA, creative writing, tool JSON, multilingual, edge-case policy. Stratify sampling so rare buckets are not drowned out.
Controlled pair generation — sample losers from the current policy, an older checkpoint, and a stronger teacher model so pairs span difficulty. All-easy pairs teach little; all-hard pairs add label noise.
Annotator agreement — discard pairs where two labelers disagree; track per-annotator drift. Inter-rater agreement below 70% usually means ambiguous guidelines, not a model problem.
Position bias debiasing — swap A/B order across duplicates; drop pairs where the first-shown answer wins in both orderings.
Length and sycophancy controls — without explicit debiasing, ORMs learn “longer is better” and “agree with user premise.” Include pairs where concise or corrective answers win.
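
Two of the curation checks above, the agreement filter and the position-bias check, can be sketched as follows. The `Judgment` record and its field names are hypothetical, and requiring at least two labels per pair is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Judgment:
    pair_id: str
    annotator: str
    first_shown: str  # "A" or "B": which completion was displayed first
    winner: str       # "A" or "B": which completion the annotator preferred


def agreed_pairs(judgments: List[Judgment]) -> Dict[str, str]:
    """Map pair_id -> winner, keeping only pairs with 2+ unanimous labels."""
    by_pair: Dict[str, List[str]] = {}
    for j in judgments:
        by_pair.setdefault(j.pair_id, []).append(j.winner)
    return {pid: ws[0] for pid, ws in by_pair.items()
            if len(ws) >= 2 and len(set(ws)) == 1}


def first_position_win_rate(judgments: List[Judgment]) -> float:
    """Share of judgments won by the first-shown completion.

    With randomized order and no position bias this should sit near 0.5.
    """
    hits = sum(1 for j in judgments if j.winner == j.first_shown)
    return hits / len(judgments)


labels = [
    Judgment("p1", "ann1", "A", "A"),
    Judgment("p1", "ann2", "B", "A"),  # same winner with order swapped
    Judgment("p2", "ann1", "A", "A"),
    Judgment("p2", "ann2", "B", "B"),  # winner follows position: suspicious
]
clean = agreed_pairs(labels)
bias = first_position_win_rate(labels)
```

A first-position win rate well above 0.5 on a large sample suggests annotators are favoring whichever answer appears first, and it is usually worth breaking the rate down per annotator.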

Harbor Support's refactor added a “policy clarity” bucket: pairs where the winning answer cited the exact refund rule paragraph and the loser used vague corporate language. That bucket alone moved billing-route win-rate 9 points.

Calibration, overoptimization, and reward hacking

A reward model is a proxy for human judgment, not a perfect objective. Known failure modes:

Overoptimization (Goodhart's law) — PPO pushes the policy past the point where ORM score and human preference rise together: ORM scores keep rising while human preference falls. Plot validation human win-rate against training reward; stop when humans plateau while reward keeps climbing.
Reward hacking — the policy exploits ORM blind spots: markdown headers, fake citations, excessive bullet lists, or repeating the user's question. Red-team the ORM with adversarial completions before PPO.
Distribution shift — ORM trained on 7B completions mis-scores 70B policy outputs. Refresh preference data from the current policy each RL round (iterative RLHF).
Calibration drift — raw scores are not probabilities. Platt scaling or isotonic regression on a held-out human-labeled set helps compare scores across prompts for best-of-N thresholds.
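
Platt scaling fits a two-parameter logistic map from raw score to probability. The sketch below is a plain gradient-descent version under stated assumptions: the held-out labels are binary "acceptable / not acceptable" human judgments per completion, and the learning rate and step count are arbitrary. In practice a library calibrator would normally replace this loop.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores: List[float], labels: List[int],
              lr: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by minimizing mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrated(score: float, a: float, b: float) -> float:
    """Estimated probability that a completion with this score is acceptable."""
    return sigmoid(a * score + b)


# Toy held-out set: higher raw scores are acceptable more often, with noise.
raw = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
ok = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(raw, ok)
```

A best-of-N threshold can then be expressed as a probability, for example "escalate to a human if no candidate exceeds some calibrated value", instead of as a raw score whose scale is arbitrary.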

Mitigations: KL penalty to the SFT reference during PPO (standard), early stopping on human eval, ensemble two ORMs trained on different data shards, and mixing a fraction of SFT loss (“PPO + SFT”) to anchor fluency.
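
The early-stopping rule can be written as a small check run after each human evaluation. This is a sketch: `patience` and `min_gain` are hypothetical knobs, both histories are assumed to be aligned per evaluation step, and the numbers in the example are illustrative rather than Harbor Support's.

```python
from typing import List


def should_stop(reward_history: List[float], win_rate_history: List[float],
                patience: int = 2, min_gain: float = 0.005) -> bool:
    """True when human win-rate has plateaued while ORM reward still rises.

    Plateau: the best win-rate in the last `patience` evals does not beat the
    best earlier win-rate by at least `min_gain`.
    """
    if len(win_rate_history) <= patience:
        return False  # not enough evaluations to judge a plateau
    best_before = max(win_rate_history[:-patience])
    recent_best = max(win_rate_history[-patience:])
    humans_plateaued = recent_best < best_before + min_gain
    reward_rising = reward_history[-1] > reward_history[-patience - 1]
    return humans_plateaued and reward_rising


rewards = [0.1, 0.4, 0.8, 1.2, 1.6]        # mean ORM reward per eval step
win_rates = [0.50, 0.58, 0.64, 0.64, 0.63]  # human win-rate per eval step

stop_now = should_stop(rewards, win_rates)
stop_early = should_stop(rewards[:3], win_rates[:3])
```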

PPO integration and inference-time reranking

RLHF PPO loop

After ORM training, PPO samples completions from the policy, scores them with the frozen ORM, and updates policy weights to maximize expected reward minus β · D_KL(π_θ ‖ π_ref). Practical details:

Clip reward at a high percentile to limit outlier gradients.
Normalize rewards per batch (whitening) for stable advantage estimates.
Use a critic (value head) on the same backbone or a smaller model.
Keep learning rate 10–100x smaller than SFT; one to three PPO epochs per data batch is typical.
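
The first two details, combined with the KL term, amount to a small reward-shaping step before advantages are computed. The sketch below assumes one sequence-level KL estimate per rollout and applies the penalty after whitening; implementations differ on that ordering and on per-token versus per-sequence penalties, so treat the order here as an assumption. The 95th-percentile clip is illustrative.

```python
import math
import statistics
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def shape_rewards(orm_scores: List[float], kl_estimates: List[float],
                  beta: float = 0.04, clip_q: float = 95.0) -> List[float]:
    """Clip outliers, whiten per batch, then subtract the KL penalty."""
    cap = percentile(orm_scores, clip_q)
    clipped = [min(s, cap) for s in orm_scores]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped) or 1.0  # guard against a constant batch
    whitened = [(c - mean) / std for c in clipped]
    return [w - beta * kl for w, kl in zip(whitened, kl_estimates)]


# One outlier score (100.0) would dominate the batch without clipping.
scores = [0.0, 1.0, 2.0, 100.0]
kls = [0.0, 0.0, 0.0, 1.0]
shaped = shape_rewards(scores, kls, beta=0.04, clip_q=75.0)
```

Because whitening rescales rewards to unit variance, the effective strength of a given β depends on whether the penalty is applied before or after normalization; it is worth fixing one convention and tuning β under it.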

Best-of-N without policy RL

Cheaper alignment lever: sample N completions at temperature 0.7–1.0, score each with the ORM, return the top-1 (or top-k for human review). Harbor Support uses best-of-8 only on billing and fraud prompts where a wrong answer has regulatory cost; general chit-chat uses greedy decode from the DPO-tuned policy. Cost scales linearly with N — pair with speculative decoding or a smaller draft model when latency matters.
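
A best-of-N reranker is short enough to sketch in full. The `generate` and `score` callables stand in for the policy sampler and the ORM; their signatures, and the toy stand-ins at the bottom, are hypothetical and exist only so the example runs without a model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str, float], str],
              score: Callable[[str, str], float],
              n: int = 8,
              temperature: float = 0.8,
              top_k: int = 1) -> List[Tuple[float, str]]:
    """Sample n completions, score each, return the top_k (score, text) pairs."""
    candidates = [generate(prompt, temperature) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy stand-ins (not a real policy or ORM).
CANNED = [
    "We cannot help with that.",
    "Refunds are available within 30 days; see section 4.2 of the policy.",
    "It depends.",
]


def toy_generate(prompt: str, temperature: float) -> str:
    return random.choice(CANNED)


def toy_score(prompt: str, completion: str) -> float:
    # Pretend the scorer prefers answers that cite a policy section.
    return 1.0 if "section" in completion else 0.0


winner = best_of_n("Can I get a refund?", toy_generate, toy_score, n=8)
```

Setting `top_k` above 1 returns a shortlist for human review instead of a single answer, matching the top-k option described above.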

Harbor Support ORM refactor (worked example)

Starting point: Mistral-7B SFT checkpoint (see instruction-tuning guide), 24k pairwise labels from three contractor teams plus 2k gold pairs from internal QA.

ORM training — 7B backbone, reward head on last token, 2 epochs, lr 1e-5, batch 128 pairs, early stop on validation pairwise accuracy (held-out 2k pairs). Final accuracy 72.4%.
PPO — 40k prompts, 4 rollouts each, KL coef 0.04, one epoch, stopped when human win-rate plateaued at 71% (reward still climbing at 78th percentile clip).
Production reranker — best-of-8 on 12% of traffic (billing/fraud); ORM served on a dedicated GPU with 120ms p95 latency budget.
Monitoring — weekly 500-prompt human audit; alert if ORM-human agreement drops below 65%.
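
A quick back-of-envelope check shows why restricting the reranker to 12% of traffic keeps cost manageable. The sketch assumes every sampled completion costs the same to generate and ignores the ORM scoring pass itself; both are simplifications.

```python
def average_sample_multiplier(rerank_fraction: float, n: int) -> float:
    """Average completions generated per request.

    Requests outside the reranked fraction generate 1 completion;
    requests inside it generate n.
    """
    return (1.0 - rerank_fraction) * 1 + rerank_fraction * n


# Best-of-8 on 12% of traffic: 0.88 * 1 + 0.12 * 8 = 1.84 samples per request.
selective = round(average_sample_multiplier(0.12, 8), 2)

# Best-of-8 on all traffic would be 8 samples per request.
everywhere = average_sample_multiplier(1.0, 8)
```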

They later added a light DPO pass on fresh preferences, using the PPO checkpoint as reference. The ORM remained in production for reranking because DPO does not expose a per-completion score at inference.

Technique decision table

Approach	Best when	Trade-off
Outcome RM + PPO	Large preference set, need iterative policy improvement, classic RLHF stack	Complex infra; overoptimization risk; two models to maintain
Outcome RM + best-of-N	High-stakes subset of traffic, policy frozen, latency budget for N samples	Inference cost x N; no weight update
DPO / ORPO	Preference data available, want single-stage policy update, no reranker needed	No standalone scalar scorer; harder to combine with search
Process RM (PRM)	Math, code, multi-step tool chains; errors in reasoning path matter	Expensive step labels; overkill for short chat replies
LLM-as-judge	Rapid prototyping, rubric-based eval, no training budget	Judge bias, cost per call, weaker than trained ORM on domain
Constitutional / RLAIF	Scale labels with AI critic; principle-driven refusals	Principle drift; still often ends with ORM or DPO downstream

Evaluation metrics

Pairwise accuracy on held-out human labels (target 70%+ for production ORMs on in-domain data).
Correlation with pointwise ratings (Spearman) on a separate rating set.
Human win-rate of policy optimized or reranked by the ORM vs baseline — the metric that matters.
Length-controlled win-rate — regress out length before comparing to catch verbosity hacking.
Agreement with LLM-as-judge — useful for smoke tests, not a substitute for human eval.
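
Pairwise accuracy is simple to compute, and reporting it per prompt bucket makes weak spots visible that a single global number hides. The tuple layout and bucket names below are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Example = Tuple[str, float, float]  # (bucket, ORM score of human winner, of loser)


def pairwise_accuracy_by_bucket(examples: Iterable[Example]) -> Dict[str, float]:
    """Fraction of held-out pairs where the ORM ranks the human winner higher."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for bucket, s_w, s_l in examples:
        total[bucket] += 1
        correct[bucket] += int(s_w > s_l)  # score ties count as misses
    return {b: correct[b] / total[b] for b in total}


held_out = [
    ("billing", 1.2, 0.3),
    ("billing", 0.1, 0.4),
    ("safety", 2.0, -1.0),
]
accuracy = pairwise_accuracy_by_bucket(held_out)
```

The same function can back the weekly ORM-human agreement audit by feeding it freshly labeled pairs and alerting when a bucket drops below the chosen threshold.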

Common pitfalls

Template mismatch — ORM sees a different chat format than the policy; rankings become random.
Training on policy-generated-only pairs — ORM cannot score human-written gold; include teacher and human references.
Ignoring ties and noise — forcing a winner on ambiguous pairs injects label noise; use tie outcomes or soft labels.
One global ORM for all locales — politeness norms differ; stratify or fine-tune per locale.
No red-team before PPO — adversarial completions that game the ORM will dominate after RL.
Confusing ORM accuracy with alignment — 75% pairwise accuracy can still produce harmful winners on safety prompts; bucket eval by risk.
Skipping iterative data collection — one-shot ORM on static pairs goes stale as the policy moves.

Production checklist

Define preference dimensions (helpfulness, honesty, harmlessness, brevity) in annotator guidelines.
Collect stratified pairwise data with position-bias controls and tie handling.
Match chat template and tokenizer exactly between ORM train and policy serve.
Init ORM from SFT checkpoint; train reward head then optionally unfreeze layers.
Track pairwise accuracy and calibration on held-out human labels.
Red-team ORM with adversarial completions before any PPO.
PPO with KL to reference; clip rewards; early-stop on human win-rate.
Deploy best-of-N reranker only where ROI justifies N x inference cost.
Monitor ORM-human agreement weekly; refresh preference data each RL round.
Keep DPO or SFT anchor in the stack to limit overoptimization.

Key takeaways

Outcome reward models assign one scalar score to a full completion, trained mainly via Bradley-Terry pairwise loss.
They power classic RLHF PPO and inference-time best-of-N reranking when you need an explicit scorer.
DPO optimizes preferences without a separate ORM, but many stacks keep an ORM for reranking and eval.
Overoptimization and reward hacking are the main risks — KL penalties, human early-stop, and iterative data refresh are mandatory.
Harbor Support combined a 72% accurate ORM with light PPO and selective best-of-8 to lift human win-rate from 54% to 71%.