Guide
Process reward models explained
A student asks Harbor Analytics’ math tutor: “If 3x + 7 = 22, what is x?” The model writes four steps and gets the right answer (x = 5), but step two divides by 3 without dividing every term, which is invalid algebra (dividing both sides in full would have been fine). An outcome reward model (ORM) grades only the final line: correct, full reward. A process reward model (PRM) scores each intermediate step and flags step two as invalid algebra even though the lucky arithmetic still lands on 5.
That distinction is why PRMs matter for chain-of-thought reasoning: they provide dense supervision where errors actually occur, enable search over partial solutions at inference time, and complement outcome-reward training (the usual GRPO setup), which rewards correct finals without teaching valid intermediate logic. OpenAI’s process supervision work (2023) and the reasoning-model wave (o1, DeepSeek-R1) put the question of step-level supervision back in focus.
This guide covers PRM vs ORM architecture, how to collect step labels, using PRMs for best-of-N and tree search, training with process supervision in reinforcement learning, a Harbor Analytics tutor worked example, a method decision table, common pitfalls, and a production checklist.
Outcome vs process reward models
Both are small classifiers (often fine-tuned from the same base LLM family) that output a scalar score. The difference is what they grade.
Outcome reward model (ORM)
- Input — full prompt plus complete model response (all reasoning steps and final answer).
- Output — single score: how good is this entire completion?
- Training signal — human preference pairs, correctness labels on finals, or RLHF rankings.
- Strength — cheap to deploy when only the answer matters (classification, short QA).
- Weakness — cannot distinguish a correct answer reached by valid logic from a lucky guess or hidden error.
Process reward model (PRM)
- Input — prompt plus a prefix of the reasoning trace up to step k.
- Output — score for whether step k is correct/valid given prior steps.
- Training signal — human annotators label each step positive (+) or negative (−); or synthetic labels from verifiers.
- Strength — localizes errors; supports beam search and MCTS over reasoning trees; reduces reward hacking on finals alone.
- Weakness — expensive labels; step segmentation ambiguity; calibration drift on out-of-distribution reasoning styles.
Think of ORM as grading the essay’s conclusion. PRM grades every paragraph. For STEM tasks where intermediate mistakes invalidate the reasoning even when arithmetic recovers, PRM supervision is strictly more informative.
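The difference in inputs can be made concrete with a small sketch. The four-step trace below is a hypothetical rendering of the tutor example, and `orm_inputs` / `prm_inputs` are illustrative helper names, not part of any library. The sketch only builds the text each model would see; the scoring models themselves are out of scope.

```python
from typing import List

PROMPT = "If 3x + 7 = 22, what is x?"

# Hypothetical four-step trace; step 2 divides only one term by 3.
STEPS = [
    "Step 1: Start from 3x + 7 = 22.",
    "Step 2: Divide by 3: x + 7 = 22.",
    "Step 3: Subtract 7 from the original equation: 3x = 15.",
    "Step 4: x = 5.",
]


def orm_inputs(prompt: str, steps: List[str]) -> List[str]:
    """An ORM sees one input: the prompt plus the complete response."""
    return ["\n".join([prompt, *steps])]


def prm_inputs(prompt: str, steps: List[str]) -> List[str]:
    """A PRM sees one input per step: the prompt plus the prefix up to step k."""
    return ["\n".join([prompt, *steps[:k]]) for k in range(1, len(steps) + 1)]


print(len(orm_inputs(PROMPT, STEPS)))  # number of ORM scoring calls
print(len(prm_inputs(PROMPT, STEPS)))  # number of PRM scoring calls
```

The ORM gets a single chance to judge the whole completion, while the PRM is queried once per prefix, which is what lets it single out step 2.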
How PRMs are trained
A PRM is typically a language model with a classification head (or next-token prediction of +/− step markers) trained on step-labeled chain-of-thought data.
Label collection strategies
- Human step labels — annotators mark each line of a worked solution as correct or incorrect; gold standard but slow ($1–5 per problem at scale).
- Model-assisted labeling — strong teacher model proposes steps; humans correct only disagreements; reduces cost 3–10x.
- Verifier-generated labels — for math, sympy or a CAS checks whether step k follows from step k−1; for code, partial execution or type checkers mark invalid transitions.
- MCTS bootstrapping — search with a weak PRM, keep trajectories where verifier confirms all steps; iteratively improve the PRM (AlphaZero-style for reasoning).
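As a rough illustration of verifier-generated labels, the sketch below checks linear-equation steps with exact arithmetic from the standard library. It is a stand-in for a real CAS such as sympy, under strong assumptions: every step is already parsed into a tuple `(a, b, c)` meaning `a*x + b = c`, and a step counts as valid when it has the same solution as the original problem. The `Equation` type and both function names are hypothetical.

```python
from fractions import Fraction
from typing import List, Optional, Tuple, Union

Number = Union[int, Fraction]
Equation = Tuple[Number, Number, Number]  # (a, b, c) meaning a*x + b = c


def solution(eq: Equation) -> Optional[Fraction]:
    """Exact solution of a*x + b = c, or None when a == 0."""
    a, b, c = eq
    if a == 0:
        return None
    return Fraction(c - b) / Fraction(a)


def label_steps(problem: Equation, steps: List[Equation]) -> List[bool]:
    """Label each step True if it preserves the solution of the original problem."""
    target = solution(problem)
    if target is None:
        raise ValueError("problem is not a linear equation in x")
    return [solution(step) == target for step in steps]


# Tutor example: the second step divides only the 3x term and the 22 by 3.
trace = [(3, 7, 22), (1, 7, Fraction(22, 3)), (3, 0, 15), (1, 0, 5)]
print(label_steps((3, 7, 22), trace))
```

Comparing every step against the original problem, rather than against the previous step, is a design choice made here so that a single bad step does not mark all later steps as invalid. A production verifier would need a parser and a far richer notion of "follows from".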
Segmentation: what counts as a step?
PRM quality depends on consistent step boundaries. Common conventions:
- One line per step (newline-delimited).
- Sentences after “Step N:” markers enforced in the SFT prompt template.
- AST-level steps for code (one statement per step).
Mixed granularity poisons training: if step 3 sometimes includes two logical operations, the PRM learns spurious correlations. Lock the format in SFT before collecting PRM labels.
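One way to enforce a locked format is to reject any trace that deviates from it before it reaches annotators. The sketch below assumes the “Step N:” convention with one step per line and consecutive numbering; the function name and the strictness of the check are illustrative choices, not a standard.

```python
import re
from typing import List

STEP_RE = re.compile(r"^Step (\d+):\s*(.+)$")


def split_steps(trace: str) -> List[str]:
    """Split a trace into step bodies, failing loudly on any format violation."""
    lines = [line.strip() for line in trace.splitlines() if line.strip()]
    steps = []
    for expected, line in enumerate(lines, start=1):
        match = STEP_RE.match(line)
        # Reject unmarked lines and out-of-order step numbers.
        if match is None or int(match.group(1)) != expected:
            raise ValueError(f"line {expected} breaks the step format: {line!r}")
        steps.append(match.group(2))
    return steps


print(split_steps("Step 1: Subtract 7 from both sides.\nStep 2: Divide both sides by 3."))
```

Failing loudly, rather than guessing a boundary, keeps mixed-granularity traces out of the labeling queue.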
Loss and calibration
Train with binary cross-entropy per step. At inference, aggregate step scores along a path by multiplying them (equivalently, summing log-scores) or by taking the minimum step score (the bottleneck heuristic). Calibrate on a held-out set with human adjudication, because uncalibrated PRMs over-reject valid creative proofs. Pair with LLM-as-judge spot checks for drift monitoring.
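The per-step loss and the path aggregation choices can be sketched in plain Python. This is a minimal illustration that assumes the PRM emits a probability in [0, 1] for each step; real training would use a deep-learning framework, and the function names are hypothetical.

```python
import math
from typing import Sequence


def step_bce(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over the steps of one trace (labels are 0 or 1)."""
    eps = 1e-12  # clamp to avoid log(0)
    losses = [
        -(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps)))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


def path_score(step_probs: Sequence[float], how: str = "min") -> float:
    """Aggregate step scores into one path score."""
    if how == "min":  # bottleneck heuristic: the weakest step decides
        return min(step_probs)
    if how == "product":  # treats step scores as independent probabilities
        return math.prod(step_probs)
    raise ValueError(f"unknown aggregation: {how}")


scores = [0.9, 0.2, 0.95]
print(path_score(scores, "min"), path_score(scores, "product"))
```

For the scores above, the minimum is 0.2 and the product is roughly 0.17, while a plain average would be about 0.68 and would hide the weak second step.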
Using PRMs at inference time
PRMs shine when you spend extra compute at query time to search for better reasoning paths.
Best-of-N with step filtering
- Sample N full completions from the policy.
- Score each step with the PRM; discard traces with any step below threshold t, or rank by product of step scores.
- Return the highest-ranked surviving completion (or majority-vote final answers among survivors).
Compared to ORM best-of-N, PRM filtering removes “lucky wrong logic” completions before they pollute the vote.
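The filter-then-vote procedure above might look like the following sketch. It assumes each candidate has already been sampled and scored, so a candidate is just a final answer plus its per-step PRM scores. The 0.4 default mirrors the threshold used in the worked example later in this guide; the tie-break by score product is an added assumption.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (final answer, per-step PRM scores)


def select_answer(candidates: List[Candidate], threshold: float = 0.4) -> Optional[str]:
    """Drop traces with any weak step, then majority-vote among survivors."""
    survivors = [
        (answer, scores)
        for answer, scores in candidates
        if scores and min(scores) >= threshold
    ]
    if not survivors:
        return None  # caller decides on a fallback, e.g. unfiltered voting
    votes = Counter(answer for answer, _ in survivors)
    top = max(votes.values())
    tied = {answer for answer, count in votes.items() if count == top}
    # Assumed tie-break: prefer the tied answer backed by the strongest trace.
    best = max((c for c in survivors if c[0] in tied), key=lambda c: math.prod(c[1]))
    return best[0]


samples = [
    ("5", [0.9, 0.8]),
    ("5", [0.7, 0.9]),
    ("4", [0.95, 0.3]),
    ("4", [0.9, 0.2]),
    ("4", [0.9, 0.1]),
]
print(select_answer(samples))
```

In this toy input, an unfiltered majority vote would pick "4" (three votes to two), but all three of those traces contain a step below the threshold, so only the "5" traces survive.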
Beam search and tree-of-thought
Generate b candidate next steps at each depth, score each extension with the PRM, keep top-k partial traces, repeat until an answer token or max depth. This is the inference backbone described in tree-of-thought papers and extended in modern test-time compute stacks. Beam width trades latency for accuracy; without a strong PRM, wide beams mostly resample fluent nonsense.
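A step-level beam search can be sketched with the policy and the PRM passed in as plain callables. Everything here is illustrative: `propose`, `score_step`, and `is_final` are hypothetical hooks standing in for model calls, and the path score uses the minimum step score, which is one of several reasonable aggregations.

```python
from typing import Callable, List, Optional, Tuple

Trace = Tuple[str, ...]


def prm_beam_search(
    propose: Callable[[Trace], List[str]],   # policy: candidate next steps
    score_step: Callable[[Trace], float],    # PRM: score of the last step in the trace
    is_final: Callable[[Trace], bool],       # True once the trace contains an answer
    beam_width: int,
    max_depth: int,
) -> Optional[Tuple[float, Trace]]:
    """Return the best finished (path score, trace), or None if nothing finishes."""
    beams: List[Tuple[float, Trace]] = [(1.0, ())]
    finished: List[Tuple[float, Trace]] = []
    for _ in range(max_depth):
        expanded: List[Tuple[float, Trace]] = []
        for path_score, trace in beams:
            for step in propose(trace):
                new_trace = trace + (step,)
                # Bottleneck scoring: a path is as good as its weakest step.
                new_score = min(path_score, score_step(new_trace))
                target = finished if is_final(new_trace) else expanded
                target.append((new_score, new_trace))
        if not expanded:
            break
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return max(finished, key=lambda item: item[0]) if finished else None


# Toy hooks for demonstration only.
def toy_propose(trace: Trace) -> List[str]:
    return ["good", "bad"] if len(trace) < 2 else ["answer"]


def toy_score(trace: Trace) -> float:
    return {"good": 0.9, "bad": 0.2, "answer": 0.95}[trace[-1]]


def toy_final(trace: Trace) -> bool:
    return trace[-1] == "answer"


print(prm_beam_search(toy_propose, toy_score, toy_final, beam_width=2, max_depth=3))
```

With a real policy, each `propose` call costs a generation and each `score_step` call costs a PRM forward pass, which is where the latency multiplier comes from.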
When PRM search is not worth it
- Short answers with cheap verifiers (regex, SQL EXPLAIN) — run the verifier once on the final output instead.
- Open-ended creative writing — no ground-truth steps; ORM or human preference models fit better.
- Latency-sensitive chat — PRM beam search can multiply token cost by 10–50x; reserve it for batch analytics or high-stakes tutoring.
PRMs in reinforcement learning training
Process supervision uses PRM scores as dense reward signals during RL fine-tuning. Outcome supervision (as in typical GRPO setups) bases the reward on the final answer alone.
Process supervision advantages
- Credit assignment — policy learns which step types fail, not just “whole trace bad.”
- Sample efficiency — one correct trace yields multiple positive step labels; one error yields a precise negative at the failure point.
- Less hacking — models cannot maximize ORM score with plausible-sounding wrong algebra if step PRM penalizes invalid moves.
Process supervision challenges
- Label cost — millions of step labels vs thousands of outcome labels for the same prompt set.
- PRM exploitation — policy learns phrasing that tricks the PRM without improving real reasoning (adversarial step templates).
- Distribution shift — PRM trained on human-written steps may fail on novel policy-generated notation after RL updates.
Production stacks often hybridize: outcome RL (GRPO/PPO) for exploration plus periodic PRM refresh on policy rollouts, or PRM-only filtering before ORM reward on survivors.
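One possible shape for a hybrid reward is sketched below: each step receives a dense reward derived from its PRM score, and the last step additionally receives an outcome bonus when the final answer is correct. The weights, the mapping of scores from [0, 1] to [-1, 1], and the function name are all assumptions made for illustration, not a recipe from any particular paper or library.

```python
from typing import List, Sequence


def step_rewards(
    prm_scores: Sequence[float],
    final_correct: bool,
    outcome_weight: float = 1.0,   # assumed weight on the outcome bonus
    process_weight: float = 0.5,   # assumed weight on the dense step signal
) -> List[float]:
    """Per-step rewards: dense PRM signal plus an outcome bonus on the last step."""
    # Map each PRM score from [0, 1] to [-1, 1] so bad steps are penalized.
    rewards = [process_weight * (2.0 * score - 1.0) for score in prm_scores]
    if rewards and final_correct:
        rewards[-1] += outcome_weight
    return rewards


# A trace with a clearly bad middle step but a correct final answer.
print(step_rewards([1.0, 0.0, 1.0], final_correct=True))
```

Under outcome-only supervision, this trace would receive a single positive reward at the end; the dense signal instead puts a negative reward exactly on the failing step.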
Worked example: Harbor Analytics math tutor
Harbor Analytics ships a calculus homework helper. The team compares ORM-only and PRM-augmented inference on 500 held-out problems:
- Baseline — GPT-4o fine-tuned with CoT SFT; ORM best-of-8 on final answer; 71% accuracy on MATH-500 subset.
- PRM training — 12k problems with human step labels (contract annotators, $0.80/problem); PRM fine-tuned from same base; step format enforced as “Step k:” lines.
- PRM inference — best-of-8 generation, discard traces with any PRM step score < 0.4; majority vote on finals among 3–5 survivors per problem.
- Result — 79% accuracy (+8 pp) at 1.3x median latency (many samples filtered early). Beam-4 search hits 82% but 4.2x latency — deployed only for “exam mode” tier.
- Failure audit — 40% of remaining errors are PRM false negatives on valid alternative proofs (integration by parts vs substitution); team adds “OR override” if sympy verifies final and all prior steps parse, even when PRM score is borderline.
- Production rule — default chat uses ORM best-of-4; PRM filter activates when user enables “show work” or problem difficulty tag is olympiad; retrain PRM quarterly on policy rollouts with human spot-check of 200 steps.
The deployment pattern — PRM as optional quality tier, not universal overhead — matches how most teams balance cost and accuracy.
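The production rule and the verifier override from this example could be expressed as two small functions. The strategy names, the `exam_mode` flag, and the `borderline_margin` value are hypothetical; the source specifies only the 0.4 threshold, the tiers, and that the override applies to borderline scores when the final answer verifies and all steps parse.

```python
def choose_strategy(show_work: bool, difficulty: str, exam_mode: bool = False) -> str:
    """Pick an inference tier following the production rule in the example."""
    if exam_mode:
        return "prm_beam_4"
    if show_work or difficulty == "olympiad":
        return "prm_filtered_best_of_8"
    return "orm_best_of_4"


def accept_trace(
    min_step_score: float,
    final_verified: bool,
    all_steps_parse: bool,
    threshold: float = 0.4,
    borderline_margin: float = 0.1,  # assumed width of the "borderline" band
) -> bool:
    """Accept a trace on PRM score, or via verifier override when borderline."""
    if min_step_score >= threshold:
        return True
    borderline = min_step_score >= threshold - borderline_margin
    return borderline and final_verified and all_steps_parse


print(choose_strategy(show_work=False, difficulty="standard"))
print(accept_trace(0.35, final_verified=True, all_steps_parse=True))
```

Keeping the override narrow (borderline scores only, and only with a verified final) limits how much a weak verifier can undo the PRM filter.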
Method decision table
| Need | Best approach | Why |
|---|---|---|
| Grade helpfulness and tone | ORM / preference model (DPO, RLHF) | No objective steps; human rankings suffice. |
| Math/code with verifiable finals | Outcome RL (GRPO) + unit tests | Cheap scalar reward; no step labels needed. |
| Catch invalid intermediate logic | PRM + step labels or verifiers | ORM misses lucky wrong reasoning. |
| Inference-time accuracy boost | PRM-filtered best-of-N or beam search | Search guided by step scores beats blind sampling. |
| Cheapest correctness check | Deterministic verifier on final only | Sympy, test runner, SQL result compare. |
| Monitor production quality | LLM-as-judge on samples + PRM drift alerts | PRM scores on live traffic detect template hacking. |
| Train reasoning from scratch | Process supervision RL + periodic PRM refresh | Dense credit assignment; watch for PRM exploitation. |
| Latency-critical chat | Single-shot CoT, no PRM search | Beam search multiplies tokens and wall time. |
Common pitfalls
- Inconsistent step format — mixed newline and paragraph steps confuse the PRM; enforce a template in SFT first.
- Trusting PRM on alternative valid proofs — training data bias toward one solution path causes false negatives; add verifier override.
- ORM labels on PRM-filtered data only — survivorship bias; PRM may learn to agree with its own filtering loop.
- Wide beam without calibration — explores fluent incorrect branches; calibrate thresholds on human-adjudicated set.
- Ignoring PRM adversarial drift after RL — policy invents step markers that score high but mean nothing; refresh labels from new rollouts.
- Step labels from weak models — garbage step supervision propagates; audit 5–10% of verifier-generated labels manually.
- Summing uncorrelated step scores — one bad early step should cap path score; use min or cumulative product with floor.
- Deploying PRM search on non-reasoning tasks — wasted compute on retrieval QA where a single citation check suffices.
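The drift pitfall above can be watched with a very simple monitor. The sketch compares the mean per-step PRM score in a recent window against a baseline window and raises a flag when the gap exceeds a tolerance. The 0.1 tolerance and the use of a plain mean are assumptions; a real monitor would likely track distributions and segment by problem type.

```python
from statistics import mean
from typing import Sequence


def drift_alert(
    baseline_scores: Sequence[float],
    recent_scores: Sequence[float],
    tolerance: float = 0.1,  # assumed alert threshold on the mean shift
) -> bool:
    """Flag a shift in mean per-step PRM score between two windows."""
    if not baseline_scores or not recent_scores:
        raise ValueError("both windows need at least one score")
    return abs(mean(recent_scores) - mean(baseline_scores)) > tolerance


# A sudden jump toward near-perfect scores after an RL update is suspicious.
print(drift_alert([0.6, 0.7], [0.95, 0.97]))
```

An alert is only a prompt for human review of fresh rollouts; a rising mean can reflect real improvement as easily as template hacking.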
Production checklist
- Define canonical step delimiter in the SFT prompt before any PRM work.
- Collect at least 5k–20k step-labeled examples in your target domain.
- Hold out 15% for calibration; tune accept threshold on precision/recall tradeoff.
- Build deterministic verifier for finals as backstop (sympy, tests, SQL).
- Benchmark ORM best-of-N vs PRM-filtered best-of-N on fixed eval set.
- Measure p50/p95 latency impact before enabling PRM search in default path.
- Log per-step PRM scores in production for drift and hacking detection.
- Schedule quarterly PRM retrain on fresh policy rollouts with human spot checks.
- Document when PRM defers to verifier override (alternative proof paths).
- Pair PRM metrics with evaluation benchmarks on reasoning subsets (GSM8K, MATH, HumanEval).
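The threshold-tuning item in this checklist can be sketched as a small grid search over a human-adjudicated calibration set. The sketch assumes a step is accepted when its PRM score is at or above the threshold and that labels mark whether the step is truly valid. The precision target of 0.9 and the grid spacing are illustrative values.

```python
from typing import List, Optional, Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of 'accept step' decisions at a given threshold."""
    accepted = [score >= threshold for score in scores]
    tp = sum(a and y for a, y in zip(accepted, labels))
    fp = sum(a and not y for a, y in zip(accepted, labels))
    fn = sum((not a) and y for a, y in zip(accepted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    min_precision: float = 0.9,
    grid: Optional[List[float]] = None,
) -> Optional[float]:
    """Lowest threshold that meets the precision target, or None if none does."""
    grid = grid if grid is not None else [i / 20 for i in range(1, 20)]
    for threshold in sorted(grid):
        precision, _ = precision_recall(scores, labels, threshold)
        if precision >= min_precision:
            return threshold
    return None


held_out_scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
held_out_labels = [False, False, True, False, True, True]
print(pick_threshold(held_out_scores, held_out_labels))
```

Recall can only fall as the threshold rises, so the lowest threshold that meets the precision target is also the one with the highest recall among the qualifying candidates.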
Key takeaways
- Process reward models score each reasoning step; outcome reward models score only the full completion.
- PRMs catch invalid logic that still produces lucky correct answers — the core failure mode of ORM-only systems.
- Training needs step-level labels (human, verifier, or bootstrapped); step format consistency is non-negotiable.
- Inference uses PRMs to filter best-of-N samples or guide beam/tree search when extra latency is acceptable.
- RL stacks combine process supervision for credit assignment with outcome methods like GRPO; refresh PRMs to fight exploitation.
Related reading
- LLM chain-of-thought explained — step-by-step prompting and reasoning-native models
- GRPO explained — outcome-based RL without a critic network
- LLM test-time compute explained — best-of-N, beam search, and inference scaling
- Reinforcement learning explained — MDPs, reward shaping, and RLHF foundations