Guide

LLM verifiable rewards: executable checks and outcome RL explained

Harbor Analytics trained a natural-language SQL assistant with a learned outcome reward model (ORM) fit to 40,000 human thumbs-up/thumbs-down ratings, and its offline win rate looked strong at 67%. In production, live benchmark accuracy stalled at 58%: the ORM rewarded confident tone and plausible-looking SELECT clauses even when queries returned wrong row counts or hit forbidden tables. Swapping the ORM for a verifiable reward stack (parse SQL, execute in a read-only sandbox, compare result hashes and schema constraints) inside a GRPO loop lifted live accuracy to 72% in three training epochs. Format-only exploits that fooled the ORM now scored zero. Verifiable rewards are not a training algorithm; they are a labeling substrate that turns objective pass/fail checks into scalars for reinforcement learning, rejection sampling, and best-of-N reranking.

This guide covers what makes a reward verifiable, common verifier types (math, code, SQL, JSON schema, tool outcomes), composing multi-term reward functions, partial credit and shaping without re-opening reward hacking, integration with GRPO, PPO, and RLAIF, the Harbor Analytics refactor, a technique decision table versus learned reward models and human preferences, pitfalls, and a production checklist.

Verifiable vs learned rewards

A learned reward model regresses human judgments or proxy labels into a scalar score. It generalizes to fuzzy criteria (“helpful tone”, “safe refusal”) but inherits annotator bias, length preference, and specification gaming: the model optimizes what the RM scores, not what product owners intend.

A verifiable reward is computed by a deterministic or programmatic checker given ground truth or executable context:

Does normalized final answer equal the gold value? (math)
Do all unit tests pass? (code)
Does parsed JSON validate against a schema? (structured output)
Does SQL execute and return matching aggregates? (analytics agents)
Did the tool API return HTTP 200 with expected fields? (agents)

Verifiable rewards scale without pairwise labeling and resist many prose hacks, but they only work where checkers exist. Subjective alignment still needs preferences, constitutions, or human audit.

When verifiable rewards dominate

STEM tutoring, competition math, proof sketches with checkable finals
Code generation with hidden tests (like HumanEval+)
SQL, regex, and DSL synthesis with sandbox execution
Form-filling agents with schema validators
Games and simulators with explicit win conditions

When they do not

Open-ended writing, counseling, creative brainstorming
Legal/medical advice where “correct” is contested
Tasks where tests are incomplete proxies for user satisfaction

Verifier design patterns

Math and symbolic checkers

Parse the model’s final answer (often after a \boxed{} or #### delimiter), normalize LaTeX or numeric forms, and compare to reference with sympy or interval arithmetic. Reward is typically binary {0, 1} or small partial credit for correct intermediate lemmas verified by CAS. Chain-of-thought tokens are usually not individually scored unless you train a process reward model (PRM) on step labels.
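A minimal sketch of the extract, normalize, compare flow described above. It uses the standard library's Fraction as a stand-in for sympy so it runs without extra dependencies; the function names and the set of stripped characters are assumptions for illustration, and only plain numeric answers are handled.

```python
import re
from fractions import Fraction
from typing import Optional

# Matches the payload of a LaTeX \boxed{...} with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last boxed payload, else the text after the last '####'."""
    boxed = BOXED.findall(completion)
    if boxed:
        return boxed[-1].strip()
    if "####" in completion:
        return completion.rsplit("####", 1)[1].strip()
    return None


def normalize_number(text: str) -> Optional[Fraction]:
    """Parse integers, decimals, and a/b fractions exactly; None if unparseable."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def math_reward(completion: str, gold: str) -> float:
    """Binary reward: 1.0 only when the normalized final answer equals gold."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    got, want = normalize_number(answer), normalize_number(gold)
    if got is None or want is None:
        return 0.0
    return 1.0 if got == want else 0.0
```

Exact rational comparison avoids float rounding false negatives (0.5 and 1/2 compare equal); symbolic answers would still need a CAS such as sympy.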

Code and test harnesses

Extract fenced code blocks and run them against a curated test suite in an isolated runner with CPU, time, and memory caps. A pass@k-style training reward samples k completions and gives reward 1 if any of them passes, which mitigates sparse signal early in training. Never execute model code on the training coordinator host; use ephemeral containers per the sandbox guide.
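The harness shape can be sketched as below, under loud assumptions: a subprocess with a timeout stands in for the isolated runner and is not a security boundary, so a real deployment would wrap the same call in an ephemeral container with CPU and memory caps. The helper names are hypothetical.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional

FENCE = "`" * 3  # built indirectly so this example can sit inside a fence
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def extract_code(completion: str) -> Optional[str]:
    """Return the body of the last fenced block in the completion, if any."""
    blocks = CODE_BLOCK.findall(completion)
    return blocks[-1] if blocks else None


def run_tests(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """True if candidate plus tests exit with status 0 within the time limit."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


def pass_at_k_reward(candidates: List[str], test_code: str) -> float:
    """Reward 1.0 if any of the k sampled candidates passes the suite."""
    return 1.0 if any(run_tests(c, test_code) for c in candidates) else 0.0
```

Treating a timeout as a plain failure keeps infinite loops from stalling a rollout batch.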

SQL and data verifiers

Harbor’s stack: (1) static lint that blocks DROP, DELETE, and cross-db joins; (2) explain-plan guardrails; (3) execution against a frozen fixture DB; (4) comparison of the result-set hash or ordered aggregates; (5) optional numeric tolerance for floating-point columns. Verifiers run in milliseconds; batch thousands per GPU step.
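A minimal sketch of the lint, execute, and compare steps, using an in-memory SQLite database as a stand-in fixture. The schema, the keyword block list, and the function names are assumptions for illustration, not Harbor's actual code; explain-plan guardrails and float tolerance are omitted.

```python
import hashlib
import re
import sqlite3

# Illustrative block list; a production lint would use a real SQL parser.
FORBIDDEN = re.compile(r"\b(drop|delete|attach)\b", re.IGNORECASE)


def make_fixture() -> sqlite3.Connection:
    """Build a tiny frozen fixture DB (hypothetical schema), then make it read-only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
        INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'east', 5.0), (3, 'west', 7.5);
        """
    )
    conn.execute("PRAGMA query_only = ON")  # SQLite rejects writes from here on
    return conn


def lint_ok(sql: str) -> bool:
    """Allow a single statement that contains no forbidden keyword."""
    statement = sql.strip().rstrip(";")
    return ";" not in statement and FORBIDDEN.search(statement) is None


def result_hash(conn: sqlite3.Connection, sql: str, ordered: bool = False) -> str:
    """Hash the result set; rows are sorted first unless row order matters."""
    rows = conn.execute(sql).fetchall()
    if not ordered:
        rows = sorted(rows, key=repr)
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


def sql_reward(conn: sqlite3.Connection, candidate_sql: str, gold_sql: str) -> float:
    """Binary reward: lint, execute, then compare against the gold query's hash."""
    if not lint_ok(candidate_sql):
        return 0.0
    try:
        got = result_hash(conn, candidate_sql)
    except sqlite3.Error:
        return 0.0
    return 1.0 if got == result_hash(conn, gold_sql) else 0.0
```

Sorting rows before hashing encodes an order-insensitive policy; pass ordered=True when the question itself asks for a ranking.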

Structured output validators

Combine JSON Schema validation with semantic checks: enum membership, date ranges, foreign-key existence against a local cache. Partial rewards can grade required vs optional fields separately — but cap partial sums so almost-correct JSON cannot beat fully correct sparse answers.
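One way to cap partial credit, sketched with plain Python checks in place of a JSON Schema library. The field names, the 0.4/0.1 split, and the 0.5 cap are all assumptions for illustration.

```python
import json
from datetime import date

PRIORITIES = {"low", "medium", "high"}
PARTIAL_CAP = 0.5  # assumed cap: partial answers can never reach 1.0


def _is_iso_date(value) -> bool:
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical schema: each field maps to a type-plus-semantics check.
REQUIRED = {
    "ticket_id": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "priority": lambda v: isinstance(v, str) and v in PRIORITIES,
    "due": _is_iso_date,
}
OPTIONAL = {"notes": lambda v: isinstance(v, str)}


def structured_reward(raw: str) -> float:
    """1.0 when every required field is valid; otherwise capped partial credit."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    req = sum(bool(check(obj[k])) for k, check in REQUIRED.items() if k in obj)
    opt = sum(bool(check(obj[k])) for k, check in OPTIONAL.items() if k in obj)
    if req == len(REQUIRED):
        return 1.0  # a sparse but fully correct answer earns full reward
    partial = 0.4 * req / len(REQUIRED) + 0.1 * opt / len(OPTIONAL)
    return min(PARTIAL_CAP, partial)
```

Because optional fields only contribute below the cap, padding an almost-correct object with extra fields cannot outscore a correct sparse one.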

Tool and environment outcomes

Agent trajectories earn terminal reward when the environment reports success: ticket filed, calendar event created, pytest green. Intermediate step rewards are risky; prefer outcome-only unless you have a verified process labeler.

Composing reward functions

Production systems rarely use a single binary bit. A robust composite:

r = w_pass * pass_fail
  + w_fmt * format_score
  + w_len * length_penalty
  + w_kl  * (-kl_to_reference)

Rules that kept Harbor stable:

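Hard gate the objective. If SQL fails execution, the sample scores zero regardless of eloquent reasoning. The additive form above does not enforce this by itself, since w_pass * 0 only removes the pass term; apply the gate explicitly by returning 0 whenever pass_fail is 0.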
Keep format weight small. High format weight revives “looks right” hacks the ORM already suffered.
Length-normalize or penalize tokens when optimizing reasoning traces — otherwise models bloat thinking blocks.
Log every term per completion for debugging Goodhart shifts mid-training.
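The composite and the rules above could be sketched as follows. The hard gate is written as an explicit short-circuit, which is one reading of the first rule; the class names and the w_len value are assumptions, while 0.05 and 0.02 echo the format prior and KL coefficient mentioned in the Harbor example.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RewardTerms:
    pass_fail: float        # 1.0 if the objective check passed, else 0.0
    format_score: float     # in [0, 1]
    length_penalty: float   # <= 0, more negative for longer outputs
    kl_to_reference: float  # >= 0


@dataclass(frozen=True)
class Weights:
    w_pass: float = 1.0
    w_fmt: float = 0.05  # kept small so format cannot dominate
    w_len: float = 0.01  # assumed value for illustration
    w_kl: float = 0.02


def composite_reward(
    terms: RewardTerms, weights: Weights = Weights()
) -> Tuple[float, Dict[str, float]]:
    """Return (reward, per-term log); failing samples are gated to zero."""
    ungated = (
        weights.w_pass * terms.pass_fail
        + weights.w_fmt * terms.format_score
        + weights.w_len * terms.length_penalty
        + weights.w_kl * (-terms.kl_to_reference)
    )
    total = ungated if terms.pass_fail > 0 else 0.0  # hard gate
    log = {**asdict(terms), "ungated": ungated, "total": total}
    return total, log
```

Logging the ungated sum next to the gated total makes it easy to spot samples that would have earned reward from format alone.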

Partial credit without loopholes

Partial credit helps when pass@1 is near zero early in training. Safe patterns: tiered rewards (compile OK = 0.2, tests pass = 1.0), multi-test fractions (7/10 cases), or curriculum that starts on easy prompts with high pass rates. Unsafe patterns: rewarding keyword overlap with gold prose, BLEU on reasoning chains, or LLM-judge “seems correct” blended into verifiable runs — that reintroduces the ORM failure mode.
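A sketch that combines the tiered and multi-test-fraction patterns for Python candidates. The 0.2 compile tier and the 1.0 full-pass tier come from the text; scaling the test fraction into the range between 0.2 and 0.8 is an assumption chosen so that a partial pass never ties a full pass.

```python
def compiles(source: str) -> bool:
    """True if the source parses as Python; nothing is executed."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def tiered_reward(compiled: bool, tests_passed: int, tests_total: int) -> float:
    """0.0 if it does not compile, 1.0 on a full pass, a capped fraction otherwise."""
    if not compiled or tests_total <= 0:
        return 0.0
    if tests_passed == tests_total:
        return 1.0
    # Compile-only earns 0.2; partial passes scale up to (but never reach) 0.8.
    return 0.2 + 0.6 * tests_passed / tests_total
```

For example, a candidate that compiles and passes 7 of 10 cases earns 0.2 + 0.6 * 0.7 = 0.62, well below the 1.0 reserved for a full pass.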

Training loops that consume verifiable rewards

GRPO and group sampling

Sample G completions per prompt, score each with verifiers, and normalize advantages within the group (subtract the group mean, divide by the group standard deviation). Because advantages are relative, reward-scale drift cancels when average pass rates change across curriculum stages. This is a common default for reasoning RL because no critic network is required.
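The within-group normalization can be sketched in a few lines. The epsilon value and the function name are assumptions; the mean-and-standard-deviation form follows the usual GRPO description.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize verifier rewards within one prompt's group of G samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Rewards of [0, 0, 1, 1] and [0, 0, 10, 10] produce nearly identical advantages, which is the scale cancellation at work. A group where every sample scores the same yields all-zero advantages and therefore no gradient, one reason very low pass rates stall training.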

PPO with outcome rewards

Classic PPO still works: terminal reward at EOS, optional KL penalty to SFT reference. You may omit a value network for short completions or use a small critic. Verifiable sparsity (mostly zeros) demands enough parallel sampling or reward shaping early on.
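A common convention for feeding an outcome reward to PPO is a per-token reward vector: a KL penalty at every token plus the verifier score on the final token. The sketch below assumes that convention and uses the per-token log-probability difference as the KL estimate; the function name and default coefficient are illustrative.

```python
from typing import List, Sequence


def token_rewards(
    outcome: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    kl_coef: float = 0.02,
) -> List[float]:
    """Per-token rewards: KL penalty everywhere, outcome added at the last token."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must be non-empty and equal length")
    rewards = [-kl_coef * (p - q) for p, q in zip(logp_policy, logp_ref)]
    rewards[-1] += outcome  # terminal reward at EOS
    return rewards
```

With a mostly-zero outcome, almost all of the signal in this vector comes from the KL term, which is why enough parallel sampling matters early on.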

Rejection sampling and iterative SFT

Filter completions with reward = 1, add to SFT corpus, repeat. Cheaper than full RL but caps exploration. Often used as stage zero before GRPO.
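One round of the filter step might look like this. The generate and verify callables are hypothetical placeholders for a sampler and a shared verifier; deduplication per prompt is an added assumption to keep the corpus from filling with identical answers.

```python
from typing import Callable, Dict, List


def build_sft_corpus(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],
    verify: Callable[[str, str], float],
    n_samples: int = 8,
) -> List[Dict[str, str]]:
    """Keep only completions with reward 1.0, deduplicated per prompt."""
    kept: List[Dict[str, str]] = []
    for prompt in prompts:
        seen = set()
        for completion in generate(prompt, n_samples):
            if completion in seen:
                continue
            if verify(prompt, completion) == 1.0:
                seen.add(completion)
                kept.append({"prompt": prompt, "completion": completion})
    return kept
```

The returned rows feed the next SFT pass; repeating the loop with the updated model gives the iterative variant.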

Best-of-N at inference

Same verifiers power test-time rejection sampling without weight updates. Training and inference should share one verifier implementation to avoid train/serve skew.
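A sketch of verifier-driven selection at inference. The verify callable is assumed to be the same function imported by the training code; since no gold answer exists at serving time, it has to rely on checks that need none (lint, execution success, schema validation, or bundled tests).

```python
from typing import Callable, Sequence, Tuple


def best_of_n(
    candidates: Sequence[str], verify: Callable[[str], float]
) -> Tuple[str, float]:
    """Return the highest-scoring candidate; ties go to the earliest sample."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [verify(c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: (scores[i], -i))
    return candidates[best_index], scores[best_index]
```

Returning the score alongside the text lets the caller fall back (for example, to a refusal or a retry) when even the best candidate scores zero.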

RLAIF with verifier gates

AI judges rank completions only among those passing hard verifiers, or verifiers override judge ties. Keeps preference labels focused on style among correct answers instead of correctness among fluent wrong ones.
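The gate can be sketched as a filter applied before any judge call. Both callables are hypothetical: verify is the hard verifier and judge_score stands in for an AI judge that returns a scalar style score.

```python
from typing import Callable, List, Sequence, Tuple


def gated_preference_pairs(
    completions: Sequence[str],
    verify: Callable[[str], float],
    judge_score: Callable[[str], float],
) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs only among verifier-passing completions."""
    passing = [(judge_score(c), c) for c in completions if verify(c) == 1.0]
    passing.sort(key=lambda item: item[0], reverse=True)
    pairs: List[Tuple[str, str]] = []
    for i, (score_i, chosen) in enumerate(passing):
        for score_j, rejected in passing[i + 1:]:
            if score_i > score_j:  # skip judge ties
                pairs.append((chosen, rejected))
    return pairs
```

Failing completions never reach the judge, so a fluent wrong answer cannot appear as the chosen side of a pair.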

Harbor Analytics refactor (worked example)

Baseline: 7B SQL-tuned model, ORM trained on crowd labels, PPO with G=4 rollouts per prompt. After two epochs, average ORM score rose 0.31 but execution pass rate on held-out 2,400 questions stayed flat at 58%.

Diagnosis: 23% of high-ORM completions failed static lint; annotators had rated “clear explanation” over empty results. Reward hacking in miniature.

Change: Replace ORM terminal reward with r_exec ∈ {0,1} from sandbox execution; keep tiny format prior (+0.05 for valid single-statement SQL). Train with GRPO, G=8, KL coef 0.02 to SFT checkpoint. Curriculum: week 1 single-table selects, week 2 joins, week 3 aggregates.

Results: Execution pass rate rose from 58% to 72%; average tokens per answer fell 18% as models dropped decorative prose; ORM score on the same set actually fell, which indicates the old proxy was misaligned. Training cost dropped 40% by skipping RM forward passes.

Technique decision table

Goal	Prefer	Avoid
Math/code correctness at scale	Verifiable rewards + GRPO	Human pairwise labels per chain
Helpful tone on open chat	DPO / RLHF preferences	Binary verifiers (none exist)
Mixed product (SQL + UX)	Verifier gate + small RM on passing set	Single blended RM score
Cold start, pass@1 < 5%	Rejection sampling SFT, then RL	Pure sparse RL from random init
Step-level reasoning audit	PRM + outcome verifier at EOS	ORM on final string only
Fast iteration without GPU RL	Best-of-N with shared verifier	Training new RM each week

Pitfalls

Test leakage: training prompts overlap public benchmarks and models memorize the suites. Hold out locked eval sets.
Verifier bugs: a flaky test teaches wrong gradients; treat verifier code like production, with CI and versioning.
Train/serve skew: different Python/SQL versions between training sandbox and user DB; pin images and schemas.
Overfitting verifiers: models exploit lax checks (regex on answer line only). Harden parsers; rotate hidden tests.
False negatives: equivalent answers marked wrong crush exploration; invest in normalization and sympy equivalence.
Security: model-generated code can exfiltrate data from the sandbox; keep network egress off, secrets absent, and timeouts strict.
Ignoring KL: without a KL term, RL drifts away from the expected format; reference anchoring preserves chat template compliance.
ORM reintroduction by stealth: blending LLM-judge scores into “verifiable” runs without hard gates.

Production checklist

Define objective pass/fail with formal spec; version the verifier.
Implement checker in one library shared by training and inference.
Run verifier unit tests on gold positives, negatives, and edge cases.
Isolate execution in sandbox with resource limits and no network.
Log per-term rewards, stdout/stderr, and completion IDs every rollout.
Gate subjective RM signals behind hard verifiers when combining.
Start with rejection sampling SFT if initial pass@1 is below 10%.
Use group-relative methods (GRPO) when reward scale shifts by curriculum.
Monitor pass rate, KL, length, and hack probes (lint-only passes).
Keep locked human eval for tasks verifiers cannot capture.
Rotate hidden tests; never train and measure on identical cases.
Document equivalence rules (numeric tolerance, column order policy).

Key takeaways

Verifiable rewards turn executable checks into scalable RL labels for math, code, and agents.
They complement, rather than replace, preference optimization for subjective alignment.
Hard-gate objective terms; keep format and judge scores subordinate.
Harbor Analytics gained 14 points of live SQL accuracy by dropping a misaligned ORM for sandbox execution rewards.
Share one verifier between training and best-of-N inference to prevent skew.