Guide

LLM best-of-N sampling and rejection sampling explained

Harbor Analytics ships a Python exercise tutor: students submit a function stub, the model writes a solution, and a hidden unit-test harness grades it. A single greedy decode passed all tests only 62% of the time on their 400-problem eval set — failures clustered on off-by-one loops, wrong edge-case branches, and imports the sandbox forbids. Engineers kept the same SFT checkpoint but wrapped inference in best-of-16: sample sixteen completions at temperature 0.7, run the deterministic test suite on each, and return the first passing answer (or the highest-scoring partial if none pass). Pass rate rose to 88%; p95 latency grew from 1.1 s to 4.8 s — acceptable for an async grading flow, unacceptable for live chat. The win came from test-time compute, not from training a bigger model.

Best-of-N sampling generates multiple candidate completions for one prompt, scores them with a reward model, verifier, or rule, and returns the winner. Rejection sampling is the sequential cousin: draw samples until one passes an acceptance test (or exhaust a budget). Both trade latency and token cost for quality without weight updates. This guide covers the algorithms, scorer types (ORM, PRM, deterministic checks), parallel vs sequential budgets, distribution-preserving rejection in decoding, the Harbor Analytics refactor, a technique decision table versus self-consistency and PPO, pitfalls, and a production checklist.

Best-of-N: generate wide, pick one

The pattern is simple:

Sample N completions from the same prompt (usually stochastic decoding with temperature > 0).
Score each completion with a scorer S(prompt, completion).
Select argmax S or the first completion that passes a hard gate (e.g. all unit tests green).

Scorers fall into three buckets:

Deterministic verifiers — unit tests, JSON schema validation, SQL EXPLAIN, compiler success. Cheap, auditable, domain-specific.
Outcome reward models (ORMs) — scalar preference scores over full replies; see outcome reward models.
Process reward models (PRMs) — step-level scores for chain-of-thought; useful when the final token can look right while reasoning is wrong.

Best-of-N is embarrassingly parallel: candidates share the same prompt prefix, so batched inference amortizes prefill cost. The dominant expense is decode tokens × N. For chat products, cap N by intent: billing disputes might get N=8 with an ORM; autocomplete gets N=1.

Rejection sampling: draw until accepted

In classical statistics, rejection sampling draws from a proposal distribution and accepts samples with probability proportional to a target density. LLM pipelines use a pragmatic variant:

Draw a completion from the policy.
If it passes a verifier (or ORM score > threshold), return it.
Otherwise discard and repeat, up to M attempts.

This differs from best-of-N when you stop early: the first passing sample saves tokens if lucky draws are common. Expected cost is M × p_fail^(k-1) generations where p_pass is per-draw success rate. When p_pass is low (hard math proofs), fixed best-of-N with parallel batching usually beats serial rejection.

A separate meaning of “rejection sampling” appears in speculative decoding: a draft model proposes tokens; the target model accepts a prefix via randomized acceptance rules so the output distribution matches sampling from the target alone. That is a speed optimization, not a quality reranker — do not confuse the two in runbooks.

Scorer design: when rules beat neural judges

Harbor Analytics chose deterministic tests first because ORM scores on code were miscalibrated: the ORM preferred verbose, comment-heavy solutions that compiled but violated time limits. Their final stack:

Hard gate — must pass all unit tests within 2 s CPU.
Soft tie-break — among passers, shortest source length (proxy for simplicity).
ORM fallback — when zero passers, pick highest ORM among failures and flag “review suggested.”

Neural scorers shine when no cheap verifier exists: tone, policy compliance, persuasiveness. Pair ORMs with LLM-as-judge only on a held-out slice to detect reward hacking. For reasoning tasks, self-consistency (majority vote across chain-of-thought samples) is a special case of best-of-N where the scorer is answer equality, not a learned reward.

Scaling N: diminishing returns and budgets

If each independent sample has success probability p, the probability that at least one of N samples succeeds is 1 - (1-p)^N. Example: p=0.10, N=16 → 81% success; N=32 → 96%. Gains flatten; doubling N late in the curve is expensive.

Production teams usually:

Plot success vs N on a stratified eval set before picking a default.
Use adaptive N — start with 4, expand to 16 only if the ORM margin between top-2 is below a threshold.
Route easy prompts to N=1 via an intent classifier.
Account for scorer cost: running sixteen sandboxed test suites dominates LLM tokens on small functions.

This is a form of test-time compute: spend FLOPs at inference instead of training larger models. The tradeoff curve is measurable — treat N as a latency SLA knob, not a magic constant.

Harbor Analytics refactor (worked example)

Before: single greedy decode, 62% pass@1, 1.1 s p95.
After: best-of-16 @ T=0.7, test-gated selection, 88% pass@16, 4.8 s p95, ~9× mean output tokens billed.

Implementation notes that mattered:

Diverse seeds — identical failures across draws meant temperature was too low; 0.7 outperformed 0.3.
Shared prefill — one prompt embedding, sixteen decode streams in a continuous batch.
Sandbox isolation — each candidate ran in a fresh process; shared state caused flaky scores.
Telemetry — logged pass@k for k∈{1,4,8,16} weekly; pass@1 crept up after SFT refresh, reducing needed N.

They did not run PPO: verifier pass rate was already a crisp metric, and best-of-N delivered most of the lift. PPO remained on the roadmap for latency-sensitive paths where N must be 1.

Technique decision table

Approach	Prefer when	Avoid when
Best-of-N + verifier	Deterministic checks exist (code, SQL, schema); async latency OK	Verifier runtime exceeds user SLA; p_pass < 1% even at large N
Best-of-N + ORM	Subjective quality; ORM calibrated on production slice	ORM uncorrelated with human wins; reward hacking observed
Serial rejection sampling	High p_pass; early stop saves mean tokens	Low p_pass; parallel batching available
Self-consistency vote	Discrete answers; reasoning diversity helps	Open-ended prose; no extractable final answer
PPO / DPO policy update	Must hit quality at N=1; stable preference data	Quick iteration on scorer; verifier already gives clear signal
Chain-of-thought + PRM beam	Multi-step math; step errors visible	Short replies; PRM label noise high

Common pitfalls

Uncalibrated ORM — reranking amplifies judge blind spots; always anchor with human eval on the reranked slice.
Correlated samples — low temperature yields sixteen near-identical failures; diversity matters.
Hidden scorer cost — sandboxed tests or big PRM forward passes can exceed LLM bill.
Length hacking — ORMs often prefer verbosity; normalize or penalize tokens in the scorer.
Unsafe code execution — never run model output in the host process; use isolated runners with timeouts.
Reporting pass@1 only — stakeholders see inflated quality if metrics use best-of-N but prod ships pass@1.
Ignoring cache — identical prompts should reuse candidate pools within a session where policy allows.

Production checklist

Define the scorer (verifier, ORM, or hybrid) and its failure modes before choosing N.
Measure pass@k curves on a stratified eval set for k ∈ {1, 2, 4, 8, 16, 32}.
Pick default N from latency SLA and marginal pass-rate gain.
Batch candidate generation; share prompt prefill across streams.
Isolate verifier execution (timeouts, memory caps, no network egress).
Log per-request: N, scores, winner index, verifier latency, total tokens.
Route low-stakes traffic to N=1; reserve high N for flagged intents.
Re-evaluate when the base model or scorer version changes.
Document whether user-facing metrics are pass@1 or pass@N.
Pair inference-time search with offline DPO when N=1 must eventually match best-of-N quality.

Key takeaways

Best-of-N trades parallel decode cost for higher success without retraining.
Deterministic verifiers beat ORMs when cheap, auditable checks exist.
Harbor Analytics lifted Python exercise pass rate from 62% to 88% with best-of-16 and unit tests.
Rejection sampling (early stop) and speculative-decoding rejection are different mechanisms.
Plot pass@k and calibrate N to latency SLAs; report metrics honestly.