Guide
LLM best-of-N sampling and rejection sampling explained
Harbor Analytics ships a Python exercise tutor: students submit a function stub, the model writes a solution, and a hidden unit-test harness grades it. A single greedy decode passed all tests only 62% of the time on their 400-problem eval set — failures clustered on off-by-one loops, wrong edge-case branches, and imports the sandbox forbids. Engineers kept the same SFT checkpoint but wrapped inference in best-of-16: sample sixteen completions at temperature 0.7, run the deterministic test suite on each, and return the first passing answer (or the highest-scoring partial if none pass). Pass rate rose to 88%; p95 latency grew from 1.1 s to 4.8 s — acceptable for an async grading flow, unacceptable for live chat. The win came from test-time compute, not from training a bigger model.
Best-of-N sampling generates multiple candidate completions for one prompt, scores them with a reward model, verifier, or rule, and returns the winner. Rejection sampling is the sequential cousin: draw samples until one passes an acceptance test (or exhaust a budget). Both trade latency and token cost for quality without weight updates. This guide covers the algorithms, scorer types (ORM, PRM, deterministic checks), parallel vs sequential budgets, distribution-preserving rejection in decoding, the Harbor Analytics refactor, a technique decision table versus self-consistency and PPO, pitfalls, and a production checklist.
Best-of-N: generate wide, pick one
The pattern is simple:
- Sample N completions from the same prompt (usually stochastic decoding with temperature > 0).
- Score each completion with a scorer S(prompt, completion).
- Select
argmax Sor the first completion that passes a hard gate (e.g. all unit tests green).
Scorers fall into three buckets:
- Deterministic verifiers — unit tests, JSON schema validation,
SQL
EXPLAIN, compiler success. Cheap, auditable, domain-specific. - Outcome reward models (ORMs) — scalar preference scores over full replies; see outcome reward models.
- Process reward models (PRMs) — step-level scores for chain-of-thought; useful when the final token can look right while reasoning is wrong.
Best-of-N is embarrassingly parallel: candidates share the same prompt prefix, so batched inference amortizes prefill cost. The dominant expense is decode tokens × N. For chat products, cap N by intent: billing disputes might get N=8 with an ORM; autocomplete gets N=1.
Rejection sampling: draw until accepted
In classical statistics, rejection sampling draws from a proposal distribution and accepts samples with probability proportional to a target density. LLM pipelines use a pragmatic variant:
- Draw a completion from the policy.
- If it passes a verifier (or ORM score > threshold), return it.
- Otherwise discard and repeat, up to M attempts.
This differs from best-of-N when you stop early: the first passing sample
saves tokens if lucky draws are common. Expected cost is
M × p_fail^(k-1) generations where p_pass is
per-draw success rate. When p_pass is low (hard math proofs),
fixed best-of-N with parallel batching usually beats serial rejection.
A separate meaning of “rejection sampling” appears in speculative decoding: a draft model proposes tokens; the target model accepts a prefix via randomized acceptance rules so the output distribution matches sampling from the target alone. That is a speed optimization, not a quality reranker — do not confuse the two in runbooks.
Scorer design: when rules beat neural judges
Harbor Analytics chose deterministic tests first because ORM scores on code were miscalibrated: the ORM preferred verbose, comment-heavy solutions that compiled but violated time limits. Their final stack:
- Hard gate — must pass all unit tests within 2 s CPU.
- Soft tie-break — among passers, shortest source length (proxy for simplicity).
- ORM fallback — when zero passers, pick highest ORM among failures and flag “review suggested.”
Neural scorers shine when no cheap verifier exists: tone, policy compliance, persuasiveness. Pair ORMs with LLM-as-judge only on a held-out slice to detect reward hacking. For reasoning tasks, self-consistency (majority vote across chain-of-thought samples) is a special case of best-of-N where the scorer is answer equality, not a learned reward.
Scaling N: diminishing returns and budgets
If each independent sample has success probability p, the probability that
at least one of N samples succeeds is 1 - (1-p)^N. Example:
p=0.10, N=16 → 81% success; N=32 → 96%. Gains
flatten; doubling N late in the curve is expensive.
Production teams usually:
- Plot success vs N on a stratified eval set before picking a default.
- Use adaptive N — start with 4, expand to 16 only if the ORM margin between top-2 is below a threshold.
- Route easy prompts to N=1 via an intent classifier.
- Account for scorer cost: running sixteen sandboxed test suites dominates LLM tokens on small functions.
This is a form of test-time compute: spend FLOPs at inference instead of training larger models. The tradeoff curve is measurable — treat N as a latency SLA knob, not a magic constant.
Harbor Analytics refactor (worked example)
Before: single greedy decode, 62% pass@1, 1.1 s p95.
After: best-of-16 @ T=0.7, test-gated selection, 88% pass@16,
4.8 s p95, ~9× mean output tokens billed.
Implementation notes that mattered:
- Diverse seeds — identical failures across draws meant temperature was too low; 0.7 outperformed 0.3.
- Shared prefill — one prompt embedding, sixteen decode streams in a continuous batch.
- Sandbox isolation — each candidate ran in a fresh process; shared state caused flaky scores.
- Telemetry — logged pass@k for k∈{1,4,8,16} weekly; pass@1 crept up after SFT refresh, reducing needed N.
They did not run PPO: verifier pass rate was already a crisp metric, and best-of-N delivered most of the lift. PPO remained on the roadmap for latency-sensitive paths where N must be 1.
Technique decision table
| Approach | Prefer when | Avoid when |
|---|---|---|
| Best-of-N + verifier | Deterministic checks exist (code, SQL, schema); async latency OK | Verifier runtime exceeds user SLA; p_pass < 1% even at large N |
| Best-of-N + ORM | Subjective quality; ORM calibrated on production slice | ORM uncorrelated with human wins; reward hacking observed |
| Serial rejection sampling | High p_pass; early stop saves mean tokens | Low p_pass; parallel batching available |
| Self-consistency vote | Discrete answers; reasoning diversity helps | Open-ended prose; no extractable final answer |
| PPO / DPO policy update | Must hit quality at N=1; stable preference data | Quick iteration on scorer; verifier already gives clear signal |
| Chain-of-thought + PRM beam | Multi-step math; step errors visible | Short replies; PRM label noise high |
Common pitfalls
- Uncalibrated ORM — reranking amplifies judge blind spots; always anchor with human eval on the reranked slice.
- Correlated samples — low temperature yields sixteen near-identical failures; diversity matters.
- Hidden scorer cost — sandboxed tests or big PRM forward passes can exceed LLM bill.
- Length hacking — ORMs often prefer verbosity; normalize or penalize tokens in the scorer.
- Unsafe code execution — never run model output in the host process; use isolated runners with timeouts.
- Reporting pass@1 only — stakeholders see inflated quality if metrics use best-of-N but prod ships pass@1.
- Ignoring cache — identical prompts should reuse candidate pools within a session where policy allows.
Production checklist
- Define the scorer (verifier, ORM, or hybrid) and its failure modes before choosing N.
- Measure pass@k curves on a stratified eval set for k ∈ {1, 2, 4, 8, 16, 32}.
- Pick default N from latency SLA and marginal pass-rate gain.
- Batch candidate generation; share prompt prefill across streams.
- Isolate verifier execution (timeouts, memory caps, no network egress).
- Log per-request: N, scores, winner index, verifier latency, total tokens.
- Route low-stakes traffic to N=1; reserve high N for flagged intents.
- Re-evaluate when the base model or scorer version changes.
- Document whether user-facing metrics are pass@1 or pass@N.
- Pair inference-time search with offline DPO when N=1 must eventually match best-of-N quality.
Key takeaways
- Best-of-N trades parallel decode cost for higher success without retraining.
- Deterministic verifiers beat ORMs when cheap, auditable checks exist.
- Harbor Analytics lifted Python exercise pass rate from 62% to 88% with best-of-16 and unit tests.
- Rejection sampling (early stop) and speculative-decoding rejection are different mechanisms.
- Plot pass@k and calibrate N to latency SLAs; report metrics honestly.
Related reading
- Outcome reward models explained — training scorers for reranking
- LLM test-time compute explained — scaling inference budgets broadly
- LLM self-consistency reasoning explained — majority vote as a scorer
- LLM speculative decoding explained — distribution-preserving rejection for speed