LLM sampling and decoding strategies explained
A language model does not emit finished sentences. At each step it outputs a vector of logits — raw scores over every token in its vocabulary — and a decoding strategy picks one token to append. That choice, repeated hundreds of times, determines whether your assistant sounds crisp and factual, wanders into hallucination, or loops the same phrase forever. Temperature, top-p, repetition penalties, and beam search are not obscure research knobs; they are the primary levers between reliable JSON extraction and creative brainstorming. This guide explains how each decoder works, when to use it, and how sampling interacts with prompt design and inference serving.
From logits to the next token
After the transformer forward pass, the final linear layer produces one logit per vocabulary entry. Softmax converts logits into a probability distribution summing to 1.0. Decoding is the policy that selects (or searches over) tokens from that distribution, one step at a time, until a stop condition fires.
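As a minimal sketch of the softmax step described above, assuming a toy four-token vocabulary with made-up logits (the `softmax` helper and all values here are illustrative, not taken from any particular model):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1.0."""
    # Subtracting the max logit avoids overflow in exp() without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy vocabulary of four tokens.
vocab = ["cat", "dog", "the", "xylophone"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>10}: {p:.3f}")
```

Real vocabularies have tens of thousands of entries, but the arithmetic is the same: larger logits receive exponentially more probability mass.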
Each generated token is fed back as input for the next step — autoregressive generation. The context window grows with every token, so long outputs cost more compute and eventually hit the model limit. Decoding therefore affects not only quality but latency and dollar cost per request.
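The feedback loop can be sketched with a stand-in for the model. The bigram table below is hypothetical and only looks at the last token, whereas a real transformer conditions on the whole context; the loop structure is the part that carries over.

```python
# Hypothetical bigram "model": maps the last token to scores for the next token.
TOY_MODEL = {
    "<bos>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 2.0, "dog": 1.5},
    "a": {"cat": 1.0, "dog": 1.2},
    "cat": {"sat": 2.0, "<eos>": 0.5},
    "dog": {"sat": 1.0, "<eos>": 1.5},
    "sat": {"<eos>": 3.0},
}

def generate(max_new_tokens=10):
    """Autoregressive loop: each chosen token becomes input to the next step."""
    context = ["<bos>"]
    for _ in range(max_new_tokens):
        scores = TOY_MODEL[context[-1]]           # stand-in for a forward pass
        next_token = max(scores, key=scores.get)  # selection policy (argmax here)
        if next_token == "<eos>":                 # stop condition
            break
        context.append(next_token)                # feed the token back in
    return context[1:]

print(generate())
```

Every iteration is one forward pass, which is why output length drives both latency and cost.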
Important distinction: sampling parameters shape the distribution before selection; the selection algorithm (argmax, multinomial draw, beam expansion) runs afterward. Most APIs expose temperature, top-p, and penalties as separate fields — understanding the pipeline helps you debug "same prompt, wild variance" complaints from users.
Greedy decoding — deterministic argmax
Greedy decoding picks the highest-probability token at every step (argmax). It is fully deterministic given fixed weights and input: rerun the same prompt, get the same output.
Greedy is ideal when correctness beats variety:
- Structured extraction — JSON keys, SQL fragments, regex-friendly outputs.
- Classification-style tasks — sentiment labels, intent routing, yes/no gates.
- Regression testing — golden-file evals need reproducible generations.
The failure mode is degenerate repetition. Greedy favors high-frequency function words ("the", "and") and can enter repetitive cycles, because a chain of locally optimal tokens is not necessarily the most coherent sequence overall. That is why chat products rarely ship greedy defaults for open-ended dialogue.
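To make the looping behavior concrete, here is a small sketch with a hypothetical bigram table whose locally best continuations happen to form a cycle. The table and the `greedy` helper are invented for illustration.

```python
# Hypothetical next-token probabilities; the top choice at each step forms a cycle.
LOOPY_MODEL = {
    "I": {"think": 0.6, "know": 0.4},
    "think": {"that": 0.7, "so": 0.3},
    "that": {"I": 0.55, "it": 0.45},
    "know": {"it": 1.0},
    "so": {"<eos>": 1.0},
    "it": {"<eos>": 1.0},
}

def greedy(start, steps):
    """Always take the argmax continuation; no randomness is involved."""
    tokens = [start]
    for _ in range(steps):
        options = LOOPY_MODEL[tokens[-1]]
        best = max(options, key=options.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(" ".join(greedy("I", 8)))
```

Because argmax never takes the 0.45 branch to "it", the sequence cannot escape the cycle; any sampling strategy with nonzero randomness eventually would.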
Temperature — reshaping the probability curve
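Temperature divides logits by a scalar T before softmax. Values below 1 sharpen the distribution so the top token dominates; 0.1–0.3 is a typical low setting. Values above 1 flatten it so unlikely tokens gain probability mass. The settings usually called "high" (0.8–1.2) sit close to the model's native distribution at T = 1 and feel random mainly by comparison with low-temperature output.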
- T → 0 approaches greedy (exact zero is often special-cased to argmax).
- T = 1 uses the model's native distribution.
- T > 1 increases randomness; very high values produce incoherent soup.
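The effect of dividing by T can be checked numerically. This sketch assumes made-up logits for four tokens and a hypothetical helper name; it requires T > 0, since exact zero is usually handled as a separate argmax branch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply softmax. Assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 0.2 nearly all mass lands on the first token, while at T = 2.0 the least likely token gains a noticeable share.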
Rule of thumb by task type:
- Code and APIs: 0.0–0.2 for syntax fidelity.
- Factual Q&A with citations: 0.2–0.5 to reduce creative fabrication.
- Marketing copy and brainstorming: 0.7–1.0 for lexical variety.
- Creative fiction: 0.9–1.2 if you accept more plot risk.
Temperature alone does not prevent sampling from the long tail. A flat distribution at T=1.0 can still emit rare tokens. That is why production stacks combine temperature with top-p or top-k truncation.
Top-k and top-p (nucleus) sampling
Top-k sampling
Top-k keeps only the k highest-probability tokens, renormalizes their probabilities, and samples from that restricted set. Typical values are k=40–50 for chat; k=1 is equivalent to greedy.
Top-k is simple but brittle: when the distribution is sharp, k=50 still includes junk tokens with tiny mass. When the distribution is flat, k=50 may exclude reasonable continuations.
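A minimal top-k sketch, assuming a made-up five-token distribution. The helper names are hypothetical, and a fixed `random.Random` instance is used only to keep the draw repeatable.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize their mass."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(filtered, rng):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.04, 0.01]  # made-up distribution over five tokens
filtered = top_k_filter(probs, k=2)
print(filtered)
print(sample(filtered, random.Random(0)))
```

Note that k is fixed regardless of how the mass is spread, which is the brittleness described above.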
Top-p (nucleus) sampling
Top-p, also called nucleus sampling, sorts tokens by probability and keeps the smallest set whose cumulative probability exceeds p (e.g. p=0.9). The cutoff adapts to context: confident predictions use a tiny nucleus; uncertain predictions admit more candidates.
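The adaptive cutoff is easier to see with two contrasting distributions. Both are invented for illustration, and `top_p_filter` is a hypothetical helper rather than any library's function.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

confident = [0.95, 0.03, 0.01, 0.01]
uncertain = [0.30, 0.25, 0.20, 0.17, 0.05, 0.03]
print(len(top_p_filter(confident, 0.9)))  # nucleus size for a sharp distribution
print(len(top_p_filter(uncertain, 0.9)))  # nucleus size for a flat distribution
```

With the same p, the sharp distribution keeps a single token while the flat one keeps four of six, which is the behavior a fixed k cannot reproduce.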
Many chat products and open-source presets pair top-p around 0.9–0.95 with temperature around 0.7, though provider API defaults vary, so check the documentation for the endpoint you call. Lower top-p (0.5–0.8) tightens outputs without freezing them like greedy decoding. OpenAI, Anthropic, and open-source servers (vLLM, TGI) expose both; some treat temperature=0 as "disable sampling" regardless of top-p.
Practical combo for general assistants: temperature 0.7, top-p 0.9. For tool-calling agents that must emit valid function names: temperature 0–0.3, top-p 0.8.
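Putting the pieces in order, a sketch of one decoding step with the general-assistant settings might look like the following. The function name, the logits, and the choice to treat temperature 0 as argmax are assumptions made for this example; real inference engines may order or special-case these steps differently.

```python
import math
import random

def sample_next(logits, temperature=0.7, top_p=0.9, rng=None):
    """Temperature scaling, then nucleus truncation, then a multinomial draw."""
    rng = rng or random.Random()
    if temperature == 0:
        # Assumption: treat T=0 as plain argmax, as many stacks do.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    weights = [probs[i] for i in kept]  # choices() normalizes the weights itself
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [3.0, 2.5, 1.0, -2.0, -4.0]  # made-up scores for five tokens
rng = random.Random(42)
draws = [sample_next(logits, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With these logits the nucleus contains only the two strongest tokens, so the output varies between them while the three tail tokens are never sampled.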
Repetition, frequency, and presence penalties
Autoregressive models love to repeat phrases — especially under greedy or low-temperature settings. Penalties adjust logits for tokens that already appeared in the generated prefix.
- Frequency penalty — subtracts proportional to how many times a token already appeared.
- Presence penalty — flat penalty if a token appeared at least once.
- Repetition penalty (Hugging Face style) — divides positive logits for repeated tokens by a factor >1 and multiplies negative ones by it, so the score always drops.
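The three penalty types can be sketched in one helper. Combining them in a single function, and the order in which they are applied, are assumptions made for illustration; real stacks usually expose either the additive pair or the multiplicative factor. The multiplicative branch divides positive logits and multiplies negative ones so that the penalty always lowers the score.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Return a new logit list with repeat-token penalties applied."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Additive penalties: one scales with the count, one is a flat charge.
        adjusted[token_id] -= count * frequency_penalty
        adjusted[token_id] -= presence_penalty
        # Multiplicative penalty: shrink positive logits, push negative ones lower.
        if adjusted[token_id] > 0:
            adjusted[token_id] /= repetition_penalty
        else:
            adjusted[token_id] *= repetition_penalty
    return adjusted

logits = [2.0, 1.0, -1.0, 0.5]
history = [0, 0, 0, 2]  # token 0 appeared three times, token 2 once
print(apply_penalties(logits, history, frequency_penalty=0.2, presence_penalty=0.1))
print(apply_penalties(logits, history, repetition_penalty=1.2))
```

Tokens that never appeared in the history (indices 1 and 3 here) keep their original logits.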
Light penalties (0.1–0.3 on a 0–2 scale) reduce "the the the" loops without breaking list formatting or boilerplate legal language. Aggressive penalties cause the model to avoid common words entirely, producing stilted prose.
Penalties interact with tokenization: subword tokens for "ing" and "ing." count separately. Long outputs need stronger penalties than short tweets.
Beam search — exploring multiple hypotheses
Beam search maintains b partial sequences (beams) at each step, expands each with its top candidate tokens, and keeps the b highest-scoring extended sequences. It is deterministic given fixed weights, input, and beam width.
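A toy sketch shows how a wider beam can find a higher-likelihood sequence than greedy. The probability table is hypothetical, scores are summed log-probabilities, and length normalization and end-of-sequence handling are left out to keep it short.

```python
import math

# Hypothetical next-token probabilities keyed by the last token.
TOY_MODEL = {
    "<bos>": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 1.0},
}

def beam_search(beam_width, steps):
    """Keep the beam_width best partial sequences by summed log-probability."""
    beams = [(0.0, ["<bos>"])]
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for token, prob in TOY_MODEL.get(tokens[-1], {}).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy_like = beam_search(beam_width=1, steps=2)[0]
wider = beam_search(beam_width=2, steps=2)[0]
print(greedy_like[1], round(math.exp(greedy_like[0]), 2))
print(wider[1], round(math.exp(wider[0]), 2))
```

With a beam of 1 the search commits to "nice" and ends at probability 0.2; with a beam of 2 it keeps "dog" alive and finds "dog has" at 0.36.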
Beam search dominated machine translation before LLM chat because it optimizes global sequence likelihood better than greedy. For modern instruction-tuned models, wide beams (b>4) often produce generic, over-smoothed text — the "translationese" effect. Today beam search appears mainly in:
- Speech-to-text and captioning pipelines where WER/CER metrics reward likelihood.
- Constrained decoding — grammar-guided JSON, regex-enforced outputs.
- Small beam (b=2–4) reranking — generate N samples, pick best with a reward model.
For conversational LLMs, stochastic sampling plus a good system prompt usually beats beam-5. Use beams when you have a clear scoring function over complete strings, not when you want natural dialogue.
Stop sequences, max tokens, and seed
Decoding ends when any condition triggers:
- End-of-sequence token — the model emits its trained EOS marker.
- Max tokens — hard cap on generated length (budget and safety).
- Stop sequences — user-defined strings (e.g. "\n\nHuman:") that halt generation.
Stop sequences are underused. For chain-of-thought prompts, stopping before a delimiter prevents the model from inventing fake user turns. For RAG answers, stop at citation markers to reduce rambling.
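Inference servers normally check stop sequences while tokens stream out. The sketch below applies the same idea after the fact to a complete string, which is a simplification; `truncate_at_stop` and the sample transcript are hypothetical.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest stop sequence; report which one fired, if any."""
    earliest, fired = len(text), None
    for stop in stop_sequences:
        position = text.find(stop)
        if position != -1 and position < earliest:
            earliest, fired = position, stop
    return text[:earliest], fired

raw = "The refund takes 5 days.\n\nHuman: thanks!\n\nAssistant: You're welcome."
answer, fired = truncate_at_stop(raw, ["\n\nHuman:", "\n\nSystem:"])
print(repr(answer), repr(fired))
```

Returning which sequence fired is useful for logging, since it distinguishes a clean stop from a max-token cutoff.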
Some APIs offer a seed for pseudo-random sampling — identical seed + prompt + parameters yields identical output on the same hardware and software version. Seeds help regression tests but do not guarantee cross-version reproducibility when weights or kernels change.
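The seed mechanism can be illustrated with Python's own pseudo-random generator. This is only an analogy for what a provider does server-side; `sample_tokens` is a hypothetical helper and the probabilities are made up.

```python
import random

def sample_tokens(probs, n, seed=None):
    """Draw n token indices; a fixed seed makes the draws repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.choices(range(len(probs)), weights=probs, k=n)

probs = [0.5, 0.3, 0.2]
run_a = sample_tokens(probs, 10, seed=1234)
run_b = sample_tokens(probs, 10, seed=1234)
print(run_a == run_b)
```

The two runs match because the generator state is identical; changing the seed, the probabilities, or the underlying random-number implementation would break that guarantee, just as a model or kernel upgrade does for a hosted API.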
Structured output and constrained decoding
When JSON schema compliance matters more than eloquence, prefer constrained decoding over hoping temperature=0 works. Modern stacks (Outlines, Guidance, llama.cpp grammars, provider "JSON mode") mask illegal tokens at each step, so the model cannot emit syntactically invalid output such as a trailing comma before a closing bracket.
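The masking idea fits in a few lines. This sketch does not use the API of Outlines, Guidance, or llama.cpp; the vocabulary, logits, and the hand-written "allowed after a comma" rule are all hypothetical stand-ins for what a grammar engine would compute at each step.

```python
import math

def masked_argmax(logits, allowed_ids):
    """Set disallowed logits to -inf, then pick the best remaining token."""
    masked = [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

# Hypothetical vocabulary and rule: after a comma inside a JSON object,
# only a string key may follow.
vocab = ["}", ",", '"name"', '"age"', "]"]
logits = [2.5, 1.0, 0.8, 0.3, 0.1]  # the model "prefers" closing the object
allowed_after_comma = {2, 3}

unconstrained = vocab[max(range(len(logits)), key=lambda i: logits[i])]
constrained = vocab[masked_argmax(logits, allowed_after_comma)]
print(unconstrained, constrained)
```

Without the mask the top-scoring token would close the object right after a comma; with it, the best legal token is chosen instead, whatever the temperature.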
See structured outputs for schema enforcement patterns. Sampling knobs still matter for free-text fields inside the schema — tune those separately from envelope keys.
Tuning by use case
| Use case | Temperature | Top-p | Notes |
|---|---|---|---|
| JSON / tool calls | 0–0.2 | 0.8 | Add grammar or schema constraints |
| Customer support | 0.3–0.5 | 0.9 | Low creativity reduces policy violations |
| General chat | 0.7 | 0.9 | Default for many products |
| Creative writing | 0.9–1.1 | 0.95 | Monitor repetition; add penalties |
| Code completion | 0.0–0.2 | 0.95 | Greedy or near-greedy; FIM templates help |
| RAG answers | 0.2–0.4 | 0.9 | Pair with retrieval, not high randomness |
Log sampling parameters with every request in your observability stack. When users report bad outputs, spotting "temperature was 1.2 on a compliance bot" in the logs is a five-minute fix; without that record you are left guessing at the cause.
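One way to keep presets and logging together is a small registry keyed by route. Everything named here (`SamplingPreset`, `PRESETS`, `log_record`, the route names, and the `max_tokens` values) is hypothetical; the temperature and top-p values echo rows of the table above.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SamplingPreset:
    temperature: float
    top_p: float
    max_tokens: int

# Hypothetical per-route presets; max_tokens values are invented for the example.
PRESETS = {
    "extract": SamplingPreset(temperature=0.0, top_p=0.8, max_tokens=512),
    "chat": SamplingPreset(temperature=0.7, top_p=0.9, max_tokens=1024),
    "creative": SamplingPreset(temperature=1.0, top_p=0.95, max_tokens=2048),
}

def log_record(request_id, route, seed=None):
    """Build a JSON log line carrying the sampling parameters for one request."""
    record = {"request_id": request_id, "route": route, "seed": seed}
    record.update(asdict(PRESETS[route]))
    return json.dumps(record, sort_keys=True)

print(log_record("req-001", "chat", seed=42))
```

Because the record is built from the same preset object the request uses, the logged values cannot drift from what was actually sent.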
Common mistakes
- High temperature on factual bots — increases confident-sounding falsehoods.
- Ignoring top-p when temperature is low — even T=0.3 can sample tail tokens if nucleus is wide.
- Max tokens too generous — models ramble; costs spike; repetition penalties fire too late.
- Same settings for draft and final — use high-T brainstorm, low-T rewrite pass.
- Expecting reproducibility without seed — stochastic runs differ; evals need fixed seeds or greedy.
- Beam search for chat — dull, repetitive dialogue unless heavily post-processed.
Production checklist
- Define per-route sampling presets (chat, extract, code) — never one global default.
- Document which parameters your inference engine actually honors (some ignore top-k).
- Set max output tokens from p95 latency budget, not model maximum.
- Add stop sequences for multi-turn templates and tool-call delimiters.
- Log temperature, top-p, penalties, and seed with request IDs.
- A/B test sampling on offline eval sets before changing production defaults.
- Use constrained decoding for machine-readable outputs; sampling alone is insufficient.
- For agents, lower temperature on tool-selection steps; allow higher on user-facing summaries.
- Revisit settings when you swap model versions — optimal temperature shifts per fine-tune.
- Expose "creativity" sliders to end users only as mapped presets, not as raw sampling-parameter controls.
Key takeaways
- Decoding turns per-step logits into a token sequence — it is as important as model weights.
- Greedy is deterministic and best for structure; sampling adds variety for open text.
- Temperature scales randomness; top-p adaptively truncates the tail.
- Penalties fight repetition; tune lightly to avoid broken grammar.
- Beam search suits scoring complete hypotheses, not casual chat.
- Match sampling to task risk — factual and structured work wants low temperature and constraints.
Related reading
- Prompt engineering explained — system prompts and chain-of-thought pair with decoding choices
- LLM inference serving explained — batching, speculative decoding, and production SLOs
- LLM tokenization explained — how subwords affect repetition penalties and stop strings
- LLM hallucinations explained — why high temperature increases confident errors