LLM text decoding strategies explained

Your support chatbot answers correctly but sounds stiff — or it loops the same phrase three times in one paragraph. The model weights are fine; the decoding strategy is wrong. After a transformer computes logits for every token in its vocabulary, something must decide which token to append next. That choice — greedy argmax, beam search, or stochastic sampling with temperature — shapes fluency, creativity, repetition, and latency more than most teams realize. This guide covers the autoregressive decode loop, deterministic vs sampling methods, top-k and nucleus (top-p) filtering, repetition penalties and stop sequences, a Harbor Support ticket-reply worked example, a method decision table, common pitfalls, and a production checklist alongside our transformer architecture guide, tokenization guide, and inference serving overview.

What decoding is — and where it sits in the stack

A language model is a conditional probability machine. Given prior tokens x₁…xₜ, it outputs a vector of scores (logits) over the vocabulary for token xₜ₊₁. Decoding is the policy that turns those logits into an actual next token, then repeats until a stop condition.

Decoding is not training, fine-tuning, or RAG retrieval. It runs at inference time inside serving engines like vLLM, TGI, or llama.cpp. Nor is it the same as speculative decoding, which speeds up how forward passes execute — the sampling policy still applies to whichever logits emerge from verification.

Core concepts

Logits — raw model outputs before softmax; higher means more likely.
Softmax — converts logits to a probability distribution summing to 1.
Autoregressive loop — append chosen token, extend KV cache, predict again.
Deterministic decoding — always picks the highest-scoring path (greedy, beam).
Stochastic decoding — samples from the distribution; introduces randomness.
Temperature — scales logits before softmax; lower = sharper, higher = flatter.
Top-k / top-p — truncate the distribution to the k most likely tokens, or to the smallest set whose cumulative probability reaches p.
Repetition penalty — down-weights tokens that already appeared in context.

The autoregressive generation loop

Generation splits into prefill and decode. Prefill processes the entire prompt in one parallel forward pass, populating the KV cache. Decode then runs one token at a time: forward pass → logits → decoding policy → append token → update cache → repeat.
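The decode loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not a serving engine: `toy_logits` is a hypothetical stand-in for a model forward pass, the four-word vocabulary is invented, and there is no real KV cache.

```python
from typing import Callable, List

VOCAB = ["<eos>", "hello", "world", "!"]  # invented toy vocabulary
EOS_ID = 0


def toy_logits(tokens: List[int]) -> List[float]:
    """Hypothetical forward pass: favors the id after the last token, wrapping to EOS."""
    nxt = (tokens[-1] + 1) % len(VOCAB)
    return [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


def greedy_policy(logits: List[float]) -> int:
    """Decoding policy: pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def generate(prompt: List[int],
             policy: Callable[[List[float]], int],
             max_tokens: int = 16) -> List[int]:
    tokens = list(prompt)              # prefill: the prompt is consumed as given
    for _ in range(max_tokens):        # decode: one new token per iteration
        logits = toy_logits(tokens)    # forward pass -> logits
        next_id = policy(logits)       # decoding policy -> token id
        if next_id == EOS_ID:          # stop condition: end-of-sequence
            break
        tokens.append(next_id)         # append, then predict again
    return tokens[len(prompt):]        # return only the generated tokens


print([VOCAB[i] for i in generate([1], greedy_policy)])  # ['world', '!']
```

Swapping `greedy_policy` for any other function from logits to a token id changes the decoding strategy without touching the loop.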

Each decode step is memory-bandwidth bound for large models — you touch every parameter for a single new token. That is why output length dominates latency and why teams pair decoding tuning with speculative decoding or quantization. But even a fast engine produces bad text if the sampling policy is wrong.

Stop conditions end the loop: emit an end-of-sequence (EOS) token, hit a max-token budget, match a custom stop sequence (e.g. </answer>), or satisfy a structured-output schema validator. Production APIs expose these as max_tokens, stop, and response_format parameters.
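A custom stop sequence can be applied as a simple text check on the output so far. The helper below is a minimal sketch; the name `apply_stop_sequences` is hypothetical, and real engines usually run an equivalent check incrementally while streaming tokens.

```python
from typing import List, Tuple


def apply_stop_sequences(text: str, stops: List[str]) -> Tuple[str, bool]:
    """Cut `text` at the earliest occurrence of any non-empty stop string.

    Returns the (possibly truncated) text and whether a stop was hit.
    """
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if stop and idx != -1:
            cut = min(cut, idx)   # keep the earliest match across all stops
    return text[:cut], cut < len(text)


reply, stopped = apply_stop_sequences(
    "Refund approved.</answer> trailing text", ["</answer>", "---"]
)
print(repr(reply), stopped)
```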

Deterministic decoding: greedy and beam search

Greedy decoding

Greedy picks argmax — the single highest-probability token at each step. It is fast, reproducible, and works well for short, factual completions where one dominant answer exists (classification-style prompts, JSON keys, code syntax).

Greedy fails when the locally best token leads to a dead end. Language is full of high-probability function words ("the", "a") that greedy locks onto, producing dull or repetitive prose. For example, after "The cat sat on the", greedy commits to whichever token scores highest at that step and can never revisit the choice, even if a slightly less likely token would have led to a more probable sentence overall.

Beam search

Beam search keeps the top b partial sequences (beams) at each step instead of one. It scores cumulative log-probability across tokens and returns the highest-scoring complete sequence when beams terminate. Wider beams explore more alternatives at the cost of b× compute per step.
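To make the difference from greedy concrete, here is a small sketch over an invented two-step model. The probabilities in `MODEL` are made up for illustration; with a beam width of 1 the search reduces to greedy and commits to "A", while a width of 2 keeps "B" alive long enough to find the more probable full sequence.

```python
import math
from typing import Dict, List, Tuple

# Invented next-token probabilities, keyed by the prefix generated so far.
MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def beam_search(beam_width: int, steps: int = 2) -> Tuple[Tuple[str, ...], float]:
    """Return the best (sequence, cumulative log-probability) after `steps` tokens."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in MODEL[seq].items():
                # Score a candidate by summing log-probabilities along its path.
                candidates.append((seq + (tok,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]   # keep only the top b partial sequences
    return beams[0]


for width in (1, 2):
    seq, logp = beam_search(width)
    print(width, seq, round(math.exp(logp), 2))
```

Greedy ends at ("A", "x") with probability 0.33, while the width-2 beam finds ("B", "x") at 0.36, at the price of expanding twice as many candidates per step.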

Beam search dominated machine translation and summarization before chat LLMs. It reduces some greedy errors but still favors "safe", high-probability text — often bland for open-ended chat. Most modern chat APIs default to sampling, not beam search, because users want varied, natural phrasing. Beam remains useful for:

Machine translation and captioning with reference eval (BLEU, CIDEr).
Constrained generation where diversity hurts (unit tests, regex-shaped outputs).
Small beam widths (2–4) on code completion when determinism matters.

Stochastic decoding: temperature, top-k, and top-p

Temperature

Before softmax, divide logits by temperature T. T < 1 sharpens the distribution — the top token dominates (approaches greedy as T → 0). T > 1 flattens it — unlikely tokens gain probability, increasing creativity and risk. T = 1 uses the model's native distribution.
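The effect of T is easy to see numerically. The sketch below uses made-up logits; the function name `softmax_with_temperature` is illustrative, and T = 0 is rejected because it is normally handled as plain argmax rather than a division.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T = 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # invented scores for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

The top token's share shrinks as T grows: roughly 0.87 at T = 0.5, 0.67 at T = 1, and 0.51 at T = 2.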

Typical chat defaults: T ≈ 0.7–1.0 for creative writing, T ≈ 0–0.3 for factual Q&A or tool-calling where hallucination is costly. Temperature interacts with hallucination risk: higher T does not create false facts from nothing, but it makes the model more willing to sample low-confidence continuations.

Top-k sampling

Top-k keeps only the k highest-probability tokens, renormalizes, and samples. k = 50 is a common default. k = 1 is exactly greedy and very low k (2–5) stays close to it; very high k approaches full-vocabulary sampling. The weakness is that k is fixed regardless of how peaked the distribution is: at a confident step, k = 50 still admits 49 unlikely tokens; at an uncertain step, k = 50 might exclude valid options.
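A minimal sketch of the filter, working on an invented probability table rather than real model output (the name `top_k_filter` is illustrative):

```python
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most likely tokens and renormalize so they sum to 1."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}  # invented distribution
print(top_k_filter(probs, k=2))
```

With k = 2 only "the" and "a" survive, and their probabilities are rescaled to about 0.625 and 0.375 before sampling.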

Top-p (nucleus) sampling

Top-p (nucleus sampling) sorts tokens by probability and keeps the smallest set whose cumulative mass ≥ p (e.g. p = 0.9). The candidate pool adapts to context: peaked distributions use few tokens; flat distributions admit more. Most production chat stacks use top-p (0.85–0.95) with temperature, often disabling top-k or setting k very high so top-p dominates.
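The adaptive pool size shows up clearly in a small sketch. Both distributions below are invented, and `top_p_filter` is an illustrative name, not a library function.

```python
from typing import Dict


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest top-ranked set with cumulative mass >= p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:      # nucleus is complete once the mass reaches p
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


peaked = {"yes": 0.92, "no": 0.05, "maybe": 0.03}        # confident step
flat = {"a": 0.3, "b": 0.25, "c": 0.25, "d": 0.2}        # uncertain step
print(len(top_p_filter(peaked, 0.9)), len(top_p_filter(flat, 0.9)))  # 1 4
```

The same p = 0.9 keeps a single token at the confident step and all four at the uncertain one, which is the behavior a fixed k cannot reproduce.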

Combining knobs

A practical recipe: temperature 0.7, top-p 0.9, top-k disabled (many APIs treat k = 0 as off). For deterministic extraction: temperature 0 (greedy), or 0.1 with a low top-p. Logit bias and banned-token lists offer per-token nudges without changing the global temperature, which is useful for suppressing profanity or forcing a specific JSON delimiter.
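Put together, one sampling step might look like the sketch below. It is an assumption-laden illustration: `sample_token` is a hypothetical helper, the logits are invented, the order (bias, then temperature, then top-p) is one reasonable choice rather than a standard, and a bias of -100 is used as an effective ban.

```python
import math
import random
from typing import Dict, Optional


def sample_token(logits: Dict[str, float],
                 temperature: float = 0.7,
                 top_p: float = 0.9,
                 logit_bias: Optional[Dict[str, float]] = None,
                 rng: Optional[random.Random] = None) -> str:
    """One decode step: logit bias -> temperature -> top-p -> sample."""
    rng = rng or random.Random()
    bias = logit_bias or {}
    biased = {tok: score + bias.get(tok, 0.0) for tok, score in logits.items()}
    if temperature == 0:                      # T = 0 is treated as greedy argmax
        return max(biased, key=biased.get)
    peak = max(biased.values())
    weights = {tok: math.exp((s - peak) / temperature) for tok, s in biased.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:                  # nucleus (top-p) truncation
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens, probs = zip(*kept)
    return rng.choices(tokens, weights=probs, k=1)[0]


logits = {"damn": 3.0, "sorry": 2.5, "oops": 2.0}   # invented scores
print(sample_token(logits, logit_bias={"damn": -100.0}, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the draw repeatable in tests, and the bias removes "damn" from the candidate pool without touching the other settings.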

Repetition control and length shaping

Autoregressive models loop because high-probability tokens stay high-probability once context includes them. Mitigations:

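Repetition penalty — rescale the logits of tokens already in context by a penalty factor > 1 (typical 1.05–1.2) so they become less likely: the common formulation divides positive logits by the factor and multiplies negative logits by it. Too aggressive a value causes synonym hopping and broken grammar.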
Frequency penalty — penalizes proportional to how often a token appeared (OpenAI-style frequency_penalty).
Presence penalty — flat penalty if token appeared at all (presence_penalty).
No-repeat n-gram — hard ban on repeating any n-gram (common in translation; rare in open chat because it can block valid phrases).
Min/max tokens — force minimum length for summarization; cap max to control cost.
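The first three mitigations can be sketched as plain logit adjustments. This is a hedged illustration with invented logits and hypothetical function names: the repetition penalty follows the common formulation with a factor above 1 (divide positive logits, multiply negative ones), and the frequency and presence penalties follow the subtractive form that OpenAI documents for its API.

```python
from collections import Counter
from typing import Dict, List


def apply_repetition_penalty(logits: Dict[str, float],
                             context: List[str],
                             penalty: float = 1.1) -> Dict[str, float]:
    """Make every token already seen in context less likely (penalty > 1)."""
    out = dict(logits)
    for tok in set(context):
        if tok in out:
            # Dividing a positive logit or multiplying a negative one both lower it.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_count_penalties(logits: Dict[str, float],
                          context: List[str],
                          frequency_penalty: float = 0.0,
                          presence_penalty: float = 0.0) -> Dict[str, float]:
    """Subtract count * frequency_penalty, plus a flat presence_penalty if seen at all."""
    counts = Counter(context)
    return {
        tok: score
        - counts[tok] * frequency_penalty
        - (presence_penalty if counts[tok] > 0 else 0.0)
        for tok, score in logits.items()
    }


logits = {"refund": 2.0, "policy": -1.0, "thanks": 1.0}   # invented scores
context = ["refund", "refund", "policy"]
print(apply_repetition_penalty(logits, context, penalty=1.25))
print(apply_count_penalties(logits, context, frequency_penalty=0.5, presence_penalty=0.3))
```

Note that both functions key on token identity, so a word split into several tokens is penalized piece by piece.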

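In common engines these penalties are applied to the logits before temperature scaling and top-k/top-p truncation, though the exact order varies by implementation, so check your serving stack. Tune them on real user transcripts: synthetic prompts hide repetition that only appears in multi-turn threads when the model quotes its own prior answers.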

Worked example: Harbor Support ticket-reply bot

Harbor Support routes tier-1 tickets through a fine-tuned 8B model. Product requirements: accurate policy citations, warm tone, no verbatim repetition of the ticket subject, responses under 200 tokens, and reproducible A/B tests.

Baseline (failed): greedy decoding, temperature 0. Replies were factually correct but robotic — every refund email opened with "Thank you for contacting Harbor Support regarding your inquiry." CSAT dropped 12 points in a blind test.

Revision A: temperature 0.8, top-p 0.92, repetition penalty 1.15. Tone improved, but two of fifty test tickets hallucinated a return window not in policy — high T sampled an unlikely but fluent continuation.

Shipping config: temperature 0.4, top-p 0.88, repetition penalty 1.1, presence penalty 0.3, max_tokens 180, stop sequences ["\n\nCustomer:", "---"]. Policy paragraphs retrieved via RAG are injected in prefill; decoding only shapes phrasing. Greedy is used for a structured resolution_code field (temperature 0, 5-token cap) while the body uses sampling. CSAT recovered; hallucination rate matched the RAG-only baseline in eval.

Lesson: split decoding policies per output segment when a response mixes free text and constrained labels. One global temperature rarely fits both.
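One way to express the split is a small per-segment config table. The sketch below reuses the shipping values from the example; the `DecodingConfig` class, the segment names, and `config_for` are hypothetical and not tied to any particular serving API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DecodingConfig:
    temperature: float
    top_p: float = 1.0
    repetition_penalty: float = 1.0
    presence_penalty: float = 0.0
    max_tokens: int = 256
    stop: Tuple[str, ...] = ()


# Hypothetical segment names; values taken from the shipping config above.
POLICIES: Dict[str, DecodingConfig] = {
    "body": DecodingConfig(
        temperature=0.4,
        top_p=0.88,
        repetition_penalty=1.1,
        presence_penalty=0.3,
        max_tokens=180,
        stop=("\n\nCustomer:", "---"),
    ),
    # Structured label: greedy with a tight token cap.
    "resolution_code": DecodingConfig(temperature=0.0, max_tokens=5),
}


def config_for(segment: str) -> DecodingConfig:
    """Look up the decoding policy for one output segment."""
    return POLICIES[segment]


print(config_for("body"))
print(config_for("resolution_code"))
```

Keeping the policies in one frozen table also makes it straightforward to log the exact config alongside each completion.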

Decoding method decision table

Method	Best for	Avoid when	Latency
Greedy (T=0)	JSON keys, classification, code tokens, reproducible eval	Open-ended chat, creative writing	Lowest
Beam search (b=4–8)	Translation, captioning, constrained strings	Diverse chat, long outputs (cost scales with b)	High
Top-p + temperature	General chat, support bots, drafting	Strict determinism or audit trails requiring identical reruns	Low
Top-k only	Legacy stacks, simple experimentation	When distribution shape varies widely step to step	Low
High temperature (>1)	Brainstorming, fiction, synthetic data diversity	Factual Q&A, legal/medical, tool arguments	Low

Common pitfalls

Tuning temperature alone — top-p and repetition penalties matter as much; grid-search all three on held-out conversations.
Using beam search for chat — outputs sound like Wikipedia; users perceive "AI slop."
Ignoring tokenizer effects — a "word" is multiple tokens; repetition penalty applies per token ID, not per word (see tokenization).
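Same config for prefill and decode — sampling parameters affect only generated tokens, even when an API exposes a single config for both phases; prefill consumes the prompt as given and makes no token choices.
Non-reproducible "temperature 0" — GPU kernels, batching, and flash attention can still introduce tiny floating-point variance that occasionally flips an argmax. A sampling seed alone does not remove that variance; if audits require bit-identical reruns, also pin hardware, batch composition, and kernel settings, and log the generated output.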
Max tokens too low — truncated JSON breaks parsers; set generous caps with stop sequences instead.
Evaluating decoding on single-turn prompts — repetition and drift appear in multi-turn threads; test full sessions.
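The grid search mentioned in the first pitfall can be sketched as a small harness. Everything model-specific is an assumption here: `generate` and `score` are caller-supplied callables (for example a wrapper around your serving API and a human or LLM-judge score), and the stand-ins at the bottom exist only to show the call shape.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence


def sweep(temperatures: Sequence[float],
          top_ps: Sequence[float],
          prompts: Sequence[str],
          generate: Callable[..., Any],
          score: Callable[[str, Any], float]) -> List[Dict[str, float]]:
    """Score every (temperature, top_p) pair on all prompts; best mean score first."""
    results = []
    for t, p in product(temperatures, top_ps):
        scores = [score(prompt, generate(prompt, temperature=t, top_p=p))
                  for prompt in prompts]
        results.append({"temperature": t, "top_p": p,
                        "mean_score": sum(scores) / len(scores)})
    return sorted(results, key=lambda r: r["mean_score"], reverse=True)


# Stand-ins for illustration only: a fake generator and a fake judge.
def fake_generate(prompt: str, temperature: float, top_p: float):
    return (temperature, top_p)


def fake_score(prompt: str, output) -> float:
    return -abs(output[0] - 0.4) - abs(output[1] - 0.9)


best = sweep([0.2, 0.4, 0.8], [0.8, 0.9], ["ticket 1", "ticket 2"],
             fake_generate, fake_score)[0]
print(best["temperature"], best["top_p"])  # 0.4 0.9
```

Adding a third axis for the repetition penalty only means extending the `product` call and the `generate` signature.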

Production checklist

Document default temperature, top-p, top-k, and penalty values per use case (chat vs extract vs code).
Expose decoding params in API requests for power users; log them with each completion for debugging.
Run offline eval sweeps (T × top-p grid) on 200+ real prompts with human or LLM-judge scoring.
Split policies for structured vs free-text segments in the same response.
Set max_tokens from p95 desired length + buffer; prefer stop sequences over hard truncation.
Monitor repetition rate and distinct-n-gram metrics in production logs.
Pair sampling with RAG or tool grounding when factual accuracy matters more than phrasing variety.
Version decoding defaults when changing models — optimal T for Llama 3 may not suit Mistral.
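For the repetition monitoring item, a distinct-n ratio is simple to compute from logged completions. The sketch below assumes whitespace tokenization for readability; in production you would more likely count model token IDs, and the function name is illustrative.

```python
from typing import List, Sequence


def distinct_n(tokens: Sequence[str], n: int) -> float:
    """Share of n-grams that are unique; lower values indicate more repetition."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(tokens) < n:
        return 0.0                      # no n-grams to measure
    ngrams: List[tuple] = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


looping = "thank you for contacting thank you for contacting".split()
varied = "thanks for reaching out about your refund request today".split()
print(distinct_n(looping, 2), distinct_n(varied, 2))
```

The looping reply scores 4/7 on distinct-2 while the varied one scores 1.0, so a falling average across production logs is a usable early warning for repetition.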

Key takeaways

Decoding is the policy that picks the next token from model logits — it controls fluency, diversity, and repetition at inference time.
Greedy and beam search are deterministic; top-p and temperature sampling produce natural chat at the cost of variance.
Top-p adapts candidate pool size to context uncertainty; it usually beats fixed top-k for open-ended generation.
Repetition and presence penalties fix loops but need tuning on real multi-turn transcripts.
Match decoding strategy to task: low temperature for facts and schemas, moderate top-p for support and chat, beam only when diversity hurts.