Guide
LLM speculative decoding explained
Large language models generate text one token at a time. Each decode step loads the full model weights from GPU memory — so latency scales linearly with output length even when compute units sit idle. Speculative decoding breaks that bottleneck by proposing several candidate tokens with a small, fast draft model, then verifying them in a single forward pass of the large target model. Accepted drafts advance the sequence in bulk; rejected drafts roll back with no quality loss. This guide explains the draft-and-verify loop, acceptance-rate math, modern variants like Medusa and EAGLE, how serving engines integrate speculation, and when the technique actually pays off — building on inference serving and KV cache fundamentals.
Why autoregressive decode is the bottleneck
Transformer inference splits into two phases. Prefill processes the entire prompt in parallel — one heavy matmul over all input tokens. Decode then generates each new token sequentially: read weights, run one forward step, sample or argmax the next token, append to the KV cache, repeat.
For long completions — chat answers, code generation, summarization — decode dominates wall clock time. The limiting factor is usually memory bandwidth, not FLOPs: every new token touches every parameter layer. Quantization shrinks bytes per token; batching improves GPU utilization; but neither removes the fundamental serial loop of autoregressive generation.
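A back-of-envelope calculation makes the bandwidth limit concrete. The sketch below assumes a hypothetical 70B-parameter model and an accelerator with roughly 3.35 TB/s of memory bandwidth. Both figures are illustrative assumptions, and the estimate ignores KV cache reads, activation traffic, and multi-GPU sharding.

```python
def min_decode_latency_ms(n_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if every weight is read once per decode step."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1000.0


# Illustrative assumptions, not measurements.
PARAMS = 70e9        # hypothetical 70B-parameter target model
BANDWIDTH = 3.35e12  # assumed memory bandwidth in bytes per second

fp16_ms = min_decode_latency_ms(PARAMS, 2.0, BANDWIDTH)  # 16-bit weights
int4_ms = min_decode_latency_ms(PARAMS, 0.5, BANDWIDTH)  # 4-bit weights

print(f"FP16 floor: {fp16_ms:.1f} ms/token")
print(f"INT4 floor: {int4_ms:.1f} ms/token")
```

Quantization lowers the floor, but every token still pays for one full pass over the weights, which is the serial cost speculation tries to amortize.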
Speculative decoding attacks the serial loop directly. Instead of one target-model step per token, you amortize target-model work across multiple tokens whenever the draft model guesses correctly — which happens surprisingly often when draft and target share training distribution and tokenizer vocabulary.
The draft-and-verify loop
Classical speculative decoding (Leviathan et al., 2022; Chen et al., 2023) uses two models:
- Draft model — smaller and faster (e.g., 1B parameters vs 70B target). Runs k autoregressive steps to propose tokens d₁, d₂, …, dₖ.
- Target model — the quality model users expect. Runs one parallel forward pass over the draft prefix and produces its own distribution at each position.
Verification compares draft tokens to target distributions position by position:
- Start from the current sequence context (prompt + tokens generated so far).
- Draft model generates k candidate tokens cheaply.
- Target model evaluates all k positions in one batched forward pass.
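- Accept draft token dᵢ with probability min(1, p(dᵢ)/q(dᵢ)), where p is the target distribution and q is the draft distribution at that position. This rejection-sampling rule preserves the target model's output distribution exactly (lossless speculation); under greedy decoding it reduces to checking that dᵢ equals the target's argmax.
- On the first rejection, sample a correction from the adjusted target distribution at that position (the positive part of p − q, renormalized) and discard the remaining drafts. If all k drafts are accepted, the same target pass yields one extra token.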
- Append all accepted tokens (plus any correction) to the sequence; update KV caches; repeat until end-of-sequence or max length.
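The steps above can be sketched as a single round of speculative sampling over toy distributions. This is a minimal illustration, not a serving implementation: `draft_probs`, `target_probs`, and `speculative_round` are hypothetical names, the "models" are hand-written functions over a four-token vocabulary, and the target is called once per position here, whereas a real engine scores all positions in one batched pass.

```python
import random

VOCAB = 4  # toy vocabulary size (assumption for illustration)


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def target_probs(ctx):
    """Toy target model: strongly prefers (last token + 1) mod VOCAB."""
    w = [1.0] * VOCAB
    w[(ctx[-1] + 1) % VOCAB] = 5.0
    return normalize(w)


def draft_probs(ctx):
    """Toy draft model: same preference as the target, but less peaked."""
    w = [1.0] * VOCAB
    w[(ctx[-1] + 1) % VOCAB] = 3.0
    return normalize(w)


def speculative_round(context, draft_fn, target_fn, k, rng):
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. Draft k tokens autoregressively, keeping each draft distribution q.
    drafted, q_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_fn(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Target distributions p at all k + 1 positions.
    p_dists = [target_fn(list(context) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept left to right with probability min(1, p / q).
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the renormalized positive part of p - q.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(normalize(residual), rng))
        return out
    # Every draft was accepted: take one bonus token from the target.
    out.append(sample(p_dists[k], rng))
    return out


rng = random.Random(0)
print(speculative_round([0], draft_probs, target_probs, k=4, rng=rng))
```

Each round emits between 1 and k + 1 tokens. The residual resampling step is what keeps the output distribution equal to the target's, rather than a blend of draft and target.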
The critical property: when configured correctly, speculative decoding is distributionally identical to running the target model alone. You trade extra draft-model compute and memory for fewer target-model steps — a pure latency win when acceptance rates are high.
Acceptance rate and expected speedup
If the target model accepts each drafted token with probability α on average and you draft k tokens per round, each target forward pass commits (1 − α^(k+1)) / (1 − α) tokens in expectation, assuming acceptance is independent across positions. The count never drops below 1, because a rejection still yields a correction sample, and it peaks at k + 1 when every draft is accepted. Speedup is bounded by how often drafts align with the target: per-position acceptance is typically 60–90% when draft and target are well matched, yielding 1.5×–3× decode speedups in practice.
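One common way to put numbers on this is to assume each drafted position is accepted independently with probability α, which is a simplification because real acceptance varies by position and context. The expected number of tokens committed per target pass is then the geometric sum 1 + α + … + α^k. The function name below is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass under i.i.d. acceptance alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Closed form of 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.6, 0.8, 0.9):
    row = [round(expected_tokens_per_pass(alpha, k), 2) for k in (2, 4, 8)]
    print(f"alpha={alpha}: k=2,4,8 -> {row}")
```

The curve flattens as k grows: at moderate acceptance, later draft positions are rarely reached, so longer drafts add cost faster than they add committed tokens.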
Poor draft models hurt twice: low acceptance means wasted draft compute and extra KV cache churn from rolled-back prefixes. Measuring acceptance rate per workload is mandatory before enabling speculation in production.
Choosing and training draft models
Draft model selection is the main tuning lever:
- Same-family smaller checkpoint — e.g., Llama-3.2-1B drafting for Llama-3.1-70B. Shared tokenizer and similar pretraining data maximize acceptance.
- Distilled draft heads — train a lightweight module on target-model hidden states (EAGLE, EAGLE-2) to predict multiple future tokens without a separate full model.
- N-gram and prompt lookup — for repetitive templates (JSON schemas, boilerplate code), retrieve continuations from recent context without any neural draft. Nearly free when patterns repeat.
- Quantized drafts — INT4 or FP8 draft models further reduce per-step latency; pair with quantized target weights when memory is tight.
Mismatched tokenizers, different chat templates, or domain-shifted drafts (legal model drafting for a code model) collapse acceptance rates. Always validate on representative prompts — RAG-heavy workloads with long retrieved chunks behave differently from short Q&A.
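The n-gram and prompt lookup option is simple enough to sketch directly. The version below is a minimal illustration under assumed defaults (a 3-token match window and up to 5 proposed tokens); `prompt_lookup_draft` is a hypothetical name, not an API from any serving engine.

```python
def prompt_lookup_draft(tokens: list[int], ngram: int = 3, k: int = 5) -> list[int]:
    """Propose up to k tokens by reusing what followed the trailing n-gram earlier."""
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left so the most recent earlier match wins,
    # starting one position before the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram:start + ngram + k]
    return []  # no repeat found: fall back to ordinary decode


history = [1, 2, 3, 4, 5, 9, 1, 2, 3]
print(prompt_lookup_draft(history, ngram=3, k=2))
```

The proposals still go through normal target verification, so a bad match costs only the wasted verification positions.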
Modern variants beyond two-model speculation
Medusa heads
Medusa attaches multiple lightweight decoding heads to the target model itself, each predicting the token at a different future offset (t+1, t+2, …). One forward pass proposes several continuations without a separate draft model. Verification still filters bad branches. Medusa simplifies deployment, since there is no second model to load, but it adds training or fine-tuning cost for the extra heads, and its approximate acceptance schemes can shift outputs slightly away from the target distribution.
EAGLE and hidden-state drafting
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) trains a lightweight draft network on the target model's hidden features to predict future tokens with high alignment. EAGLE-2 makes the draft tree dynamic: it expands the branches the draft network is most confident in and verifies them in parallel, pushing speedups toward 2×–3× on code and chat benchmarks when acceptance stays above 70%.
Lookahead and parallel decoding trees
Some serving stacks build token trees: draft multiple branches, verify all nodes in one batched target pass, commit the longest accepted path. This generalizes speculation from a linear chain to a shallow search tree — useful when sampling temperature is low and the model's next tokens are highly peaked.
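A toy version of tree verification under greedy decoding might look like the following. The nested-dict tree layout and the names `verify_tree` and `target_next` are assumptions for illustration. A real engine would score every node in one batched pass with a tree attention mask, whereas this sketch calls the toy target once per level.

```python
def verify_tree(context, tree, target_next):
    """Walk a draft token tree and return the tokens to commit (greedy target).

    `tree` maps each candidate token to the subtree of its continuations.
    """
    committed = []
    node = tree
    while True:
        want = target_next(context + committed)  # target's greedy choice here
        committed.append(want)  # always valid: it is the target's own token
        if want not in node:
            # No drafted branch matches, so `want` acts as the correction
            # (or as the bonus token when we ran off the end of a branch).
            return committed
        node = node[want]  # a branch matched: descend without a new pass


def target_next(ctx):
    """Toy greedy target: always emits (last token + 1) mod 10."""
    return (ctx[-1] + 1) % 10


# Two branches after token 1, plus an alternative first token 4.
draft_tree = {1: {2: {3: {}}, 5: {}}, 4: {}}
print(verify_tree([0], draft_tree, target_next))
```

With several branches per level, one wrong guess no longer ends the round, at the cost of more nodes to verify per pass.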
Integration with serving engines
Production speculation lives inside the decode scheduler, not your application code:
- vLLM — supports speculative decoding with a separate draft model or n-gram proposers; configure speculative_model and num_speculative_tokens. Works with continuous batching and PagedAttention; draft KV and target KV must be tracked separately per sequence.
- Hugging Face TGI — exposes speculation flags for compatible model pairs; check release notes for Medusa checkpoint support.
- TensorRT-LLM / FasterTransformer — vendor-optimized paths for draft-target pairs on NVIDIA hardware; strongest on fixed model combos with tuned kernels.
- llama.cpp — draft model via --model-draft for local inference; acceptance visible in perf logs for desktop experimentation.
Enable speculation only after baseline serving metrics are stable. Track tokens per second, time to first token (TTFT), and acceptance rate separately — speculation should not regress TTFT on short outputs where draft overhead dominates.
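Keeping these metrics separate can be as simple as recording them per request and pooling acceptance per workload slice. The sketch below is one possible shape. `RequestMetrics` and `acceptance_by_slice` are hypothetical names, and it assumes the serving engine reports drafted and accepted token counts per request.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    ttft_s: float         # time to first token (prefill plus first decode step)
    decode_s: float       # wall-clock time spent in decode after the first token
    output_tokens: int
    drafted_tokens: int   # tokens proposed by the draft mechanism
    accepted_tokens: int  # drafted tokens the target accepted

    @property
    def decode_tokens_per_s(self) -> float:
        return self.output_tokens / self.decode_s if self.decode_s > 0 else 0.0

    @property
    def acceptance_rate(self) -> float:
        return self.accepted_tokens / self.drafted_tokens if self.drafted_tokens else 0.0


def acceptance_by_slice(records):
    """Pool acceptance over (slice_name, RequestMetrics) pairs, one rate per slice."""
    drafted, accepted = defaultdict(int), defaultdict(int)
    for name, m in records:
        drafted[name] += m.drafted_tokens
        accepted[name] += m.accepted_tokens
    return {name: accepted[name] / drafted[name] for name in drafted if drafted[name]}


records = [
    ("chat", RequestMetrics(0.25, 2.0, 200, 240, 180)),
    ("code", RequestMetrics(0.40, 3.0, 450, 500, 450)),
    ("chat", RequestMetrics(0.20, 1.0, 80, 160, 60)),
]
print(acceptance_by_slice(records))
```

Pooling counts before dividing weights long requests more heavily than averaging per-request rates, which usually matches how GPU time is actually spent.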
Memory, cost, and when speculation fails
Speculation is not a free lunch:
- Extra GPU memory — draft model weights plus duplicate KV cache entries during verification. On memory-constrained GPUs, speculation can reduce batch capacity and lower aggregate throughput despite faster per-sequence decode.
- Short completions — drafting overhead exceeds savings below ~50–100 output tokens. Disable for classification-style one-token answers.
- High temperature sampling — stochastic targets disagree with greedy drafts more often; acceptance drops. Speculation pairs best with low-temperature or greedy decode.
- Multi-tenant batching — sequences with speculation enabled compete for draft slots; mixed speculative/non-speculative batches need scheduler support.
- Quality-sensitive workloads — lossless speculation preserves the target distribution, but Medusa-style approximations may not; verify eval scores before shipping.
Run A/B tests on real traffic: compare p50/p95 latency, cost per 1M tokens, and task-specific quality metrics with speculation on vs off. A 2× micro-benchmark speedup can vanish under production batching if memory pressure shrinks concurrent sequences.
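The memory-pressure effect can be estimated before running any benchmark. The sketch below uses made-up sizes (an 80 GB GPU, 40 GB of target weights, a 2 GB draft model, and per-sequence KV footprints of 0.5 GB and 0.05 GB). All of them are assumptions for illustration, and the model ignores activations, fragmentation, and paging.

```python
def max_concurrent_sequences(vram_gb: float, weights_gb: float,
                             kv_gb_per_seq: float) -> int:
    """Sequences that fit once weights are resident, ignoring other overheads."""
    free_gb = vram_gb - weights_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // kv_gb_per_seq)


# Illustrative assumptions, not measurements.
VRAM_GB = 80.0
TARGET_WEIGHTS_GB = 40.0
DRAFT_WEIGHTS_GB = 2.0
TARGET_KV_GB = 0.5   # target KV cache per sequence
DRAFT_KV_GB = 0.05   # draft KV cache per sequence

baseline = max_concurrent_sequences(VRAM_GB, TARGET_WEIGHTS_GB, TARGET_KV_GB)
with_spec = max_concurrent_sequences(
    VRAM_GB, TARGET_WEIGHTS_GB + DRAFT_WEIGHTS_GB, TARGET_KV_GB + DRAFT_KV_GB
)

# Per-sequence speedup needed just to match baseline aggregate throughput,
# assuming per-sequence decode speed does not depend on batch size.
break_even = baseline / with_spec
print(baseline, with_spec, round(break_even, 2))
```

Under these assumptions the batch shrinks from 80 to 69 sequences, so speculation has to deliver more than about a 1.16× per-sequence speedup before aggregate throughput improves at all.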
Common mistakes
- Using a draft model from a different tokenizer family — token IDs do not line up between the two models, so acceptance metrics silently become garbage.
- Enabling speculation without monitoring acceptance rate — burning GPU on useless draft passes.
- Setting draft length k too high — verification cost grows while marginal acceptance per extra position falls.
- Expecting TTFT improvements — speculation only accelerates decode after prefill completes.
- Ignoring KV cache invalidation on rollback — buggy implementations leak stale draft states into target context.
- Applying speculation to embedding-only or encoder models — the technique is decode-specific for autoregressive transformers.
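The rollback mistake is easiest to see with a toy cache. In the sketch below, `ToyKVCache` and `rollback_after_verify` are hypothetical names, and a list of token IDs stands in for the per-layer key/value tensors that a real engine stores in paged blocks.

```python
class ToyKVCache:
    """Stand-in for a per-sequence KV cache with one entry per token position."""

    def __init__(self):
        self.entries = []

    def append(self, token_id: int) -> None:
        # A real cache would store key/value tensors; the token id stands in.
        self.entries.append(token_id)

    def truncate(self, length: int) -> None:
        # Drop every entry at position >= length.
        del self.entries[length:]


def rollback_after_verify(cache: ToyKVCache, base_len: int, n_accepted: int) -> None:
    """Keep the committed prefix plus accepted drafts; discard rejected positions."""
    cache.truncate(base_len + n_accepted)


cache = ToyKVCache()
for tok in (10, 11):            # committed context
    cache.append(tok)
base_len = len(cache.entries)

for tok in (20, 21, 22, 23):    # speculative entries written during drafting
    cache.append(tok)

rollback_after_verify(cache, base_len, n_accepted=2)
print(cache.entries)
```

Skipping the truncation leaves entries for rejected tokens in place, so later steps would attend to context the target never produced.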
Production checklist
- Baseline tokens/sec and p95 latency without speculation on representative prompts.
- Pair draft and target from the same model family with identical tokenizer and chat template.
- Measure acceptance rate per workload slice (chat, code, RAG, JSON).
- Tune draft length k (start at 4–5; sweep 2–8) for best tokens/sec per GPU.
- Confirm lossless verification or document acceptable approximation (Medusa) with eval regression tests.
- Account for extra VRAM — reduce max batch size if OOM errors appear.
- Disable speculation for short max_tokens requests via routing rules.
- Integrate with your serving engine's native speculation API — avoid hand-rolled draft loops in application middleware.
- Re-test after model upgrades — acceptance is not portable across checkpoint versions.
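Before sweeping k on hardware, a simple cost model can narrow the range. The sketch below follows the estimate used in the speculative decoding literature: expected tokens per round divided by the round's cost in units of one target pass. It assumes independent per-position acceptance `alpha`, a fixed `cost_ratio` (draft step time over target pass time), and that verifying k + 1 positions costs about one target pass. All function names are hypothetical, and measured tokens/sec should override the estimate.

```python
def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated decode speedup versus plain autoregressive decoding."""
    if alpha >= 1.0:
        expected_tokens = float(k + 1)
    else:
        expected_tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # One round costs k draft steps plus one target pass.
    round_cost = k * cost_ratio + 1.0
    return expected_tokens / round_cost


def best_k(alpha: float, cost_ratio: float, candidates=range(2, 9)) -> int:
    """Pick the draft length with the highest estimated speedup."""
    return max(candidates, key=lambda k: estimated_speedup(alpha, k, cost_ratio))


for alpha in (0.6, 0.8):
    k = best_k(alpha, cost_ratio=0.1)
    print(f"alpha={alpha}: best k={k}, "
          f"estimated speedup={estimated_speedup(alpha, k, 0.1):.2f}x")
```

Lower acceptance pushes the best k down, because the extra draft steps are paid on every round while the later positions are rarely reached.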
Key takeaways
- Autoregressive decode is bandwidth-bound; speculation reduces target-model steps by batch-verifying draft tokens.
- Well-matched draft models achieve 60–90% acceptance, translating to 1.5×–3× decode speedups without changing target output distribution (lossless mode).
- Variants like Medusa, EAGLE, and n-gram lookup trade implementation complexity for draft quality and memory footprint.
- Production wins require serving-engine integration, acceptance monitoring, and workload-aware enablement — not every request benefits.
- Speculation complements — does not replace — quantization, batching, and KV cache optimization in a full inference stack.
Related reading
- LLM inference serving explained — continuous batching, vLLM, TTFT vs throughput SLOs
- LLM KV cache explained — prefill vs decode, memory scaling, PagedAttention
- LLM sampling and decoding strategies — temperature, top-p, and how sampling affects draft acceptance
- LLM model quantization for inference — INT4/FP8 weights to free memory for draft models