LLM CUDA graphs for decode inference explained

Harbor Analytics profiled a 13B chat replica under steady decode load. GPU utilization sat at 71%, yet per-token latency plateaued at roughly 22 ms even with an empty queue. Nsight Systems showed the bottleneck was not matmul arithmetic but the CPU launching hundreds of tiny CUDA kernels per decode step across attention, layernorm, and MLP layers. Each launch costs microseconds of host-side overhead; multiplied by the kernels per layer, the layer count, and the token rate, that overhead consumed roughly a third of wall-clock time. Enabling CUDA graphs on the decode path dropped median inter-token latency from 22 ms to 14 ms and raised sustainable throughput 38% without adding GPUs.

A CUDA graph records a fixed sequence of GPU operations once, then replays the entire sequence with a single launch call. For autoregressive decode — where each step processes one new token per sequence and layer shapes repeat — graphs eliminate per-kernel CPU dispatch. They pair naturally with continuous batching and PagedAttention when batch sizes fall into pre-captured buckets. This guide covers capture and replay mechanics, why prefill usually stays eager, shape-bucket design, memory tradeoffs, vLLM and engine configuration, the Harbor Analytics refactor, a technique decision table versus eager decode, pitfalls, and a production checklist.

What CUDA graphs do (and do not do)

In eager PyTorch execution, every operator in the forward pass issues a separate CUDA kernel launch from the host CPU. Transformer decode is launch-heavy: a 32-layer model can fire hundreds of kernels per token step, depending on how many operators are fused. At small batch sizes the GPU finishes each kernel quickly, so launch latency dominates.
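
A back-of-the-envelope estimate shows why launch latency dominates at small batch sizes. The sketch below multiplies an assumed kernel count by an assumed per-launch host cost. Both numbers are illustrative placeholders rather than measurements from the Harbor profile, and it assumes for simplicity that one graph launch costs about the same as one kernel launch.

```python
def host_launch_overhead_ms(launches_per_step: int, launch_cost_us: float) -> float:
    """Host-side dispatch time per decode step, in milliseconds."""
    return launches_per_step * launch_cost_us / 1000.0


# Assumed values for illustration only: 400 kernels per step, 5 us per launch.
eager_ms = host_launch_overhead_ms(launches_per_step=400, launch_cost_us=5.0)

# A graph replay submits the whole chain with a single launch.
graph_ms = host_launch_overhead_ms(launches_per_step=1, launch_cost_us=5.0)

print(f"eager dispatch: {eager_ms:.3f} ms/step")
print(f"graph replay:   {graph_ms:.3f} ms/step")
```

The GPU time for the kernels themselves is unchanged in both cases; only the host-side dispatch term shrinks.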

CUDA graph capture records the kernel sequence into a graph object. Replay submits the whole graph with one API call, amortizing host overhead across every kernel in the chain. Graphs do not change the math — they change how work reaches the GPU.
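
A minimal sketch of capture and replay, using PyTorch's torch.cuda.CUDAGraph and torch.cuda.graph APIs. The single Linear layer, the tensor shapes, and the function name are stand-ins chosen for illustration; a real engine captures the full decode forward pass. The function returns None when PyTorch or a CUDA device is unavailable.

```python
try:
    import torch
except ImportError:  # torch is optional for this sketch
    torch = None


def capture_and_replay_demo():
    """Capture a tiny forward pass once, then replay it on new data."""
    if torch is None or not torch.cuda.is_available():
        return None

    layer = torch.nn.Linear(64, 64).cuda().eval()
    # Static buffer: every replay reads from this exact allocation.
    static_in = torch.zeros(8, 64, device="cuda")

    with torch.no_grad():
        # Warm up on a side stream before capture, as the PyTorch docs recommend.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            for _ in range(3):
                layer(static_in)
        torch.cuda.current_stream().wait_stream(side)

        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            # Kernels are recorded here, not executed.
            static_out = layer(static_in)

        # New step: copy fresh data into the static buffer, then replay.
        new_batch = torch.randn(8, 64, device="cuda")
        static_in.copy_(new_batch)
        graph.replay()
        torch.cuda.synchronize()

        expected = layer(new_batch)  # eager reference for comparison
        return bool(torch.allclose(static_out, expected, atol=1e-4))


print(capture_and_replay_demo())
```

The replayed output lands in static_out, the tensor produced during capture, which is why callers read results from that buffer rather than from a fresh return value.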

Constraints that matter for LLMs

Static topology. The same operators must run in the same order with the same kernel arguments, including the memory addresses recorded at capture.
Fixed shapes per bucket. Batch size (number of active sequences in the iteration), head layout, and dtype must match the captured bucket. A new batch size triggers a new capture or falls back to eager.
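Address stability. With PyTorch-style capture, replay reads from and writes to the addresses recorded at capture, so engines copy each step's inputs into static buffers and keep those allocations alive. PagedAttention fits this model: block-table contents change from step to step while the block-table buffer and the physical KV pool stay at fixed addresses in managed arenas.
No data-dependent control flow. Host-side branches that depend on GPU results and change the kernel list break capture. Engines pad inactive slots instead.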
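
Padding inactive slots keeps the captured shape fixed while the number of live sequences varies. A minimal sketch, assuming a hypothetical pad token id of 0 and a boolean mask that downstream code would use to discard padded outputs:

```python
from typing import List, Tuple

PAD_TOKEN_ID = 0  # hypothetical padding id, chosen for illustration


def pad_to_bucket(token_ids: List[int], bucket: int) -> Tuple[List[int], List[bool]]:
    """Pad one decode step's token ids up to the captured bucket width."""
    if len(token_ids) > bucket:
        raise ValueError("batch is larger than the bucket")
    n_pad = bucket - len(token_ids)
    padded = token_ids + [PAD_TOKEN_ID] * n_pad
    active = [True] * len(token_ids) + [False] * n_pad
    return padded, active


padded, active = pad_to_bucket([101, 102, 103], bucket=4)
print(padded, active)
```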

These constraints are why graphs shine on decode (repeated one-token steps) and rarely on prefill (variable prompt lengths, changing attention matrices).

Decode vs prefill: where graphs belong

LLM inference splits into two phases with different compute profiles:

Phase	Tokens per step	Shape behavior	Graph fit
Prefill (prompt processing)	Many tokens at once	Sequence length varies every request	Poor — eager or chunked prefill
Decode (autoregressive)	One new token per sequence	Batch size changes slowly within buckets	Excellent with bucket warmup

Prefill is compute-bound matmul on large token matrices; chunked prefill schedules variable prompt chunks without graph capture. Decode is memory-bandwidth and launch-overhead sensitive — exactly where graphs help. Production stacks therefore run eager or partially fused prefill plus graphed decode in the same worker.

Shape buckets and batch-size quantization

Serving engines like vLLM capture separate graphs for discrete batch sizes — the count of sequences participating in a decode iteration. When live batch size is 37 but only buckets 32 and 64 exist, the scheduler either pads to 64 (wasting compute) or runs eager for that step.
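
The pad-or-fallback decision can be sketched as a lookup for the smallest captured bucket that fits the live batch. The function name and the None-means-eager convention are assumptions for illustration, not vLLM's internal API.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def select_bucket(batch_size: int, buckets: Sequence[int]) -> Optional[int]:
    """Return the smallest bucket >= batch_size, or None for eager fallback.

    `buckets` must be sorted in ascending order.
    """
    i = bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None


print(select_bucket(37, [32, 64]))  # pads 37 up to the 64 bucket
print(select_bucket(65, [32, 64]))  # no bucket fits, so run eager
```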

Bucket design principles

Cover the mode, not every integer. Log-spaced buckets (1, 2, 4, 8, 16, 32, 64, 128) match typical continuous batching concurrency without capturing 128 separate graphs.
Pre-warm at deploy. Run synthetic decode at each bucket during warmup so the first real user does not pay capture latency.
Cap max bucket below OOM. Graph capture allocates scratch buffers; the largest bucket must leave headroom for KV growth and prefix-cache blocks.
Align with scheduler max batch. If max_num_seqs is 256 but you only capture to 128, sizes 129–256 always eager-fallback.
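
The ladder and alignment principles above can be sketched as two small helpers: one that builds a power-of-two ladder and one that reports which batch sizes would always fall back to eager. Both function names are hypothetical.

```python
from typing import List, Optional, Tuple


def log_spaced_buckets(max_bucket: int) -> List[int]:
    """Power-of-two ladder: 1, 2, 4, ... up to max_bucket."""
    buckets, size = [], 1
    while size <= max_bucket:
        buckets.append(size)
        size *= 2
    return buckets


def uncovered_batch_sizes(buckets: List[int], max_num_seqs: int) -> Optional[Tuple[int, int]]:
    """Inclusive range of batch sizes above the largest bucket, or None if covered."""
    largest = max(buckets)
    if max_num_seqs <= largest:
        return None
    return (largest + 1, max_num_seqs)


ladder = log_spaced_buckets(128)
print(ladder)
print(uncovered_batch_sizes(ladder, max_num_seqs=256))
```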

Harbor Analytics configured buckets at 1, 2, 4, 8, 16, 32, 48, 64 matching their P95 concurrent decode width. Padding waste averaged 4.2% of decode FLOPs — acceptable versus 30%+ launch overhead before graphs.
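
Padding waste can be estimated from a histogram of live batch sizes. The sketch below counts padded slots as a rough proxy for decode FLOPs. The bucket ladder is Harbor's; the histogram is invented for illustration and is not Harbor's data.

```python
from bisect import bisect_left
from typing import Dict, Sequence


def padding_waste_fraction(batch_hist: Dict[int, int], buckets: Sequence[int]) -> float:
    """Share of graphed decode slots that are padding.

    batch_hist maps live batch size -> number of decode steps observed.
    Steps larger than every bucket run eager and are excluded.
    """
    padded_slots = 0
    wasted_slots = 0
    for batch_size, steps in batch_hist.items():
        i = bisect_left(buckets, batch_size)
        if i == len(buckets):
            continue  # eager fallback: no padding involved
        padded_slots += buckets[i] * steps
        wasted_slots += (buckets[i] - batch_size) * steps
    return wasted_slots / padded_slots if padded_slots else 0.0


HARBOR_BUCKETS = [1, 2, 4, 8, 16, 32, 48, 64]
hypothetical_hist = {30: 500, 32: 300, 40: 150, 60: 50}  # made-up traffic
waste = padding_waste_fraction(hypothetical_hist, HARBOR_BUCKETS)
print(f"padding waste: {waste:.1%}")
```

Running the same calculation against logged production batch sizes shows whether adding an intermediate bucket, such as Harbor's 48, pays for its extra capture memory.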

Pairing graphs with PagedAttention and prefix cache

PagedAttention stores KV cache in fixed-size blocks with a per-sequence block table. Decode graphs read the block table from a buffer whose contents are refreshed before each replay, while the captured kernel topology stays fixed. When sequences grow, new blocks append; the graph does not re-capture unless batch size changes buckets.

Prefix caching adds complexity: shared prefix blocks mean sequences with different logical lengths may share physical KV pages. Engines must ensure graph capture includes the block-table gather kernels used for paged attention and that cache hits do not introduce eager-only code paths. A common pitfall is disabling graphs when prefix hit rate exceeds a threshold — fix the kernel path instead.

Memory tradeoff: each captured bucket holds graph-private scratch space. A ladder of eight buckets topping out at 64-wide decode can add hundreds of megabytes of reserved VRAM. Monitor torch.cuda.max_memory_allocated and torch.cuda.memory_reserved after capture completes, not just after model load.
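
One way to see the capture cost is to snapshot allocator statistics after model load and again after all buckets are captured, then compare. A minimal sketch using PyTorch's allocator counters; the helper name and the dictionary keys are assumptions, and the function returns None when no CUDA device is present.

```python
try:
    import torch
except ImportError:  # torch is optional for this sketch
    torch = None


def vram_snapshot_mib():
    """Current, reserved, and peak allocator usage on the default device, in MiB."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated": torch.cuda.memory_allocated() / mib,
        "reserved": torch.cuda.memory_reserved() / mib,
        "peak_allocated": torch.cuda.max_memory_allocated() / mib,
    }


after_load = vram_snapshot_mib()
# ... capture all buckets here ...
after_capture = vram_snapshot_mib()
if after_load is not None and after_capture is not None:
    delta = after_capture["reserved"] - after_load["reserved"]
    print(f"reserved VRAM added by capture: {delta:.0f} MiB")
```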

vLLM and engine configuration

vLLM exposes CUDA graph usage through environment and CLI flags (exact names vary by version; check your release notes):

--enforce-eager — disables graphs entirely for debugging; use only in dev.
max_num_seqs / max_num_batched_tokens — upper bounds that must align with largest graph bucket.
Graph padding policy — whether to pad partial batches up to the next bucket or fall back to eager.
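
A deploy-time sanity check can catch the two misconfigurations these settings most often produce. The sketch below takes a plain dictionary whose keys mirror the option names above; the checker itself is hypothetical and is not part of vLLM.

```python
from typing import List


def check_graph_config(engine_cfg: dict, buckets: List[int]) -> List[str]:
    """Return human-readable problems; an empty list means the config looks consistent."""
    problems = []
    if engine_cfg.get("enforce_eager", False):
        problems.append("enforce_eager is set: CUDA graphs are disabled")
    max_seqs = engine_cfg.get("max_num_seqs")
    if max_seqs is not None and buckets and max_seqs > max(buckets):
        problems.append(
            f"max_num_seqs={max_seqs} exceeds largest bucket {max(buckets)}: "
            "larger batches fall back to eager"
        )
    return problems


cfg = {"enforce_eager": True, "max_num_seqs": 256}
for problem in check_graph_config(cfg, [1, 2, 4, 8, 16, 32, 64, 128]):
    print("WARNING:", problem)
```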

When upgrading vLLM, re-benchmark decode latency: graph capture logic changes between minor releases. Pair graph rollout with SLO dashboards tracking P50/P99 inter-token latency separately from TTFT (which is prefill-dominated).

Tensor-parallel replicas capture graphs per rank. All ranks must capture the same bucket set; otherwise collective operations during decode can deadlock. Warmup jobs should run at the same TP width as production.

Harbor Analytics decode refactor

Before the refactor, Harbor ran eager decode on a 13B model with continuous batching enabled but graphs off (leftover from a debugging session). Fleet P50 inter-token latency was 22 ms at batch 32; Nsight showed 31% of CPU time in cudaLaunchKernel. After enabling graph capture with eight pre-warmed buckets and fixing a prefix-cache branch that forced eager fallback on shared prompts, P50 fell to 14 ms and P99 from 41 ms to 26 ms. Throughput per A100 rose from 1,420 to 1,960 tokens/s without queue depth changes.
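
The headline percentages follow directly from the before and after figures quoted above. A quick check of the arithmetic:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


throughput_gain = pct_change(1420, 1960)  # tokens/s per A100
p50_change = pct_change(22, 14)           # ms inter-token latency
p99_change = pct_change(41, 26)           # ms inter-token latency

print(f"throughput: {throughput_gain:+.1f}%")
print(f"P50 latency: {p50_change:+.1f}%")
print(f"P99 latency: {p99_change:+.1f}%")
```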

The team added a metric decode_graph_hit_rate — share of decode steps that replayed a graph versus eager fallback. Target: above 92%. Alerts fire below 85%, usually indicating a new batch-size spike or a regression in prefix-cache integration.
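
A minimal sketch of such a counter, using the 92% target and 85% alert threshold above. The class and method names are assumptions; a production version would export the rate to the metrics system rather than return a string.

```python
from dataclasses import dataclass


@dataclass
class DecodeGraphHitRate:
    replayed: int = 0          # decode steps served by graph replay
    eager: int = 0             # decode steps that fell back to eager
    target: float = 0.92       # healthy above this rate
    alert_below: float = 0.85  # page on-call below this rate

    def record(self, used_graph: bool) -> None:
        if used_graph:
            self.replayed += 1
        else:
            self.eager += 1

    @property
    def rate(self) -> float:
        total = self.replayed + self.eager
        # With no steps recorded yet, report 1.0 so an idle replica does not alert.
        return self.replayed / total if total else 1.0

    def status(self) -> str:
        if self.rate < self.alert_below:
            return "alert"
        if self.rate > self.target:
            return "ok"
        return "below_target"


metric = DecodeGraphHitRate()
for step in range(100):
    metric.record(used_graph=(step % 10 != 0))  # every tenth step falls back
print(metric.rate, metric.status())
```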

Technique decision table

Goal	Prefer	Avoid
Minimize decode inter-token latency	CUDA graphs with bucket warmup	Eager decode at scale
Variable prompt prefill TTFT	Chunked eager prefill	Graph capture on prefill
Debugging numerics / NaNs	--enforce-eager temporarily	Graphs hiding kernel order bugs
Highly dynamic batch (0–200 spikes)	More buckets + padding tolerance	Single-bucket graph only
VRAM-constrained small GPU	Fewer, smaller max buckets	Full 1–128 bucket ladder
Multi-tenant prefix sharing	Graphs with a unified prefix-cache path (e.g., RadixAttention)	Eager fallback on cache hit
Cold start after deploy	Bucket warmup in readiness probe	Traffic before capture completes
torch.compile experimentation	Pick compile or graphs per path	Stacking both on same decode without testing

Common pitfalls

Graphs left off in production. A debug flag from weeks ago silently caps throughput.
No bucket warmup. First users after deploy pay capture latency spikes mistaken for model slowness.
Bucket ladder too sparse. Frequent eager fallback between 32 and 64 when typical batch is 40.
Ignoring padding cost. Padding 3 active sequences to 64 can cost more than running those 3 sequences eagerly.
Mixing prefill and decode metrics. TTFT improvements from prefill tuning do not prove decode graphs work.
VRAM surprise after capture. Each bucket reserves scratch; OOM appears only under peak concurrency.
Prefix-cache eager branch. Shared system prompts force fallback and tank hit rate.
Upgrade without re-benchmark. Engine changes alter capture rules; graphs may be silently disabled for new dtypes.

Production checklist

Confirm CUDA graphs enabled in production config (not enforce-eager).
Define bucket ladder aligned to P95 concurrent decode batch size.
Run bucket warmup in readiness probe before accepting traffic.
Instrument decode_graph_hit_rate; target above 92% and alert below 85%.
Track inter-token latency separately from TTFT on dashboards.
Profile with Nsight if GPU util is low but latency high — check launches.
Verify prefix-cache code paths do not disable graphs on hits.
Measure VRAM after full bucket capture, not only after weight load.
Re-benchmark after vLLM or CUDA driver upgrades.
Document bucket ladder and max_num_seqs in runbooks for on-call.
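
The warmup and readiness items in the checklist can be sketched as a small gate that reports ready only after every bucket has been captured. The class name and the capture callback are hypothetical; in practice the callback would run a synthetic decode step at the given batch size. Capturing the largest bucket first is a choice made here so that an out-of-memory failure surfaces early, not a requirement.

```python
from typing import Callable, Iterable, List


class DecodeReadiness:
    """Tracks bucket warmup so a readiness probe can hold traffic until it finishes."""

    def __init__(self, buckets: Iterable[int], capture_fn: Callable[[int], None]):
        self._pending: List[int] = sorted(buckets, reverse=True)
        self._capture_fn = capture_fn
        self.captured: List[int] = []

    def warm_up(self) -> None:
        while self._pending:
            bucket = self._pending[0]
            # If capture raises, the bucket stays pending and the probe stays red.
            self._capture_fn(bucket)
            self._pending.pop(0)
            self.captured.append(bucket)

    def is_ready(self) -> bool:
        return not self._pending


calls: List[int] = []
gate = DecodeReadiness([1, 2, 4, 8], capture_fn=calls.append)
print(gate.is_ready())  # before warmup
gate.warm_up()
print(gate.is_ready(), calls)
```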

Key takeaways

CUDA graphs cut CPU kernel-launch overhead on decode by recording a fixed operator chain once and replaying it in a single dispatch.
Graphs fit autoregressive decode with stable per-step shapes; prefill stays eager or chunked because prompt lengths vary.
Batch-size buckets, padding policy, and warmup determine graph hit rate — sparse buckets cause eager fallback.
PagedAttention and prefix caching work with graphs when block-table kernels stay on the captured path.
Harbor Analytics raised decode throughput 38% by enabling graphs, pre-warming eight buckets, and fixing prefix-cache eager fallback.