
LLM inference warmup and cold start explained

Harbor Analytics rolled out a third chat replica during a quiet Tuesday window. Kubernetes marked the pod Ready as soon as the health endpoint returned 200, and the load balancer immediately sent traffic. The first twelve real user sessions on that replica saw time-to-first-token (TTFT) above eight seconds, four times the cluster baseline, while GPU utilization sat near zero. Support tickets spiked before anyone noticed that the pod had passed a shallow health check but never finished inference warmup. Weight tensors were still streaming from object storage, CUDA kernels were compiling on the first matmul shapes they encountered, and vLLM had not yet captured the CUDA graphs used for steady-state decode.

Cold start in LLM serving is the gap between “process is alive” and “first production request behaves like the ten-thousandth.” It is distinct from queueing delay or chunked prefill fairness issues: it happens even on an empty replica. This guide covers weight loading, kernel and graph warmup, readiness vs liveness probes, rollout patterns for data parallel pools, vLLM and gateway configuration, the Harbor Analytics deploy refactor, a technique decision table, pitfalls, and a production checklist.

What cold start actually costs

Cold start latency stacks several one-time or rare penalties before the model reaches steady state:

  • Weight load — reading multi-hundred-gigabyte checkpoints from NVMe, network-attached storage, or S3-compatible object stores into GPU HBM. Quantized FP8 or INT4 weights load faster but still dominate startup on 70B+ models.
  • Process and framework init — Python import time, CUDA context creation, NCCL process groups for tensor parallel replicas, tokenizer and config parsing.
  • Kernel compilation (JIT) — PyTorch/Triton/cuBLAS may compile specialized kernels the first time a particular batch size, sequence length, or head dimension appears. The penalty repeats when shapes change.
  • CUDA graph capture — engines like vLLM capture decode graphs for fixed token-batch shapes. First requests at each bucket pay graph-build cost unless you pre-warm those buckets.
  • KV cache allocator warmup — PagedAttention pools and memory arenas grow on first allocations; occasional page faults show up as TTFT spikes.
  • Compilation passes — torch.compile or vendor-specific graph optimizers add minutes on first boot if enabled without a persisted cache.

Users experience this as a single slow first reply or a burst of slow replies after deploy, scale-out, or replica restart — even when admission control shows an empty queue.
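To see which of the penalties above dominates on a given stack, it helps to time each startup phase separately. The following is a minimal sketch, not part of any serving framework: `ColdStartTimer` and the phase names are hypothetical, and the `time.sleep` calls stand in for real work such as reading a checkpoint.

```python
import time
from contextlib import contextmanager


class ColdStartTimer:
    """Accumulates wall-clock seconds per named startup phase."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock          # injectable so the timer can be tested
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.phases[name] = self.phases.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.phases.values())

    def dominant(self):
        """Name of the phase that consumed the most time."""
        return max(self.phases, key=self.phases.get)


timer = ColdStartTimer()
with timer.phase("weight_load"):
    time.sleep(0.02)                # stand-in for loading the checkpoint
with timer.phase("graph_capture"):
    time.sleep(0.005)               # stand-in for capturing decode graphs

for name, seconds in timer.phases.items():
    print(f"{name}: {seconds:.3f}s")
```

Exporting these per-phase durations as metrics makes it clear whether a slow boot is a storage problem or a compilation problem.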

Warmup vs readiness: do not confuse them

A liveness probe answers: is the process hung? A readiness probe answers: should the load balancer send traffic? For LLM workers, readiness must mean “model loaded and warmup requests completed,” not merely “HTTP port open.”
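The distinction can be sketched with two probe paths backed by one flag. This is an illustrative example using only the Python standard library; the port, the `serve_probes` helper, and the module-level `warmup_complete` event are assumptions, and a real worker would expose these paths from its existing HTTP server.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set by the warmup routine once the model is loaded and warmed.
warmup_complete = threading.Event()


def probe_status(path: str, warmed: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        return 200                       # liveness: the process can respond
    if path == "/ready":
        return 200 if warmed else 503    # readiness: safe to receive traffic
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, warmup_complete.is_set()))
        self.end_headers()

    def log_message(self, *args):        # keep probe traffic out of the logs
        pass


def serve_probes(port: int = 8081) -> HTTPServer:
    """Start the probe server on a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Keeping liveness cheap and unconditional while readiness depends on the flag means a long weight load never looks like a hung process, yet never receives user traffic either.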

Readiness patterns that work

  • Internal warmup job — on startup, the worker runs synthetic prefill+decode at representative context lengths (512, 2k, 8k tokens) and exits the initializing state only after P95 TTFT falls below a threshold.
  • Gradual traffic ramp — register the replica with weight 0 in the gateway, run canary traffic internally, then increase weight. Pair with canary deploy practices.
  • Pre-warmed standby — keep N+1 replicas hot; swap DNS or gateway backend lists during rollouts so users never hit a cold process.
  • Persistent compilation cache — mount a shared volume for Triton/torch compile artifacts so new pods skip recompilation when shapes match.

Harbor Analytics changed its readiness probe from GET /healthz to GET /ready, which returns 503 until an internal warmup_complete flag is set. The flag flips only after three synthetic chats at 512, 2048, and 8192 prompt tokens complete with TTFT under 1.2× the fleet median.
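A gate like the one described above might look like the sketch below. It is an assumption-laden illustration rather than Harbor Analytics' code: `run_warmup`, `measure_ttft` (a callable that sends one synthetic chat of the given prompt length and returns TTFT in seconds), and `max_attempts` are hypothetical names. Retrying is included because the first request at each length is expected to be slow.

```python
from typing import Callable, Iterable

PROMPT_LENGTHS = (512, 2048, 8192)   # prompt token counts from the text
TTFT_TOLERANCE = 1.2                 # allowed multiple of the fleet median


def run_warmup(
    measure_ttft: Callable[[int], float],
    fleet_median_ttft: float,
    prompt_lengths: Iterable[int] = PROMPT_LENGTHS,
    max_attempts: int = 5,
) -> bool:
    """Return True once every prompt length has met the TTFT limit."""
    limit = TTFT_TOLERANCE * fleet_median_ttft
    for length in prompt_lengths:
        for _ in range(max_attempts):
            if measure_ttft(length) < limit:
                break                # this length is warm, move on
        else:
            return False             # never got under the limit
    return True
```

The caller would set the readiness flag only when `run_warmup` returns True, and leave the pod unready (or fail it) otherwise.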

Weight loading strategies

Load time depends on bytes moved and on the bandwidth of the slowest link in the path (storage, network, PCIe/NVLink), not on parameter count alone. Production teams optimize along these axes:

  • Local NVMe cache — pull weights from object storage once per node into a daemon-managed cache; replicas on the same machine mmap the same files. Critical when running multiple DP replicas per host.
  • Staggered replica startup — launching four replicas simultaneously can saturate disk and network; serialize or jitter starts by 30–90 seconds.
  • Quantized checkpoints — FP8 or INT4 artifacts cut load bytes roughly 2×–4× relative to 16-bit weights, with acceptable quality on many chat workloads.
  • Tensorizer / serialized formats — vendor-specific serialized GPU-ready blobs skip Python unpickling overhead compared to raw safetensors on some stacks.
  • Lazy layer load — rare in latency-sensitive chat but useful for batch jobs: layers load on first use, which shortens process boot at the cost of first-request latency.
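A back-of-the-envelope estimate shows why the cache and quantization options above matter. The sketch below divides checkpoint size by the slowest link's throughput; the function name and all bandwidth figures are hypothetical placeholders, so substitute numbers measured on your own hardware.

```python
def estimate_load_seconds(checkpoint_gb, link_gb_per_s):
    """Return (seconds, bottleneck) for moving a checkpoint through a chain of links.

    link_gb_per_s maps a link name to its sustained throughput in GB/s.
    The slowest link bounds the whole transfer.
    """
    bottleneck = min(link_gb_per_s, key=link_gb_per_s.get)
    return checkpoint_gb / link_gb_per_s[bottleneck], bottleneck


# Hypothetical throughputs, for illustration only.
cold_path = {"object_store": 1.0, "nvme": 5.0, "pcie": 25.0}
cached_path = {"nvme": 5.0, "pcie": 25.0}

print(estimate_load_seconds(140.0, cold_path))    # 16-bit 70B, no local cache
print(estimate_load_seconds(140.0, cached_path))  # same weights from NVMe cache
print(estimate_load_seconds(70.0, cold_path))     # FP8 weights, no local cache
```

Under these assumed numbers the local cache helps more than quantization, because it removes the slowest link rather than halving the bytes sent over it.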

Measure time from pod schedule to ready separately from time from ready to steady TTFT. Autoscaling on queue depth is useless if new pods need four minutes before they absorb work.
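The cost of a slow path to ready can be made concrete with simple arithmetic: while a new pod warms up, any demand above current capacity accumulates as backlog. The helper below is a hypothetical sketch, and the request rates are invented for illustration.

```python
def backlog_during_cold_start(arrival_rps, capacity_rps, cold_start_s):
    """Requests that queue up while a new replica is still warming."""
    overload_rps = max(0.0, arrival_rps - capacity_rps)
    return overload_rps * cold_start_s


# Assumed: 12 req/s arriving, 10 req/s served, four-minute cold start.
print(backlog_during_cold_start(12.0, 10.0, 240.0))
```

If that backlog exceeds what the SLO tolerates, the scaling trigger has to fire earlier or the cold start has to get shorter.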

CUDA graphs and shape-bucket warmup

vLLM and similar engines capture CUDA graphs for decode steps at discrete batch sizes (number of concurrent sequences in the iteration). The first request that lands in bucket B pays graph capture; subsequent requests in bucket B reuse the graph.
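The bucket behavior described here can be modeled with a toy cache. This is not vLLM's implementation: `CAPTURE_SIZES`, `bucket_for`, and `GraphCache` are hypothetical names, and the model assumes a batch is served by the smallest captured size that can hold it.

```python
import bisect

CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]   # hypothetical bucket set, sorted


def bucket_for(batch_size, capture_sizes=CAPTURE_SIZES):
    """Smallest bucket that fits batch_size, or None if it exceeds them all."""
    i = bisect.bisect_left(capture_sizes, batch_size)
    return capture_sizes[i] if i < len(capture_sizes) else None


class GraphCache:
    """Tracks which buckets have been captured so far."""

    def __init__(self):
        self.captured = set()

    def run(self, batch_size):
        bucket = bucket_for(batch_size)
        if bucket is None:
            return "eager"              # too large for any graph
        if bucket not in self.captured:
            self.captured.add(bucket)
            return "capture"            # first use pays the one-time cost
        return "replay"                 # later uses reuse the graph


cache = GraphCache()
print([cache.run(n) for n in (3, 4, 5, 40)])
```

Batch sizes 3 and 4 share a bucket in this model, so only the first of them pays for capture.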

Warmup should exercise the buckets you expect in production:

  1. Run single-sequence decode (batch 1) — matches low-concurrency chat.
  2. Run batch sizes at the 25th, 50th, and 95th percentiles of your continuous-batching scheduler depth.
  3. Vary prompt lengths so prefill kernels for common context tiers compile once.
  4. If using speculative decoding, warm both draft and target model paths.
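Steps 1 and 2 can be turned into a concrete list of batch sizes from scheduler telemetry. The sketch below assumes you have a sample of observed scheduler depths (concurrent sequences per iteration); `nearest_rank` and `warmup_batch_sizes` are hypothetical helpers, and the percentile uses the nearest-rank definition with integer arithmetic.

```python
def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list, for integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def warmup_batch_sizes(observed_depths, pcts=(25, 50, 95)):
    """Batch sizes to warm: always 1, plus the chosen depth percentiles."""
    sizes = {1}
    sizes.update(nearest_rank(observed_depths, p) for p in pcts)
    return sorted(sizes)


# Assumed telemetry: depths 1 through 20 seen equally often.
print(warmup_batch_sizes(list(range(1, 21))))
```

Crossing the resulting batch sizes with the prompt-length tiers from step 3 gives the full set of synthetic requests to run before flipping readiness.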

Disable or defer CUDA graphs during development if you need dynamic shapes every request; in production, graphs are almost always worth the warmup investment for decode-heavy workloads.

Rollout patterns for replica pools

Cold start bites hardest during deploys and autoscale events. Patterns that hide it:

  • Blue/green with pre-warm — stand up the green pool, complete warmup, flip gateway weights, drain blue. Zero user-facing cold start.
  • Rolling update with maxUnavailable=0 — only remove an old replica after its replacement reports ready and warmup-complete.
  • Scale-ahead — add replicas before predictable spikes (product launch, Black Friday) so cold pods finish warming before demand arrives.
  • Pod disruption budgets — prevent Kubernetes from evicting half the fleet during node maintenance without pre-warmed replacements.
  • Connection draining — long-lived SSE streams pin to a replica; draining must wait for streams to end before killing GPUs mid-session.

Harbor Analytics now runs a pre-deploy job that starts one shadow replica, waits for warmup_complete, snapshots NVMe cache checksums, then rolls the fleet with 120-second stagger. P95 TTFT during deploys dropped from 6.1s to 1.4s.
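A staggered rollout like this reduces to a schedule of start offsets. The sketch below is an illustration only: `start_offsets` and its parameters are hypothetical, the 120-second default mirrors the stagger mentioned above, and optional jitter is added on top so replicas on different nodes do not start in lockstep.

```python
import random


def start_offsets(n_replicas, stagger_s=120.0, jitter_s=0.0, seed=None):
    """Seconds after rollout start at which each replica begins booting."""
    rng = random.Random(seed)
    return [i * stagger_s + rng.uniform(0.0, jitter_s) for i in range(n_replicas)]


def rollout_seconds(n_replicas, warmup_s, stagger_s=120.0):
    """Time until the last replica is warm, ignoring jitter."""
    return (n_replicas - 1) * stagger_s + warmup_s


print(start_offsets(4))
print(rollout_seconds(4, warmup_s=240.0))
```

The second function makes the trade-off visible: a longer stagger protects storage and network, but stretches the total deploy window linearly with fleet size.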

Harbor Analytics deploy refactor (worked example)

Problem: Chat gateway used naive round-robin across replicas. New pods joined the pool as soon as uvicorn listened. TTFT SLO violations clustered in the ten minutes after every deploy.

Changes:

  1. Added /ready gated on synthetic warmup (three prompt lengths, ten decode tokens each).
  2. Node-local weight cache populated by an init container before the vLLM process starts.
  3. Gateway backend registration delayed until readiness passes; initial weight set to 5% for five minutes, then 100%.
  4. Dashboards split pod ready time, warmup duration, and first-user TTFT per replica using labels from observability hooks.
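Change 3 can be sketched as a weight function keyed on time since readiness. The code below is a hypothetical illustration, not a real gateway API: it interprets "5%" as a relative backend weight of 0.05 against fully warmed backends at 1.0, which is an assumption about how the gateway applies weights.

```python
from typing import Dict, Optional

RAMP_SECONDS = 300.0     # five minutes at reduced weight
INITIAL_WEIGHT = 0.05    # relative weight during the ramp


def backend_weight(seconds_since_ready: Optional[float]) -> float:
    """Weight for one backend; None means readiness has not passed yet."""
    if seconds_since_ready is None:
        return 0.0
    if seconds_since_ready < RAMP_SECONDS:
        return INITIAL_WEIGHT
    return 1.0


def traffic_shares(ages: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Fraction of traffic each backend receives under weighted balancing."""
    weights = {name: backend_weight(age) for name, age in ages.items()}
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: w / total for name, w in weights.items()}


shares = traffic_shares(
    {"old-a": 3600.0, "old-b": 3600.0, "new": 60.0, "booting": None}
)
print(shares)
```

With two warmed replicas in the pool, the new one receives a little over 2% of requests during its ramp, which limits how many users can be affected if warmup missed a shape.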

Outcome: Deploy-window TTFT regressions fell below 5% of sessions; scale-out events no longer trigger support escalations. Weight cache hit rate on multi-replica nodes exceeded 94%.

Technique decision table

Scenario | Prefer | Avoid
Latency-sensitive chat API | Full warmup + strict readiness; blue/green deploys | TCP-only health checks; simultaneous multi-replica boot
Batch overnight jobs | Lazy load acceptable; amortize cold start over large batch | Expensive graph warmup for every shape bucket
Autoscale on queue depth | Scale-ahead + warmup-aware ready probe; cooldown after ready | Instant 100% traffic to new pods
Multi-replica per GPU node | Shared NVMe weight cache; staggered process start | Four parallel S3 downloads of the same 140GB checkpoint
Frequent shape changes (dynamic batch) | Bounded bucket set + warmup those buckets | CUDA graphs without bucket discipline
Serverless / scale-to-zero | Smaller model tier, persistent warm minimum, or accept cold penalty | 70B scale-to-zero without user-facing latency disclosure
torch.compile enabled | Persisted compile cache on shared volume | Fresh compile on every pod restart

Common pitfalls

  • Health check equals model ready. Port-open probes hide multi-minute weight loads.
  • Warmup with trivial prompts only. A 16-token synthetic prompt does not compile 8k-context prefill kernels.
  • Ignoring decode batch buckets. Warming batch-1 only leaves spikes when continuous batching reaches depth 16.
  • Thundering herd on object storage. Fleet-wide deploys without stagger can throttle S3 and extend everyone’s cold start.
  • Killing pods mid-warmup. Aggressive liveness probes during long weight loads restart pods in a crash loop.
  • No per-replica TTFT metric. Fleet averages mask a single cold replica poisoning round-robin.
  • Autoscale too reactive. Scaling down then immediately up replays full cold start; use hysteresis.
  • Assuming compile cache portability. CUDA driver or GPU SKU changes invalidate cached graphs — re-warm after hardware migrations.
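The per-replica metric pitfall is easy to demonstrate. The sketch below flags replicas whose median TTFT sits well above the fleet's median of medians; `cold_replicas`, the 1.5× factor, and the sample latencies are all hypothetical.

```python
from statistics import mean, median


def cold_replicas(ttft_by_replica, factor=1.5):
    """Names of replicas whose median TTFT exceeds factor x the fleet median."""
    medians = {r: median(v) for r, v in ttft_by_replica.items() if v}
    fleet = median(medians.values())
    return sorted(r for r, m in medians.items() if m > factor * fleet)


# Invented TTFT samples in seconds; replica "c" has just started.
samples = {
    "a": [0.50, 0.60, 0.55],
    "b": [0.50, 0.52, 0.60],
    "c": [8.20, 7.90, 0.70],
}

all_values = [t for v in samples.values() for t in v]
print(f"fleet mean: {mean(all_values):.2f}s")   # one blended number
print(cold_replicas(samples))                   # which replica is to blame
```

The blended fleet number shows that something is wrong but not where; grouping by replica points directly at the cold pod.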

Production checklist

  • Define readiness as model-loaded plus warmup-complete, not process-started.
  • Run synthetic prefill+decode at 25th/50th/95th percentile context lengths.
  • Warm CUDA graph buckets matching expected continuous-batching depths.
  • Cache weights on local NVMe; share cache across co-located replicas.
  • Stagger replica starts during deploys to avoid storage/network saturation.
  • Use blue/green or canary weights; ramp traffic only after warmup passes.
  • Track pod-ready time, warmup duration, and first-real-user TTFT separately.
  • Give liveness probes an initial delay (or use a startup probe) longer than the worst-case weight load on your SKU.
  • Persist torch/Triton compile artifacts when using compilation passes.
  • Load-test deploy and scale-out paths, not just steady-state throughput.
  • Document cold-start budget in SLO docs so product sets honest latency expectations.
  • Align autoscale cooldown with measured warmup time plus safety margin.

Key takeaways

  • Cold start is the latency tax between process boot and steady-state inference — weight load, JIT kernels, and CUDA graphs all contribute.
  • Readiness probes must gate traffic until warmup requests succeed at production-like shapes.
  • Data parallel scale-out multiplies cold starts unless you cache weights and stagger replica boot.
  • Harbor Analytics cut deploy-window TTFT spikes by pairing NVMe weight cache with warmup-gated readiness and gradual gateway weights.
  • Measure per-replica first-token latency; fleet averages hide poisoned cold replicas in round-robin pools.
