LLM inference SLO and capacity planning explained
Harbor Analytics launched a marketing campaign that tripled concurrent chat sessions on a Tuesday afternoon. GPU utilization on the 70B fleet peaked at 62% — well below the 85% alert threshold — yet p95 time-to-first-token (TTFT) blew past four seconds and support tickets flooded in. The dashboards looked healthy because the team had sized capacity on tokens per second alone, ignoring queue wait, long-context KV memory pressure, and the gap between average batch depth and worst-case prefill shapes. Capacity planning for LLM inference is not the same as sizing a stateless REST API.
An inference SLO is a user-visible latency or availability promise backed by measurable signals: TTFT, inter-token latency (ITL), end-to-end response time, error rate, and queue wait. Capacity planning translates those SLOs into GPU count, memory headroom, concurrency limits, and autoscale triggers. This guide covers metric definitions, Little’s law for queue sizing, KV memory budgeting, throughput vs latency trade-offs, burst headroom, admission control alignment, the Harbor Analytics Black Friday refactor, a technique decision table, pitfalls, and a production checklist.
The metrics that actually matter
Raw GPU utilization is a poor primary SLO driver. A replica can sit at 40% util while requests queue because prefill kernels monopolize SMs or KV pages are exhausted. Track these layers separately:
- TTFT (time to first token) — wall clock from request accepted to first streamed token. Includes queue wait, prefill compute, and scheduler overhead. Interactive chat SLOs often target p95 TTFT under 800ms–2s depending on model size.
- ITL / TPOT (inter-token latency, also reported as time per output token) — milliseconds between consecutive output tokens during decode, usually tracked at p50 and p95. Users perceive “stutter” when ITL spikes even if TTFT was fine.
- End-to-end latency — full response time including all tokens. Dominated by output length; pair with max-tokens policies.
- Queue wait — time spent in gateway or engine queue before GPU work starts. Often the hidden killer during bursts.
- Tokens per second (throughput) — aggregate output tokens/sec across replicas. Useful for cost and fleet sizing, not user experience alone.
- Prefill tokens per second — input-side throughput; long-context RAG queries are prefill-heavy and can starve decode even at moderate util.
- KV cache utilization — fraction of the KV block pool (PagedAttention blocks) currently in use. Hitting 95%+ triggers preemption, batch shrinking, or OOM.
- Batch depth — concurrent sequences in the continuous batching iteration. Higher depth raises throughput but can worsen ITL for any single stream.
Define SLOs per product surface: a copilot sidebar and a batch document summarizer should not share the same p95 TTFT target. Document error budgets (e.g. 0.1% of sessions may exceed 3s TTFT per month) so on-call knows when to freeze deploys.
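As a minimal sketch of how the layered metrics and the error budget above could be computed from request logs: the `RequestTiming` record, the nearest-rank `percentile` helper, and the `error_budget_burn` function are illustrative names, not part of any serving engine. The defaults mirror the example budget (0.1% of requests over 3s TTFT), and counting per request rather than per session is a simplifying assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    queue_wait_s: float   # time in the gateway or engine queue before GPU work starts
    engine_ttft_s: float  # prefill compute plus scheduler overhead

    @property
    def ttft_s(self) -> float:
        # User-visible TTFT includes the queue wait, not just engine time.
        return self.queue_wait_s + self.engine_ttft_s


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_budget_burn(timings, threshold_s=3.0, budget_fraction=0.001):
    """Share of the error budget consumed; 1.0 means the budget is spent."""
    if not timings:
        return 0.0
    violations = sum(1 for t in timings if t.ttft_s > threshold_s)
    return violations / (budget_fraction * len(timings))
```

Reporting `percentile` separately for `queue_wait_s` and `engine_ttft_s` shows which layer is consuming the budget when the combined TTFT drifts.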
Little’s law and queue sizing
Little’s law states L = λ × W: the average number of items in a system equals the arrival rate times the average time each item spends in the system. For inference:
- L = average concurrent in-flight requests (queued + GPU-active)
- λ = new requests per second
- W = average end-to-end latency per request
If chat receives 50 req/s and each request takes 6s wall-to-wall, you need capacity for ~300 concurrent sessions on average, before burst multipliers. Queue depth explodes when λ spikes but GPU throughput (tokens/sec) is fixed.
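The arithmetic can be sketched in a few lines. The function names, `slots_per_replica` (concurrent sequences one replica can hold), and the burst multiplier are hypothetical planning inputs, not values from a real fleet.

```python
import math


def concurrent_requests(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: L = lambda * W."""
    return arrival_rate_per_s * latency_s


def replicas_needed(arrival_rate_per_s: float, latency_s: float,
                    slots_per_replica: int, burst_multiplier: float = 1.0) -> int:
    """Replicas required to hold the average concurrency, scaled for bursts."""
    concurrency = concurrent_requests(arrival_rate_per_s, latency_s) * burst_multiplier
    return math.ceil(concurrency / slots_per_replica)
```

With the figures above, `concurrent_requests(50, 6)` gives 300 in-flight requests. Assuming 64 slots per replica and a 1.5x burst multiplier, `replicas_needed(50, 6, 64, 1.5)` rounds 450 / 64 up to 8 replicas.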
Practical queue targets
- Cap gateway queue depth so p95 queue wait stays under 10–15% of TTFT SLO.
- When queue depth exceeds threshold for 30s, trigger scale-out — not when GPU hits 90%, which may be too late after accounting for cold start.
- Separate prefill and decode queues if using disaggregated serving; each tier has its own Little’s law balance.
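One way to turn these targets into numbers, sketched under stated assumptions: the depth cap applies Little’s law to the queue alone (depth ≈ drain rate × allowed wait), which bounds the mean wait rather than p95, so treat the result as an upper limit to tighten after load testing. `max_queue_depth` and `SustainedBreach` are illustrative names.

```python
def max_queue_depth(drain_rate_per_s: float, ttft_slo_s: float,
                    wait_fraction: float = 0.15) -> int:
    """Queue depth that keeps the mean wait within wait_fraction of the TTFT SLO."""
    return int(drain_rate_per_s * ttft_slo_s * wait_fraction)


class SustainedBreach:
    """Fires only after depth has stayed above the threshold for hold_s seconds."""

    def __init__(self, threshold: int, hold_s: float = 30.0):
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_started = None

    def observe(self, depth: int, now_s: float) -> bool:
        if depth <= self.threshold:
            self._breach_started = None  # breach ended; reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now_s
        return now_s - self._breach_started >= self.hold_s
```

Feeding `observe` from the gateway’s queue-depth gauge gives a scale-out trigger that ignores brief spikes but reacts before GPU utilization climbs.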
GPU memory budgeting
VRAM is the hard constraint that caps concurrent long-context sessions. A simplified budget:
VRAM ≈ model_weights + (kv_bytes_per_token × context_length × concurrent_sequences) + activations_overhead
For a 70B model in FP16, weights alone consume ~140GB before KV. Tensor parallelism splits weights across GPUs but KV still scales with total concurrent tokens. Planning steps:
- Measure kv_bytes_per_token for your model and TP degree using engine metrics (vLLM exposes KV cache usage).
- Estimate concurrent sequences at p95 context length — not average. RAG workloads often have bimodal lengths (short chat + 32k doc paste).
- Reserve 10–20% VRAM headroom for fragmentation, graph capture, and batch growth. Running at 98% KV utilization guarantees tail latency spikes.
- Model the effect of FP8 weights (smaller footprint, more room for KV) vs quality trade-offs.
If memory is the bottleneck, adding GPUs via data parallel replicas increases concurrent slot count linearly; tensor parallel reduces per-GPU weight footprint but does not multiply KV capacity per replica.
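A rough sanity check for the budget above, before measuring with engine metrics. The architecture numbers in the usage note (80 layers, 8 KV heads, head dimension 128, as in a Llama-2-70B-style model with grouped-query attention) and the 20GB activation overhead are assumptions for illustration; substitute the values for your own model and hardware.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Factor of 2: each layer stores one key and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(total_vram_gb: float, weights_gb: float,
                             overhead_gb: float, kv_per_token_bytes: int,
                             context_tokens: int, headroom: float = 0.15) -> int:
    """Sequences that fit in one replica at the given context length."""
    usable_gb = total_vram_gb * (1 - headroom) - weights_gb - overhead_gb
    if usable_gb <= 0:
        return 0
    kv_per_sequence = kv_per_token_bytes * context_tokens
    return int(usable_gb * 1e9 // kv_per_sequence)
```

Under those assumptions, `kv_bytes_per_token(80, 8, 128)` is 327,680 bytes (320 KiB) per token in FP16. On a hypothetical 8 × 80GB replica with 140GB of weights and 15% headroom, `max_concurrent_sequences(640, 140, 20, 327680, 8192)` yields 143 sequences at 8k context, which is the kind of figure to cross-check against measured KV cache usage.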
Throughput vs latency trade-offs
Serving engines expose knobs that shift the Pareto frontier:
- max_num_seqs / max concurrency — higher values improve fleet tokens/sec but widen ITL variance.
- max_num_batched_tokens — larger batches improve prefill efficiency; pair with chunked prefill to avoid starving short requests.
- Speculative decoding — raises effective tokens/sec at cost of extra draft-model VRAM and verification overhead.
- Quantization — frees memory for more concurrent slots or allows smaller GPU SKUs.
Load-test at expected concurrency with realistic prompt distributions. A benchmark that only runs batch-1, 512-token prompts will overstate capacity for production RAG.
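A load test needs a prompt mix that resembles production. The sketch below draws a bimodal length distribution (short chat turns plus long document pastes); the ranges, the 40% long-prompt share, and the function name are placeholders to be replaced with values from your own traffic logs.

```python
import random


def sample_prompt_lengths(n: int, long_fraction: float = 0.4,
                          short_range=(200, 2000), long_range=(8000, 32000),
                          seed: int = 0):
    """Return n prompt lengths in tokens drawn from a short/long mixture."""
    rng = random.Random(seed)  # fixed seed keeps load-test runs comparable
    lengths = []
    for _ in range(n):
        low, high = long_range if rng.random() < long_fraction else short_range
        lengths.append(rng.randint(low, high))
    return lengths
```

Comparing the mean of the sampled lengths with their 95th percentile makes the gap between average-based and tail-based planning visible before any GPU time is spent.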
Autoscaling and burst headroom
Reactive autoscale on CPU or GPU util fails for LLMs because:
- New replicas need minutes to load weights and warm up.
- Queue depth rises before util does — requests wait while GPUs look idle.
- Overly aggressive scale-down replays the full cold start on the next uptick.
Signals that work better
- p95 TTFT or queue wait exceeding SLO for N consecutive minutes
- KV cache utilization above 80% sustained
- Admission reject rate (429) trending up
- Scheduler batch depth pegged at max_num_seqs
Maintain burst headroom: if steady state needs 4 replicas, run 5 during business hours or keep N+1 warm. For predictable events (launches, Black Friday), scale-ahead 24–48 hours and validate with synthetic load, not spreadsheet extrapolation.
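The leading signals above can be combined into a single scale-out check. This is a sketch, not a specific autoscaler’s API: `FleetSignals`, the 1% reject-rate limit, and the function names are assumptions, and the caller is expected to require the breach to persist for N consecutive minutes before acting.

```python
from dataclasses import dataclass


@dataclass
class FleetSignals:
    p95_ttft_s: float
    p95_queue_wait_s: float
    kv_utilization: float  # 0.0 to 1.0
    reject_rate: float     # fraction of requests answered with 429
    batch_depth: int
    max_num_seqs: int


def scale_out_reasons(s: FleetSignals, ttft_slo_s: float, queue_wait_budget_s: float,
                      kv_limit: float = 0.80, reject_limit: float = 0.01):
    """Return the breached signals; an empty list means no scale-out is needed."""
    reasons = []
    if s.p95_ttft_s > ttft_slo_s:
        reasons.append("ttft")
    if s.p95_queue_wait_s > queue_wait_budget_s:
        reasons.append("queue_wait")
    if s.kv_utilization > kv_limit:
        reasons.append("kv_cache")
    if s.reject_rate > reject_limit:
        reasons.append("rejects")
    if s.batch_depth >= s.max_num_seqs:
        reasons.append("batch_depth")
    return reasons


def warm_replicas(steady_state: int, spare: int = 1) -> int:
    """N+1 headroom: steady-state replicas plus warm spares."""
    return steady_state + spare
```

Returning the list of reasons rather than a boolean keeps the scaling decision explainable in alerts and postmortems.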
Harbor Analytics Black Friday refactor (worked example)
Problem: Chat SLO was “p95 TTFT < 1.5s.” Capacity plan assumed 2,000 tokens/sec fleet-wide throughput based on offline benchmarks. Campaign traffic hit 1,400 concurrent sessions with 40% of prompts above 8k tokens. TTFT p95 reached 5.2s; GPUs showed 58% average util.
Root causes:
- Queue wait averaged 2.1s but was not in the SLO dashboard.
- KV cache hit 97% on two replicas; scheduler preempted batches, doubling ITL.
- Autoscale triggered at 85% GPU util with 6-minute cold-start penalty.
- No per-tenant concurrency cap; one API partner flooded the shared pool.
Changes:
- Redefined SLO as TTFT = queue_wait + engine_TTFT, with separate alerts on each component.
- Added KV cache utilization and batch depth to observability dashboards.
- Set max_num_seqs per replica from memory budget at p95 context, not peak throughput lab tests.
- Implemented tiered admission control: soft queue at 200 depth, hard 429 at 400, partner-specific quotas.
- Scale-ahead job added two replicas 36 hours before known traffic spikes; autoscale now uses queue wait + TTFT, with 15-minute cooldown after warmup.
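The tiered admission control could look roughly like the sketch below. It is not Harbor Analytics’ implementation: the class, the per-tenant quota values, and the reading of the soft tier as "admit but mark as degraded" are assumptions; only the 200 and 400 depth thresholds come from the changes listed above.

```python
from collections import defaultdict


class AdmissionController:
    def __init__(self, soft_depth: int = 200, hard_depth: int = 400,
                 tenant_quotas=None, default_quota: int = 50):
        self.soft_depth = soft_depth
        self.hard_depth = hard_depth
        self.tenant_quotas = tenant_quotas or {}
        self.default_quota = default_quota
        self.in_flight = defaultdict(int)  # tenant -> currently admitted requests

    def admit(self, tenant: str, queue_depth: int) -> str:
        """Return 'accept', 'soft' (admitted past the soft limit), or 'reject' (429)."""
        if queue_depth >= self.hard_depth:
            return "reject"
        quota = self.tenant_quotas.get(tenant, self.default_quota)
        if self.in_flight[tenant] >= quota:
            return "reject"  # tenant is over its concurrency quota
        self.in_flight[tenant] += 1
        return "soft" if queue_depth >= self.soft_depth else "accept"

    def release(self, tenant: str) -> None:
        """Call when a request finishes so the tenant's slot is freed."""
        if self.in_flight[tenant] > 0:
            self.in_flight[tenant] -= 1
```

Checking the per-tenant quota before the shared queue fills is what keeps one partner from exhausting the pool; the "soft" result gives the gateway a hook for lower priority or a shorter max-tokens limit.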
Outcome: Next major campaign held p95 TTFT at 1.3s with peak 2,100 concurrent sessions. Fleet tokens/sec actually dropped 8% (lower max_num_seqs) but user-visible latency improved because queue wait fell 74%.
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Interactive chat SLO | TTFT + ITL SLOs; queue wait in dashboard; N+1 headroom | Tokens/sec-only capacity model |
| Long-context RAG | KV memory budget at p95 context; chunked prefill; context limits | Average context length in planning |
| Predictable traffic spike | Scale-ahead + load test; warmup-gated replicas | Reactive autoscale on GPU util alone |
| Multi-tenant API | Per-tenant concurrency quotas; fair queuing | Single shared queue with no isolation |
| Cost-constrained fleet | Quantization + right-sized max_num_seqs; model routing to smaller SLM | Over-provisioning without SLO measurement |
| Batch overnight jobs | Separate pool; maximize throughput; loose TTFT SLO | Sharing GPUs with latency-sensitive chat |
Common pitfalls
- SLO on engine TTFT only. Gateway queue wait is invisible until users complain.
- Benchmarking at batch-1. Production continuous batching behaves differently.
- Planning on mean context length. A few 32k prompts can exhaust KV while averages look fine.
- Ignoring cold start in autoscale. New pods join too late and too cold.
- 100% KV utilization targets. Leave headroom for batch growth and fragmentation.
- One SLO for all tenants. Batch and chat need different pools or priorities.
- GPU util as sole scale signal. Queue depth and TTFT lead util by minutes.
- No error budget process. Teams ship features while SLO has been red for weeks.
Production checklist
- Define p50/p95/p99 TTFT, ITL, and end-to-end latency SLOs per product surface.
- Measure queue wait separately from engine compute time.
- Dashboard KV cache utilization, batch depth, and prefill vs decode tokens/sec.
- Build capacity model: Little’s law concurrency at peak λ and W.
- Budget VRAM at p95 context length with 10–20% headroom.
- Load-test with realistic prompt length distribution and concurrency.
- Set max_num_seqs from memory budget, not lab throughput peaks.
- Align admission control queue limits with SLO math.
- Autoscale on TTFT and queue depth; include warmup time in cooldown.
- Run N+1 or scale-ahead before known traffic events.
- Track error budget burn; gate deploys when SLO is violated.
- Revisit plan quarterly as model, context limits, or traffic mix changes.
Key takeaways
- LLM capacity is constrained by KV memory and queue dynamics, not GPU utilization alone.
- SLOs must include queue wait; engine-only TTFT hides gateway bottlenecks.
- Little’s law links arrival rate, concurrency, and latency — use it for queue and replica sizing.
- Harbor Analytics fixed campaign outages by lowering max_num_seqs, tracking KV pressure, scaling ahead of known spikes, and autoscaling on queue wait and TTFT rather than tokens/sec.
- Autoscale triggers should lead GPU util: queue wait, TTFT, and KV utilization fire earlier and more reliably.
Related reading
- LLM inference serving explained — end-to-end serving stack
- LLM admission control and request queuing explained — backpressure and fair queuing
- LLM observability explained — traces, token metrics, and dashboards
- LLM inference warmup and cold start explained — autoscale and deploy latency