LLM inference SLO and capacity planning explained

Harbor Analytics launched a marketing campaign that tripled concurrent chat sessions on a Tuesday afternoon. GPU utilization on the 70B fleet peaked at 62% — well below the 85% alert threshold — yet p95 time-to-first-token (TTFT) blew past four seconds and support tickets flooded in. The dashboards looked healthy because the team had sized capacity on tokens per second alone, ignoring queue wait, long-context KV memory pressure, and the gap between average batch depth and worst-case prefill shapes. Capacity planning for LLM inference is not the same as sizing a stateless REST API.

An inference SLO is a user-visible latency or availability promise backed by measurable signals: TTFT, inter-token latency (ITL), end-to-end response time, error rate, and queue wait. Capacity planning translates those SLOs into GPU count, memory headroom, concurrency limits, and autoscale triggers. This guide covers metric definitions, Little’s law for queue sizing, KV memory budgeting, throughput vs latency trade-offs, burst headroom, admission control alignment, the Harbor Analytics Black Friday refactor, a technique decision table, pitfalls, and a production checklist.

The metrics that actually matter

Raw GPU utilization is a poor primary SLO driver. A replica can sit at 40% util while requests queue because prefill kernels monopolize SMs or KV pages are exhausted. Track these layers separately:

  • TTFT (time to first token) — wall clock from request accepted to first streamed token. Includes queue wait, prefill compute, and scheduler overhead. Interactive chat SLOs often target p95 TTFT under 800ms–2s depending on model size.
  • ITL / TPOT (inter-token latency, also reported as time per output token) — milliseconds between consecutive output tokens during decode, usually tracked at p50 and p95. Users perceive “stutter” when ITL spikes even if TTFT was fine.
  • End-to-end latency — full response time including all tokens. Dominated by output length; pair with max-tokens policies.
  • Queue wait — time spent in gateway or engine queue before GPU work starts. Often the hidden killer during bursts.
  • Tokens per second (throughput) — aggregate output tokens/sec across replicas. Useful for cost and fleet sizing, not user experience alone.
  • Prefill tokens per second — input-side throughput; long-context RAG queries are prefill-heavy and can starve decode even at moderate util.
  • KV cache utilization — fraction of the engine’s KV cache blocks (PagedAttention) currently allocated to sequences. Hitting 95%+ triggers preemption, batch shrinking, or OOM.
  • Batch depth — concurrent sequences in the continuous batching iteration. Higher depth raises throughput but can worsen ITL for any single stream.
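As a rough illustration of how the first few signals relate, the sketch below derives queue wait, TTFT, and inter-token gaps from per-request timestamps. The RequestTrace fields and helper names are hypothetical and not taken from any particular engine or gateway; the only assumption is that your stack can record when a request was accepted, when GPU work began, and when each token was streamed.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request timestamps, all in seconds."""
    accepted_at: float        # gateway accepted the request
    gpu_start_at: float       # engine began prefill for this request
    token_times: List[float]  # arrival time of each streamed output token


def queue_wait(trace: RequestTrace) -> float:
    # Time spent waiting before any GPU work started.
    return trace.gpu_start_at - trace.accepted_at


def ttft(trace: RequestTrace) -> float:
    # Measured from acceptance, so it includes queue wait and prefill.
    return trace.token_times[0] - trace.accepted_at


def inter_token_gaps(trace: RequestTrace) -> List[float]:
    # One gap per pair of consecutive output tokens.
    return [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]


def percentile(values: List[float], pct: int) -> float:
    # Nearest-rank percentile; adequate for dashboards, not for tiny samples.
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]
```

Computing percentile(..., 95) separately over the queue_wait and ttft values makes it visible when a TTFT regression comes from waiting rather than from prefill compute.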

Define SLOs per product surface: a copilot sidebar and a batch document summarizer should not share the same p95 TTFT target. Document error budgets (e.g. 0.1% of sessions may exceed 3s TTFT per month) so on-call knows when to freeze deploys.
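The error-budget idea can be reduced to a single ratio. The following is a minimal sketch, assuming the budget is expressed as the fraction of sessions allowed to exceed the TTFT limit in a month (0.1% in the example above); the function name and the session counts are illustrative only.

```python
def error_budget_burn(total_sessions: int, slow_sessions: int,
                      budget_fraction: float = 0.001) -> float:
    """Return the share of the error budget already consumed.

    1.0 means the budget is exactly spent; values above 1.0 mean the
    SLO has been violated for the period.
    """
    allowed = total_sessions * budget_fraction
    if allowed == 0:
        return 0.0 if slow_sessions == 0 else float("inf")
    return slow_sessions / allowed


# Illustrative numbers: 2,000,000 sessions so far this month,
# 1,500 of which exceeded the 3s TTFT limit.
burn = error_budget_burn(2_000_000, 1_500)
```

A team might freeze deploys once this ratio approaches 1.0 well before the month ends; the exact threshold is a policy choice, not something the arithmetic decides.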

Little’s law and queue sizing

Little’s law states L = λ × W: average items in system equals arrival rate times average time in system. For inference:

  • L = average concurrent in-flight requests (queued + GPU-active)
  • λ = new requests per second
  • W = average end-to-end latency per request

If chat receives 50 req/s and each request takes 6s end to end, Little’s law gives L = 50 × 6 = 300 concurrent in-flight requests on average, before burst multipliers. Queue depth keeps growing whenever λ spikes above what the fixed GPU throughput (tokens/sec) can serve.
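The same arithmetic as a small helper, with an optional burst multiplier. This is only a sketch: the function name is hypothetical, and the 3x multiplier is borrowed from the campaign described in the introduction, not a general recommendation.

```python
import math


def required_concurrency(arrival_rate_per_s: float, avg_latency_s: float,
                         burst_multiplier: float = 1.0) -> int:
    """Little's law: L = lambda * W, scaled by an assumed burst factor."""
    return math.ceil(arrival_rate_per_s * avg_latency_s * burst_multiplier)


steady = required_concurrency(50, 6)                       # 300
burst = required_concurrency(50, 6, burst_multiplier=3.0)  # 900
```

Note that W itself tends to rise under load, so the burst figure is a lower bound unless latency stays flat during the spike.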

Practical queue targets

  • Cap gateway queue depth so p95 queue wait stays under 10–15% of the TTFT SLO.
  • When queue depth exceeds its threshold for 30s, trigger scale-out. Waiting until GPU utilization hits 90% may be too late once cold start is accounted for.
  • Separate prefill and decode queues if using disaggregated serving; each tier has its own Little’s law balance.
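The first two targets above can be sketched as follows. Both pieces are assumptions for illustration: the depth cap applies Little’s law to the queue alone using the mean drain rate (so it approximates average wait, and a p95 target needs a tighter cap), and SustainedBreach is a hypothetical helper, not part of any autoscaler API.

```python
from typing import Optional


def queue_depth_cap(drain_rate_rps: float, ttft_slo_s: float,
                    wait_fraction: float = 0.15) -> int:
    """Depth at which mean queue wait is about wait_fraction of the TTFT SLO."""
    return int(drain_rate_rps * ttft_slo_s * wait_fraction)


class SustainedBreach:
    """Fires only after a value stays above its threshold for hold_s seconds."""

    def __init__(self, threshold: float, hold_s: float = 30.0):
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_started: Optional[float] = None

    def observe(self, value: float, now_s: float) -> bool:
        if value <= self.threshold:
            self._breach_started = None  # breach ended, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now_s
        return now_s - self._breach_started >= self.hold_s


# Illustrative: 50 req/s drain rate and a 1.5s TTFT SLO.
cap = queue_depth_cap(50, 1.5)
trigger = SustainedBreach(threshold=cap, hold_s=30.0)
```

Feeding trigger.observe(depth, timestamp) from the gateway’s metrics loop gives a scale-out signal that ignores momentary spikes but reacts before GPU utilization moves.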

GPU memory budgeting

VRAM is the hard constraint that caps concurrent long-context sessions. A simplified budget:

VRAM ≈ model_weights + (kv_bytes_per_token × context_length × concurrent_sequences) + activations_overhead

For a 70B model in FP16, weights alone consume ~140GB before KV. Tensor parallelism splits weights across GPUs but KV still scales with total concurrent tokens. Planning steps:

  1. Measure kv_bytes_per_token for your model and TP degree using engine metrics (vLLM exposes KV cache usage).
  2. Estimate concurrent sequences at p95 context length — not average. RAG workloads often have bimodal lengths (short chat + 32k doc paste).
  3. Reserve 10–20% VRAM headroom for fragmentation, graph capture, and batch growth. Running at 98% KV utilization guarantees tail latency spikes.
  4. Model the effect of FP8 weights (smaller footprint, more room for KV) vs quality trade-offs.

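If memory is the bottleneck, adding GPUs as data-parallel replicas increases the concurrent slot count linearly. Raising the tensor-parallel degree shrinks the per-GPU weight footprint, which frees some room for KV within that one replica, but it does not add independent replicas and it introduces inter-GPU communication overhead.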
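To make the budget concrete, here is a back-of-the-envelope calculator. Everything numeric in it is an assumption for illustration: the architecture values (80 layers, 8 KV heads, head dimension 128) describe a hypothetical 70B-class model with grouped-query attention, the replica is assumed to have 4 × 80 GB of VRAM, and GB means 10^9 bytes. Measured values from your engine should replace the formula wherever available.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(total_vram_gb: float, weights_gb: float,
                             context_tokens: int, kv_bytes: int,
                             headroom: float = 0.15,
                             overhead_gb: float = 0.0) -> int:
    """Sequences of context_tokens that fit after weights, overhead, headroom."""
    usable_gb = total_vram_gb * (1 - headroom) - weights_gb - overhead_gb
    if usable_gb <= 0:
        return 0
    per_sequence_bytes = kv_bytes * context_tokens
    return int(usable_gb * 1e9 // per_sequence_bytes)


# Hypothetical 70B-class model, FP16 KV cache.
kv = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# 4 x 80 GB replica, 140 GB of FP16 weights, 10 GB assumed activation
# overhead, 15% headroom, planned at an assumed p95 context of 8,192 tokens.
slots = max_concurrent_sequences(320, 140, 8192, kv, headroom=0.15,
                                 overhead_gb=10)
```

Under these assumptions each token costs 327,680 bytes of KV and the replica fits 45 sequences at p95 context, which is the kind of number that should bound max_num_seqs.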

Throughput vs latency trade-offs

Serving engines expose knobs that shift the Pareto frontier:

  • max_num_seqs / max concurrency — higher values improve fleet tokens/sec but widen ITL variance.
  • max_num_batched_tokens — larger batches improve prefill efficiency; pair with chunked prefill to avoid starving short requests.
  • Speculative decoding — raises effective tokens/sec at cost of extra draft-model VRAM and verification overhead.
  • Quantization — frees memory for more concurrent slots or allows smaller GPU SKUs.

Load-test at expected concurrency with realistic prompt distributions. A benchmark that only runs batch-1, 512-token prompts will overstate capacity for production RAG.
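One way to approximate a realistic prompt distribution is to sample lengths from a bimodal mix. The sketch below assumes two uniform buckets (short chat turns and long document pastes); the ranges and the 40% long-prompt share are illustrative defaults and should be replaced with values from production logs.

```python
import random
from typing import List, Tuple


def sample_prompt_lengths(n: int, long_fraction: float = 0.4,
                          short_range: Tuple[int, int] = (200, 2000),
                          long_range: Tuple[int, int] = (8000, 32000),
                          seed: int = 0) -> List[int]:
    """Draw n prompt lengths (in tokens) from a two-bucket mix."""
    rng = random.Random(seed)  # seeded so load tests are repeatable
    lengths = []
    for _ in range(n):
        lo, hi = long_range if rng.random() < long_fraction else short_range
        lengths.append(rng.randint(lo, hi))
    return lengths


lengths = sample_prompt_lengths(10_000)
long_share = sum(1 for x in lengths if x >= 8000) / len(lengths)
```

Driving the load generator with these lengths at the target concurrency exercises prefill pressure and KV growth in a way a fixed 512-token benchmark cannot.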

Autoscaling and burst headroom

Reactive autoscale on CPU or GPU util fails for LLMs because:

  • New replicas need minutes to load weights and warm up.
  • Queue depth rises before util does — requests wait while GPUs look idle.
  • Overly aggressive scale-down forces a full cold start on the next uptick.

Signals that work better

  • p95 TTFT or queue wait exceeding SLO for N consecutive minutes
  • KV cache utilization above 80% sustained
  • Admission reject rate (429) trending up
  • Scheduler batch depth pegged at max_num_seqs
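These signals can be combined into one scale-out check. The following is a simplified sketch: FleetSnapshot and the default limits are hypothetical, the reject-rate trend is reduced to a fixed threshold, and the function evaluates a single snapshot, so in practice it would be wrapped in a sustained-duration check before acting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FleetSnapshot:
    """Hypothetical aggregate of the signals listed above."""
    p95_ttft_s: float
    p95_queue_wait_s: float
    kv_cache_utilization: float  # 0.0 to 1.0
    reject_rate: float           # fraction of requests answered with 429
    batch_depth: int
    max_num_seqs: int


def scale_out_reasons(s: FleetSnapshot, ttft_slo_s: float = 1.5,
                      queue_wait_budget_s: float = 0.2,
                      kv_limit: float = 0.80,
                      reject_limit: float = 0.01) -> List[str]:
    """Return the names of every signal currently asking for more capacity."""
    reasons = []
    if s.p95_ttft_s > ttft_slo_s:
        reasons.append("ttft")
    if s.p95_queue_wait_s > queue_wait_budget_s:
        reasons.append("queue_wait")
    if s.kv_cache_utilization > kv_limit:
        reasons.append("kv_cache_utilization")
    if s.reject_rate > reject_limit:
        reasons.append("reject_rate")
    if s.batch_depth >= s.max_num_seqs:
        reasons.append("batch_depth_pegged")
    return reasons
```

Returning the list of reasons rather than a bare boolean keeps the scaling decision auditable when on-call reviews why replicas were added.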

Maintain burst headroom: if steady state needs 4 replicas, run 5 during business hours or keep N+1 warm. For predictable events (launches, Black Friday), scale ahead by 24–48 hours and validate with synthetic load, not spreadsheet extrapolation.
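The N+1 rule is simple enough to encode directly. This sketch assumes the per-replica slot count comes from a memory budget and that one spare replica is the chosen headroom; the 300 and 80 in the example are illustrative numbers, not measurements.

```python
import math


def replicas_needed(peak_concurrency: int, slots_per_replica: int,
                    spare_replicas: int = 1) -> int:
    """Replicas to cover peak concurrency plus warm spares (N+1 by default)."""
    if slots_per_replica <= 0:
        raise ValueError("slots_per_replica must be positive")
    return math.ceil(peak_concurrency / slots_per_replica) + spare_replicas


# 300 concurrent requests at an assumed 80 slots per replica:
# 4 replicas for steady state, 5 with the N+1 spare.
fleet = replicas_needed(300, 80)
```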

Harbor Analytics Black Friday refactor (worked example)

Problem: Chat SLO was “p95 TTFT < 1.5s.” Capacity plan assumed 2,000 tokens/sec fleet-wide throughput based on offline benchmarks. Campaign traffic hit 1,400 concurrent sessions with 40% of prompts above 8k tokens. TTFT p95 reached 5.2s; GPUs showed 58% average util.

Root causes:

  1. Queue wait averaged 2.1s but was not in the SLO dashboard.
  2. KV cache hit 97% on two replicas; scheduler preempted batches, doubling ITL.
  3. Autoscale triggered at 85% GPU util with 6-minute cold-start penalty.
  4. No per-tenant concurrency cap; one API partner flooded the shared pool.

Changes:

  1. Redefined SLO as TTFT = queue_wait + engine_TTFT with separate alerts on each component.
  2. Added KV cache utilization and batch depth to observability dashboards.
  3. Set max_num_seqs per replica from memory budget at p95 context, not peak throughput lab tests.
  4. Implemented tiered admission control: soft queue at 200 depth, hard 429 at 400, partner-specific quotas.
  5. Scale-ahead job added two replicas 36 hours before known traffic spikes; autoscale now uses queue wait + TTFT, with 15-minute cooldown after warmup.
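Change 4 can be illustrated with a small in-memory controller. This is a sketch under stated assumptions, not Harbor Analytics’ implementation: the source does not say what the soft limit does, so here it is assumed to admit the request with a flag the gateway could use for backpressure or priority shedding; queued and running requests share one counter; and the default per-tenant quota is invented.

```python
from collections import defaultdict
from typing import Dict, Optional


class AdmissionController:
    def __init__(self, soft_limit: int = 200, hard_limit: int = 400,
                 tenant_quotas: Optional[Dict[str, int]] = None,
                 default_quota: int = 50):  # default quota is hypothetical
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.tenant_quotas = tenant_quotas or {}
        self.default_quota = default_quota
        self.depth = 0                     # admitted but unfinished requests
        self.in_flight = defaultdict(int)  # same count, per tenant

    def admit(self, tenant: str) -> str:
        quota = self.tenant_quotas.get(tenant, self.default_quota)
        # Reject when the tenant is over quota or the shared pool is full.
        if self.in_flight[tenant] >= quota or self.depth >= self.hard_limit:
            return "reject_429"
        self.in_flight[tenant] += 1
        self.depth += 1
        # Past the soft limit, admit but flag the request (assumed behavior).
        return "admit_soft" if self.depth > self.soft_limit else "admit"

    def release(self, tenant: str) -> None:
        # Call when a request finishes or is cancelled.
        if self.in_flight[tenant] > 0:
            self.in_flight[tenant] -= 1
            self.depth -= 1
```

Checking the tenant quota before the shared limit means a single noisy partner is rejected on its own account and never consumes the headroom that other tenants rely on.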

Outcome: Next major campaign held p95 TTFT at 1.3s with peak 2,100 concurrent sessions. Fleet tokens/sec actually dropped 8% (lower max_num_seqs) but user-visible latency improved because queue wait fell 74%.

Technique decision table

| Scenario | Prefer | Avoid |
| --- | --- | --- |
| Interactive chat SLO | TTFT + ITL SLOs; queue wait in dashboard; N+1 headroom | Tokens/sec-only capacity model |
| Long-context RAG | KV memory budget at p95 context; chunked prefill; context limits | Average context length in planning |
| Predictable traffic spike | Scale-ahead + load test; warmup-gated replicas | Reactive autoscale on GPU util alone |
| Multi-tenant API | Per-tenant concurrency quotas; fair queuing | Single shared queue with no isolation |
| Cost-constrained fleet | Quantization + right-sized max_num_seqs; model routing to smaller SLM | Over-provisioning without SLO measurement |
| Batch overnight jobs | Separate pool; maximize throughput; loose TTFT SLO | Sharing GPUs with latency-sensitive chat |

Common pitfalls

  • SLO on engine TTFT only. Gateway queue wait is invisible until users complain.
  • Benchmarking at batch-1. Production continuous batching behaves differently.
  • Planning on mean context length. A few 32k prompts can exhaust KV while averages look fine.
  • Ignoring cold start in autoscale. New pods join too late and too cold.
  • 100% KV utilization targets. Leave headroom for batch growth and fragmentation.
  • One SLO for all tenants. Batch and chat need different pools or priorities.
  • GPU util as sole scale signal. Queue depth and TTFT lead util by minutes.
  • No error budget process. Teams ship features while SLO has been red for weeks.

Production checklist

  • Define p50/p95/p99 TTFT, ITL, and end-to-end latency SLOs per product surface.
  • Measure queue wait separately from engine compute time.
  • Dashboard KV cache utilization, batch depth, and prefill vs decode tokens/sec.
  • Build capacity model: Little’s law concurrency at peak λ and W.
  • Budget VRAM at p95 context length with 10–20% headroom.
  • Load-test with realistic prompt length distribution and concurrency.
  • Set max_num_seqs from memory budget, not lab throughput peaks.
  • Align admission control queue limits with SLO math.
  • Autoscale on TTFT and queue depth; include warmup time in cooldown.
  • Run N+1 or scale-ahead before known traffic events.
  • Track error budget burn; gate deploys when SLO is violated.
  • Revisit plan quarterly as model, context limits, or traffic mix changes.

Key takeaways

  • LLM capacity is constrained by KV memory and queue dynamics, not GPU utilization alone.
  • SLOs must include queue wait; engine-only TTFT hides gateway bottlenecks.
  • Little’s law links arrival rate, concurrency, and latency — use it for queue and replica sizing.
  • Harbor Analytics fixed campaign outages by lowering max_num_seqs, tracking KV pressure, adding replicas ahead of known spikes, and autoscaling on queue wait and TTFT instead of sizing on tokens/sec.
  • Autoscale triggers should lead GPU util: queue wait, TTFT, and KV utilization fire earlier and more reliably.
