LLM admission control and request queuing explained

Harbor Analytics shipped priority lanes and chunked prefill after RAG traffic spiked TTFT for short queries. p95 time-to-first-token recovered to 720 ms — until Black Friday, when a partner embedded the API in a checkout upsell widget. Request volume tripled. No one hit RPM limits; every call was under quota. Yet p95 TTFT climbed past 8 s, GPU memory pinned at 98%, and cancel events piled up because users abandoned slow streams. The scheduler inside vLLM was doing its job; the problem was admission: too many sequences entered the inference queue at once.

Admission control is the gateway layer that decides whether a request may join the GPU work queue, how long it may wait, and which lane it occupies. In the request path it sits after rate limiting (which caps averages over time) and before the engine scheduler (which batches tokens once work is admitted). This guide covers queue policies, token budgets at the door, backpressure vs hard rejects, priority and fairness, the Harbor Analytics gateway refactor, a technique decision table, pitfalls, and a production checklist.

What admission control decides

An LLM inference path has three choke points. Rate limits protect the API surface from sustained abuse. Admission control bounds concurrent and queued work against finite GPU memory and scheduler capacity. Engine scheduling (continuous batching, chunked prefill) optimizes tokens per forward pass for work already admitted.

Without admission control, a burst of 200 concurrent RAG requests — each with 12K-token prompts — can exhaust KV cache blocks before any request completes. Users see spinning loaders instead of fast 429s; cancel storms waste partial prefills; and multi-tenant fairness collapses because everyone shares one overloaded queue.

Concurrency caps — max in-flight requests per tenant, API key, or global pool.
Queue depth — max waiting requests before reject or shed.
Wait-time budgets — dequeue or fail if TTFT SLO would be violated before processing starts.
Token admission — reject or truncate prompts whose estimated KV footprint cannot fit.

Good admission returns a clear signal: HTTP 429 with Retry-After, a queue position header, or a degraded fallback — not an unbounded wait that looks like a hung connection.
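
As a minimal sketch of the decisions above, the gate below combines a concurrency cap with a bounded queue and fails fast with a 429 once both are full. The class and field names (AdmissionGate, Decision) are hypothetical, the counters stand in for a real waiter queue, and the default Retry-After of 2 seconds is only an illustrative value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str                           # "run", "queue", or "reject"
    http_status: Optional[int] = None     # set only on reject
    retry_after_s: Optional[int] = None   # value for the Retry-After header
    queue_position: Optional[int] = None  # 1-based position when queued


class AdmissionGate:
    """Hypothetical gate: a concurrency cap plus a bounded wait queue."""

    def __init__(self, max_in_flight: int, max_queue_depth: int, retry_after_s: int = 2):
        self.max_in_flight = max_in_flight
        self.max_queue_depth = max_queue_depth
        self.retry_after_s = retry_after_s
        self.in_flight = 0
        self.queued = 0

    def try_admit(self) -> Decision:
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            return Decision(action="run")
        if self.queued < self.max_queue_depth:
            self.queued += 1
            return Decision(action="queue", queue_position=self.queued)
        # Both the pool and the queue are full: fail fast instead of hanging.
        return Decision(action="reject", http_status=429, retry_after_s=self.retry_after_s)

    def release(self) -> None:
        """Call when a running request finishes or is cancelled."""
        if self.queued > 0:
            self.queued -= 1   # a waiter takes over the freed slot
        elif self.in_flight > 0:
            self.in_flight -= 1


gate = AdmissionGate(max_in_flight=2, max_queue_depth=1)
print([gate.try_admit().action for _ in range(4)])
```

With two slots and a queue depth of one, the fourth caller is rejected immediately rather than left waiting on an open connection.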

Queue policies: FIFO, priority, and weighted fair queuing

First-in, first-out (FIFO)

Simplest policy: requests enter a single queue and dequeue in arrival order. Fair in the abstract, but dangerous for mixed workloads. One tenant uploading a 40K-token legal brief blocks interactive chat behind it unless the engine uses chunked prefill — and even then, queue wait time grows linearly with depth.

Priority lanes

Tag requests (P0 interactive, P1 standard RAG, P2 batch) and maintain separate queues or strict priority dequeue. P0 always jumps ahead. Risk: starvation if P2 never runs; mitigate with aging (boost priority after N seconds waiting) or reserved capacity fractions (e.g. 70% P0/P1, 30% P2 minimum).
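
One way to sketch priority lanes with aging is shown below. It assumes numeric priorities where 0 is most urgent, a 120-second aging interval, and a cap of one promotion level per request; all names and defaults are illustrative rather than taken from any particular gateway. The linear scan keeps the example short and would be replaced by per-lane queues in a real system.

```python
import itertools
from typing import List, Optional, Tuple


class AgingPriorityQueue:
    """Strict-priority dequeue (0 is most urgent) with aging to limit starvation."""

    def __init__(self, age_after_s: float = 120.0, max_boost: int = 1):
        self.age_after_s = age_after_s
        self.max_boost = max_boost
        self._seq = itertools.count()
        # Each entry: (base_priority, enqueued_at, arrival_seq, request_id)
        self._items: List[Tuple[int, float, int, str]] = []

    def push(self, request_id: str, priority: int, now: float) -> None:
        self._items.append((priority, now, next(self._seq), request_id))

    def _effective(self, priority: int, enqueued_at: float, now: float) -> int:
        # One promotion level per full aging interval waited, capped at max_boost.
        boosts = min(self.max_boost, int((now - enqueued_at) // self.age_after_s))
        return max(0, priority - boosts)

    def pop(self, now: float) -> Optional[str]:
        if not self._items:
            return None
        # Lowest effective priority wins; ties go to the earliest arrival.
        best = min(
            self._items,
            key=lambda it: (self._effective(it[0], it[1], now), it[2]),
        )
        self._items.remove(best)
        return best[3]


q = AgingPriorityQueue(age_after_s=120)
q.push("batch-1", priority=2, now=0)
q.push("chat-1", priority=0, now=10)
q.push("rag-1", priority=1, now=20)
print(q.pop(now=30), q.pop(now=130), q.pop(now=130))
```

Capping the boost keeps aged batch work from ever outranking interactive traffic, which is a design choice; removing the cap trades that protection for a stronger anti-starvation guarantee.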

Weighted fair queuing (WFQ)

Each tenant or lane gets a weight, and each receives a share of dequeues proportional to that weight whenever it has work waiting. This prevents one enterprise customer from monopolizing GPUs while still allowing burst capacity when other tenants are idle. Pairs naturally with per-tenant RPM/TPM limits and concurrency caps.
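
The sketch below approximates weighted fair queuing with virtual finish tags, in the style of self-clocked fair queuing. That is a simplification chosen for brevity, not necessarily what any given gateway implements. Tenant names, weights, and the unit cost per request are assumptions; a token-aware variant could pass the estimated token count as the cost.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Tuple


class WeightedFairQueue:
    """Dequeues by smallest virtual finish tag; heavier weights advance more slowly."""

    def __init__(self, weights: Dict[str, float]):
        self.weights = weights
        self._virtual_time = 0.0
        self._last_finish: Dict[str, float] = {}
        self._seq = itertools.count()
        self._heap: List[Tuple[float, int, str, str]] = []

    def push(self, tenant: str, request_id: str, cost: float = 1.0) -> None:
        # A tenant's next request starts where its previous one finished,
        # or at the current virtual time if the tenant was idle.
        start = max(self._virtual_time, self._last_finish.get(tenant, 0.0))
        finish = start + cost / self.weights[tenant]
        self._last_finish[tenant] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant, request_id))

    def pop(self) -> Optional[Tuple[str, str]]:
        if not self._heap:
            return None
        finish, _, tenant, request_id = heapq.heappop(self._heap)
        self._virtual_time = finish
        return tenant, request_id


wfq = WeightedFairQueue({"enterprise": 4.0, "free": 1.0})
for i in range(8):
    wfq.push("enterprise", f"e{i}")
for i in range(8):
    wfq.push("free", f"f{i}")
print([wfq.pop()[0] for _ in range(10)])
```

With weights of 4 and 1 and both tenants backlogged, the first ten dequeues split eight to two, so the lighter tenant still makes steady progress.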

Shortest-job-first (SJF) approximations

Prefer admitting short prompts when queue depth is high. Estimate job size from prompt token count and max completion tokens. Improves median TTFT under overload at the cost of fairness for long-document jobs — disclose in SLA tiers.
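
A rough illustration of this bias: stay FIFO while the queue is shallow and switch to smallest-estimated-job-first once depth passes a threshold. The threshold of 8 and the Pending type are hypothetical, and job size is estimated as prompt tokens plus max completion tokens, as described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pending:
    request_id: str
    prompt_tokens: int
    max_tokens: int

    @property
    def estimated_size(self) -> int:
        # Upper bound on work: full prompt plus the largest allowed completion.
        return self.prompt_tokens + self.max_tokens


def pick_next(queue: List[Pending], sjf_depth_threshold: int = 8) -> Optional[Pending]:
    """FIFO when the queue is shallow; smallest estimated job when it is deep."""
    if not queue:
        return None
    if len(queue) > sjf_depth_threshold:
        # min() returns the first minimal item, so ties keep arrival order.
        chosen = min(queue, key=lambda p: p.estimated_size)
    else:
        chosen = queue[0]
    queue.remove(chosen)
    return chosen


waiting = [Pending("brief", 40_000, 1_000), Pending("chat", 300, 200)]
print(pick_next(waiting, sjf_depth_threshold=1).request_id)
```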

Token budgets and memory-aware admission

Request-count limits are insufficient. Two requests with 500-token prompts and two with 30K-token RAG contexts consume vastly different KV cache memory. Admission should estimate:

prompt_tokens + max_tokens per sequence against max_model_len.
Aggregate KV blocks required vs free blocks in the PagedAttention pool.
Reserved headroom (10–20%) for decode growth and new admissions.
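
The three estimates above can be sketched as a single check. The block size of 16 tokens and the 15% headroom are assumptions for illustration, and the free and total block counts are taken as inputs because how a gateway learns them from the engine is deployment-specific.

```python
import math


def blocks_needed(prompt_tokens: int, max_tokens: int, block_size: int = 16) -> int:
    """Worst-case KV blocks for one sequence (block_size is an assumed value)."""
    total_tokens = prompt_tokens + max_tokens
    return -(-total_tokens // block_size)  # integer ceiling division


def can_admit(
    prompt_tokens: int,
    max_tokens: int,
    free_blocks: int,
    total_blocks: int,
    max_model_len: int,
    block_size: int = 16,
    headroom: float = 0.15,
) -> bool:
    # 1. The sequence must fit the model's context window at all.
    if prompt_tokens + max_tokens > max_model_len:
        return False
    # 2. Keep a reserve of blocks for decode growth and new admissions.
    reserve = math.ceil(total_blocks * headroom)
    # 3. Admit only if the worst-case footprint fits in what remains.
    return blocks_needed(prompt_tokens, max_tokens, block_size) <= free_blocks - reserve


print(blocks_needed(500, 256), blocks_needed(30_000, 512))
print(can_admit(500, 256, free_blocks=2_000, total_blocks=10_000, max_model_len=32_768))
print(can_admit(30_000, 512, free_blocks=2_000, total_blocks=10_000, max_model_len=32_768))
```

Under these assumptions the short chat request needs 48 blocks and the 30K-token RAG request needs 1,907, which is why counting requests alone misjudges memory pressure.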

When estimated blocks exceed free capacity, options include: reject with 413/429, queue until slots free, truncate retrieved context upstream, or route to a cold tier with smaller concurrency. Admitting work the engine cannot finish causes OOM kills that harm all tenants — worse than a fast reject.

Gateways like LiteLLM, BentoML, and custom Envoy/Go middleware often implement token-aware admission; vLLM exposes max_num_seqs as the last line of defense inside the engine, not a substitute for gateway policy.

Backpressure, shedding, and client behavior

Backpressure slows producers when consumers are saturated: return 429, increase Retry-After, or switch to async batch endpoints for non-interactive work. Load shedding drops lowest-priority work under extreme overload — cancel queued P2 batch jobs before evicting P0 chat.

Clients must cooperate. Exponential backoff with jitter on 429 prevents retry storms that amplify outages. Streaming clients should propagate user cancel so admitted work releases GPU slots immediately. SDKs that auto-retry without reading Retry-After are a common post-mortem finding.
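
A client-side sketch of that cooperation: exponential backoff with full jitter, added on top of whatever Retry-After the gateway returned so clients neither retry early nor retry in lockstep. The function name, base delay, and cap are illustrative assumptions.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after_s: Optional[float] = None,
    base_s: float = 0.5,
    cap_s: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to sleep before retry number `attempt` (0-based) after a 429."""
    # Exponential ceiling, capped so late attempts do not wait unboundedly.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling) to spread retries out.
    jitter = rng() * ceiling
    # Never retry sooner than the server asked.
    floor = retry_after_s if retry_after_s is not None else 0.0
    return floor + jitter


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt, retry_after_s=2), 2))
```

The rng parameter exists only so the jitter can be made deterministic in tests.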

Distinguish queue wait from inference time in metrics. Harbor discovered 6 s of their 8 s p95 TTFT was queueing before the first prefill token — invisible when only end-to-end latency was charted. See inference observability for span breakdowns.
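
To make the split concrete, the sketch below derives queue wait and prefill time from three timestamps on a single request. The RequestTimeline type is hypothetical, and the numbers mirror Harbor's 6 s of 8 s for one illustrative request; percentiles across many requests do not add up this neatly.

```python
from dataclasses import dataclass


@dataclass
class RequestTimeline:
    enqueued_at: float          # request accepted by the gateway (seconds)
    prefill_started_at: float   # engine began processing the prompt
    first_token_at: float       # first token streamed to the client

    @property
    def queue_wait_ms(self) -> float:
        return (self.prefill_started_at - self.enqueued_at) * 1000.0

    @property
    def prefill_ms(self) -> float:
        return (self.first_token_at - self.prefill_started_at) * 1000.0

    @property
    def ttft_ms(self) -> float:
        return (self.first_token_at - self.enqueued_at) * 1000.0

    @property
    def queue_share(self) -> float:
        """Fraction of TTFT spent waiting before any GPU work started."""
        return self.queue_wait_ms / self.ttft_ms


t = RequestTimeline(enqueued_at=0.0, prefill_started_at=6.0, first_token_at=8.0)
print(t.queue_wait_ms, t.prefill_ms, t.queue_share)
```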

How admission interacts with engine scheduling

Admission and scheduling are complementary:

Admission bounds how many sequences compete for KV blocks and scheduler iterations.
Continuous batching packs admitted sequences into each forward pass efficiently.
Chunked prefill prevents one admitted long prompt from starving decode within the batch.

Raising max_num_seqs without gateway admission increases throughput only until KV cache memory is exhausted. Tight admission with well-tuned continuous batching often yields better p99 TTFT than loose admission on the same hardware.

Autoscaling adds replicas when queue depth or GPU utilization crosses thresholds — but scale-up takes minutes; admission must protect the current fleet during spikes.

Harbor Analytics gateway refactor

Harbor’s Black Friday incident drove a structured admission redesign:

Measure queue wait separately — OpenTelemetry spans: queue_wait_ms, prefill_ms, decode_ms. Confirmed 75% of TTFT regression was queue depth, not model slowness.
Global concurrency cap — 48 concurrent sequences per A100 replica (down from implicit unlimited via LiteLLM passthrough). Additional requests receive 429 with Retry-After: 2.
Per-tenant caps — Enterprise: 12 concurrent; Pro: 4; Free: 1. Prevents partner widget from consuming entire pool.
Priority lanes with aging — P0 checkout widget tagged interactive; P2 nightly reports aged to P1 after 120 s queue wait.
Token admission — Reject prompts >16K tokens at gateway with actionable error; suggest async batch endpoint for bulk summarization.
Validation — p95 TTFT under load restored to 1.1 s; 429 rate 3.2% at peak (acceptable); zero OOM events over 72-hour stress test.
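
The caps from the refactor can be expressed as a small gate. This is a sketch, not Harbor's actual code: the class name, the 413 status for oversize prompts, the hint header, and reading "16K" as 16,000 tokens are all assumptions, while the 48, 12, 4, and 1 limits and the Retry-After of 2 come from the steps above.

```python
from collections import defaultdict
from typing import Dict, Tuple

GLOBAL_CAP = 48                                       # concurrent sequences per replica
TIER_CAPS = {"enterprise": 12, "pro": 4, "free": 1}   # per-tenant concurrency
MAX_PROMPT_TOKENS = 16_000                            # assumption: "16K" read as 16,000
RETRY_AFTER_S = 2


class TieredGate:
    """Hypothetical gate applying token, global, and per-tenant limits in order."""

    def __init__(self) -> None:
        self.total = 0
        self.per_tenant: Dict[str, int] = defaultdict(int)

    def admit(self, tenant_id: str, tier: str, prompt_tokens: int) -> Tuple[int, Dict[str, str]]:
        if prompt_tokens > MAX_PROMPT_TOKENS:
            # Status code and header name are illustrative assumptions.
            return 413, {"X-Admission-Hint": "prompt too large; use the async batch endpoint"}
        if self.total >= GLOBAL_CAP or self.per_tenant[tenant_id] >= TIER_CAPS[tier]:
            return 429, {"Retry-After": str(RETRY_AFTER_S)}
        self.total += 1
        self.per_tenant[tenant_id] += 1
        return 200, {}

    def release(self, tenant_id: str) -> None:
        """Call on completion or client cancel so the slot frees immediately."""
        if self.per_tenant[tenant_id] > 0:
            self.per_tenant[tenant_id] -= 1
            self.total -= 1


gate = TieredGate()
print(gate.admit("widget-partner", "free", 800))
print(gate.admit("widget-partner", "free", 800))
```

A free-tier tenant's second concurrent call is turned away with a retry hint, so a single widget cannot occupy the whole pool.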

The lesson: rate limits alone do not protect GPU memory. Admission must be concurrency-, token-, and priority-aware.

Technique decision table

Your situation	Prefer	Avoid
Single-tenant internal tool, fixed load	Simple FIFO + modest concurrency cap	Complex WFQ with no multi-tenant need
Multi-tenant SaaS API	Per-tenant concurrency + WFQ + priority lanes	Global FIFO only
Mixed short chat + long RAG	Token-aware admission + priority + chunked prefill	Request-count limits without token estimates
Strict TTFT SLO (<1 s)	Low queue depth, wait-time dequeue, SJF bias	Deep queues that hide overload
Batch-heavy overnight jobs	Separate batch tier or P2 lane with reserved capacity	Same queue as interactive without tagging
Provider API only (OpenAI, Anthropic)	Client-side concurrency pools + 429 backoff	Assuming you control their GPU admission
Autoscaling Kubernetes fleet	Admission protects current pods; HPA on queue depth	Unbounded queue while pods scale up

Common pitfalls

Rate limiting without admission control. RPM under quota but 500 concurrent requests melt KV memory.
Unbounded queue depth. Users wait 30 s then cancel; partial prefills waste GPU cycles.
No queue-wait metrics. TTFT regressions blamed on the model.
Priority without aging. Batch jobs starve indefinitely.
Ignoring cancel propagation. Abandoned streams hold slots until timeout.
Same admission for all model sizes. 70B needs tighter caps than 8B on identical hardware.
Retry storms on 429. Clients ignore Retry-After and amplify overload.
Raising max_num_seqs as the only fix. Defers OOM; does not fix fairness or TTFT.

Production checklist

Implement global and per-tenant concurrency caps at the gateway.
Set max queue depth with clear 429 or 503 responses and Retry-After.
Estimate KV footprint from prompt + max completion tokens before admission.
Tag priority lanes (interactive, standard, batch) with aging for fairness.
Instrument queue_wait_ms separately from prefill and decode latency.
Propagate client cancel to release GPU slots immediately.
Document 429 behavior in API docs; test SDK backoff under synthetic overload.
Reserve KV headroom; never admit to 100% block utilization.
Pair admission with chunked prefill and continuous batching tuning.
Load-test burst scenarios (3× normal traffic for 10 minutes).
Autoscale on queue depth and GPU memory, not CPU alone.
Review admission limits after model or context-length upgrades.

Key takeaways

Admission control bounds which requests enter GPU inference; rate limits cap averages over time — both are required.
Token-aware admission prevents KV memory exhaustion that request-count limits miss.
Queue wait often dominates TTFT under overload; measure it separately from model time.
Priority lanes with aging protect interactive SLOs without starving batch work.
Harbor Analytics brought p95 TTFT back down to 1.1 s at 3× traffic by capping concurrency and rejecting oversize prompts at the gateway.