Guide
LLM rate limiting and quota management explained
Harbor Support's triage copilot launched to 400 concurrent agents on a Monday. Within twenty minutes, p95 latency climbed from 1.2 seconds to 38 seconds — not because the model slowed, but because every pod independently retried HTTP 429 responses from OpenAI without a shared budget. Worse, a nightly embedding job on the same API key consumed half the tokens-per-minute (TPM) pool, starving live chat. The fix was not “buy a bigger tier” alone: the team added a central gateway with dual token and request buckets, priority lanes for interactive traffic, and hard caps on background workloads routed to the batch API instead of realtime endpoints.
Rate limiting for production LLM workloads differs from rate limiting for traditional REST APIs. Providers enforce both requests per minute (RPM) and tokens per minute (TPM), often per model, per organization, and sometimes per project. Burst traffic, long contexts, and parallel tool calls all burn TPM faster than RPM. Without deliberate quota architecture, your app discovers limits through cascading retries and angry users. This guide covers limit semantics, client vs gateway throttling, fair multi-tenant budgets, queueing strategies, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
TPM, RPM, and why both matter
Cloud LLM providers publish tier tables with a separate ceiling for each limit type. Confusing them is a common quota bug in new deployments:
| Limit type | What it counts | Typical pain point |
|---|---|---|
| TPM (tokens per minute) | Input + output tokens across requests in a rolling window | Long RAG prompts, 128k contexts, bulk classification |
| RPM (requests per minute) | HTTP completion calls regardless of size | Many small agent tool loops, high fan-out micro-calls |
| Concurrent requests | In-flight calls at once (some providers) | Streaming sessions held open; agent swarms |
| Batch quota | Separate pool for async JSONL jobs | Backfills that should never touch realtime TPM |
A request can be under RPM but blow TPM — one 80k-token document summary blocks dozens of short chats if they share a bucket. Conversely, a thousand tiny classification calls can hit RPM while TPM looks fine. Your limiter must track both dimensions and estimate tokens before dispatch when possible (tiktoken, provider tokenizers, or conservative heuristics).
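As a rough illustration of the "conservative heuristics" option, the sketch below estimates tokens from character count and checks a request against both dimensions. The function names and the 3-characters-per-token ratio are assumptions chosen to overestimate on purpose; a real tokenizer would be more accurate where one is available.

```python
import math

# Assumption: about 4 characters per token is a common rule of thumb for English text.
# Using 3 here deliberately overestimates so the limiter errs on the safe side.
CHARS_PER_TOKEN = 3.0


def estimate_request_tokens(prompt: str, max_output_tokens: int) -> int:
    """Conservative pre-dispatch estimate: input heuristic plus the full output reservation."""
    estimated_input = math.ceil(len(prompt) / CHARS_PER_TOKEN)
    return estimated_input + max_output_tokens


def fits_limits(estimated_tokens: int, tpm_remaining: int, rpm_remaining: int) -> bool:
    """A request must fit both dimensions: one request slot and enough token headroom."""
    return rpm_remaining >= 1 and estimated_tokens <= tpm_remaining
```

With these definitions, an 80k-token summary fails the check against a 60k TPM remainder even when plenty of request slots are left, and a tiny call fails once the request slots reach zero.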
Throttling algorithms: token bucket vs leaky bucket
Most production gateways implement a token bucket (or dual buckets for RPM and TPM):
- Token bucket — capacity refills at a steady rate; bursts consume accumulated tokens up to the cap. Matches provider “per minute with burst” behavior well.
- Leaky bucket — requests drip out at fixed rate; smoother output, less burst-friendly. Useful for protecting downstream GPUs on self-hosted stacks.
- Sliding window log — exact counts over the last 60 seconds; higher memory cost but precise when provider windows are strict.
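For comparison with the bucket approaches, here is a minimal sliding window log, assuming a single process and caller-supplied timestamps. The class and method names are hypothetical; the `cost` parameter lets the same structure count either requests (cost 1) or tokens.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowLog:
    """Exact usage over the trailing window; memory grows with the number of logged events."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, cost)
        self.total = 0

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window and give their cost back.
        while self.events and now - self.events[0][0] >= self.window_seconds:
            _, cost = self.events.popleft()
            self.total -= cost

    def try_record(self, now: float, cost: int = 1) -> bool:
        """Log the event and return True if it fits; otherwise leave state unchanged."""
        self._evict(now)
        if self.total + cost > self.limit:
            return False
        self.events.append((now, cost))
        self.total += cost
        return True
```

The per-event log is what makes this precise and also what makes it more memory-hungry than a bucket, which stores only a level and a timestamp.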
Client libraries often ship naive semaphores (“max 10 parallel”) that ignore TPM entirely. Replace them with a shared limiter service (Redis, in-memory on a single gateway, or embedded in LiteLLM) that decrements estimated tokens on enqueue and reconciles with actual usage from response metadata.
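A dual token bucket can be sketched in a few lines. This version is in-process only and uses hypothetical class names; a shared deployment would keep the same two levels and timestamps in Redis (or another shared store) so that every pod sees one budget. The injectable `clock` is there purely to make the behavior testable.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills continuously at a steady rate, up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.level = capacity
        self.updated_at = clock()

    def available(self) -> float:
        now = self.clock()
        elapsed = now - self.updated_at
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.updated_at = now
        return self.level

    def take(self, amount: float) -> None:
        self.level -= amount


class DualLimiter:
    """Admits a request only if both the request bucket and the token bucket can cover it."""

    def __init__(self, rpm: int, tpm: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.requests = TokenBucket(rpm, rpm / 60.0, clock)
        self.tokens = TokenBucket(tpm, tpm / 60.0, clock)

    def try_acquire(self, estimated_tokens: int) -> bool:
        # Check both buckets before taking from either, so a rejection has no side effects.
        if self.requests.available() < 1 or self.tokens.available() < estimated_tokens:
            return False
        self.requests.take(1)
        self.tokens.take(estimated_tokens)
        return True
```

Checking both buckets before decrementing either avoids leaking a request slot when the token check fails.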
Pre-flight estimation vs post-hoc accounting
Pre-flight: reject or queue when estimated_input + max_output exceeds remaining TPM, which prevents 429 storms. Post-hoc: adjust the bucket with usage.prompt_tokens + usage.completion_tokens from the response, correcting drift when estimates were wrong. Log both estimate error and 429 rate in observability dashboards.
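The reserve-then-reconcile flow might look like the following sketch. `TpmBudget` and its methods are hypothetical names, and window refill is omitted to keep the accounting visible; the `prompt_tokens` and `completion_tokens` arguments stand for the values reported in the response's usage metadata.

```python
from typing import List, Optional


class TpmBudget:
    """Tracks remaining TPM for the current window (refill logic omitted for brevity)."""

    def __init__(self, remaining_tokens: int) -> None:
        self.remaining = remaining_tokens
        self.estimate_errors: List[int] = []  # reservation minus actual, for dashboards

    def reserve(self, estimated_input: int, max_output: int) -> Optional[int]:
        """Pre-flight: hold estimated_input + max_output, or return None to queue/reject."""
        reservation = estimated_input + max_output
        if reservation > self.remaining:
            return None
        self.remaining -= reservation
        return reservation

    def reconcile(self, reservation: int, prompt_tokens: int, completion_tokens: int) -> int:
        """Post-hoc: replace the reservation with actual usage and record the drift."""
        actual = prompt_tokens + completion_tokens
        error = reservation - actual  # positive means the estimate was too high
        self.remaining += error
        self.estimate_errors.append(error)
        return error
```

Because `max_output` is reserved in full, most reconciliations return tokens to the budget; a persistently negative error suggests the input estimator is undercounting.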
Priority queues and traffic classes
Flat FIFO queues treat a bulk re-embed job the same as a paying user's chat turn. Production systems define traffic classes with weighted fair queuing:
- Interactive (P0) — user-visible chat, copilot, live agent turns. Reserved TPM headroom (e.g. 60% of org budget).
- Near-realtime (P1) — webhooks, email draft generation, sub-30s SLA. Borrows spare capacity; yields instantly to P0.
- Background (P2) — evals, analytics, index rebuilds. Hard cap; overflow spills to async batch or off-peak windows.
Implement with separate Redis streams per class, a scheduler that always drains P0 before P1, and max_wait_ms deadlines: if P1 cannot start within SLA, fail fast with a clear message rather than wedging behind P2. For multi-tenant SaaS, nest per-tenant sub-buckets inside each class so one noisy customer cannot exhaust the org-wide TPM.
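The drain order and deadline handling can be sketched with in-memory queues standing in for the per-class Redis streams. The `Job` fields and `PriorityScheduler` class are hypothetical, the scheduler is strict-priority only, and per-tenant sub-buckets are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple

PRIORITIES = ("P0", "P1", "P2")  # drained in this order


@dataclass
class Job:
    job_id: str
    traffic_class: str
    enqueued_at_ms: int
    max_wait_ms: int


class PriorityScheduler:
    def __init__(self) -> None:
        self.queues: Dict[str, Deque[Job]] = {p: deque() for p in PRIORITIES}

    def enqueue(self, job: Job) -> None:
        self.queues[job.traffic_class].append(job)

    def next_job(self, now_ms: int) -> Tuple[Optional[Job], List[Job]]:
        """Return (job to run, jobs that missed their deadline and should fail fast)."""
        expired: List[Job] = []
        for priority in PRIORITIES:
            queue = self.queues[priority]
            while queue:
                job = queue.popleft()
                if now_ms - job.enqueued_at_ms > job.max_wait_ms:
                    expired.append(job)  # caller returns a clear error to the client
                    continue
                return job, expired
        return None, expired
```

Returning expired jobs to the caller, rather than silently dropping them, keeps the "fail fast with a clear message" behavior in one place.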
Gateway patterns: centralize or suffer
Every microservice holding its own API key and retry loop multiplies 429s. Standard pattern:
- Single egress gateway: all LLM calls flow through one service (or LiteLLM proxy) that owns rate state.
- Key rotation and model routing: secondary keys or fallback models with independent quotas; see retry and fallback for when to switch vs queue.
- Admission control: return HTTP 503 with Retry-After from your gateway before hitting the provider; saves money and preserves provider relationship.
- Idempotency keys: dedupe retries at the gateway so a client timeout does not double-charge TPM on duplicate submits.
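Admission control and idempotency-key dedupe can be combined in one request handler. The sketch below is framework-free and uses hypothetical names throughout; the provider call is a stub, and the Retry-After value is derived from how long the local bucket needs to refill the deficit.

```python
import math
from typing import Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)


class Gateway:
    def __init__(self, tokens_available: float, refill_per_second: float) -> None:
        self.tokens_available = tokens_available
        self.refill_per_second = refill_per_second
        self.completed: Dict[str, Response] = {}  # idempotency key -> stored response
        self.provider_calls = 0

    def _call_provider(self, prompt: str) -> str:
        # Stub standing in for the real provider request.
        self.provider_calls += 1
        return f"completion for: {prompt}"

    def handle(self, idempotency_key: str, prompt: str, estimated_tokens: int) -> Response:
        # A retried submit with a known key is served from memory and costs no TPM.
        if idempotency_key in self.completed:
            return self.completed[idempotency_key]
        # Admission control: refuse locally instead of provoking a provider 429.
        if estimated_tokens > self.tokens_available:
            deficit = estimated_tokens - self.tokens_available
            retry_after = max(1, math.ceil(deficit / self.refill_per_second))
            return 503, {"Retry-After": str(retry_after)}, "local TPM budget exhausted"
        self.tokens_available -= estimated_tokens
        response: Response = (200, {}, self._call_provider(prompt))
        self.completed[idempotency_key] = response
        return response
```

Only successful responses are stored under the key here, so a client that received a 503 can retry with the same key once the budget has refilled.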
Self-hosted inference (vLLM/TGI) shifts the bottleneck to GPU memory and batch size: limit concurrent sequences and max batch tokens instead of TPM. The same priority-queue logic still applies.
Multi-tenant budgets and cost caps
Quota management extends beyond provider limits to your economics:
- Per-tenant daily token budget: hard stop or downgrade to cheaper model when exceeded; surface usage in admin UI.
- Per-feature caps: free tier vs paid; agent loops vs single-shot summarize.
- Cost attribution: tag requests with tenant_id, feature, prompt_version for chargeback and abuse detection.
- Alerting: warn at 80% of budget; page when 429 rate spikes or queue depth exceeds SLO.
Pair with cost optimization: a tenant burning TPM on repeated identical queries should hit semantic cache before the limiter allows another full completion.
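A per-tenant daily budget check with an 80% warning threshold could be sketched as follows. The class name, the decision strings, and the choice to downgrade rather than hard-stop are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class TenantBudget:
    daily_limit: int          # tokens per day for this tenant
    used: int = 0
    warn_ratio: float = 0.8   # warn at 80% of budget

    def record(self, tokens: int) -> None:
        """Add actual usage after a completion finishes."""
        self.used += tokens

    def decide(self, requested_tokens: int) -> str:
        """Classify a request before dispatch based on projected daily usage."""
        projected = self.used + requested_tokens
        if projected > self.daily_limit:
            return "downgrade"  # or "block" if the tenant's plan calls for a hard stop
        if projected >= self.warn_ratio * self.daily_limit:
            return "allow_with_warning"
        return "allow"
```

In practice the counter would live in shared storage keyed by tenant_id and date, and reset at the tenant's day boundary.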
Harbor Support refactor (worked example)
Before the gateway, Harbor ran three services on one OpenAI org key: live triage chat (P0), email draft suggestions (P1), and a cron embedding refresh (P2). P2 fired 200 parallel embedding calls at 02:00 UTC; morning shift P0 latency spiked.
After refactor:
- LiteLLM proxy with Redis-backed dual buckets (TPM + RPM) per model.
- P0 reserved 70% TPM; P2 capped at 15% and only runs 00:00–05:00 local unless P0 utilization < 30%.
- Embedding refresh moved to provider batch API — zero contention with chat TPM.
- Gateway returns 503 + Retry-After: 2 when local bucket empty; clients use jittered backoff instead of hammering OpenAI.
- Dashboards: queue depth by class, 429 rate (provider vs gateway), estimate error.
Result: p95 chat latency stable under 2s during embedding windows; monthly 429 count dropped 94%. Tier upgrade deferred one quarter.
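The P2 gating rule from the refactor can be expressed as a small predicate. This is one reading of the policy (P2 may run inside the 00:00 to 05:00 local window, or at any time when P0 utilization is below 30%), with hypothetical names; it is not Harbor's actual code.

```python
from datetime import time

P2_WINDOW_START = time(0, 0)
P2_WINDOW_END = time(5, 0)
P0_IDLE_THRESHOLD = 0.30   # P2 may run outside the window below this P0 utilization
P2_TPM_PERCENT = 15        # hard cap on P2's share of the TPM budget


def p2_may_run(local_time: time, p0_utilization: float) -> bool:
    """Background work runs in the off-peak window, or any time interactive load is light."""
    in_window = P2_WINDOW_START <= local_time < P2_WINDOW_END
    return in_window or p0_utilization < P0_IDLE_THRESHOLD


def p2_token_cap(model_tpm: int) -> int:
    """Maximum tokens per minute that P2 may consume, using integer math."""
    return model_tpm * P2_TPM_PERCENT // 100
```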
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Central gateway + token bucket | Multiple services, shared org quota, need priority lanes | Single tiny app with negligible traffic |
| Client-side semaphore only | Prototypes, one process, known low volume | Any multi-pod or multi-tenant deployment |
| Provider async batch API | Large offline jobs, separate quota pool, cost discounts | Sub-minute interactive SLAs |
| Circuit breaker + fallback model | Provider outage or hard 429 wall on primary model | Steady-state quota planning (treat as emergency, not plan A) |
| Semantic / prompt cache | Repeat queries, stable RAG chunks, high TPM burn on duplicates | Novel per-user long contexts every turn |
| Tier upgrade / dedicated capacity | Sustained legitimate growth after architecture is sound | Masking unfixed retry storms or missing batch routing |
Common pitfalls
- Retrying 429 without backoff: amplifies congestion; honor Retry-After and add jitter.
- Per-pod limiters: ten pods × 10 concurrent = 100 to provider while each pod thinks it is safe.
- Ignoring output tokens in estimates: max_tokens reserves headroom; agent loops compound completion TPM.
- Sharing one key across prod and staging: load tests take down production chat.
- No P2 isolation: cron jobs and backfills belong on batch endpoints or off-peak schedules.
- Blocking the event loop: synchronous sleep in retry loops stalls all requests on that worker.
- Unbounded queue depth: users wait minutes without feedback; cap queue and fail with actionable errors.
- Upgrading tier before measuring: 429s often mean architecture, not capacity ceiling.
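Two of the pitfalls above (retrying without backoff and blocking the event loop) can be avoided with a capped, jittered, non-blocking retry loop. The sketch below assumes a hypothetical `send` coroutine that returns a (status, retry_after_seconds, body) tuple; the base delay, cap, and retry count are illustrative defaults.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, Tuple

SendResult = Tuple[int, Optional[float], str]  # (status, Retry-After seconds or None, body)


def backoff_delay(attempt: int, retry_after: Optional[float], base: float = 0.5,
                  cap: float = 30.0, rng: Callable[[], float] = random.random) -> float:
    """Wait at least Retry-After, plus full jitter over a capped exponential ceiling."""
    ceiling = min(cap, base * (2 ** attempt))
    return (retry_after or 0.0) + rng() * ceiling


async def call_with_retries(send: Callable[[], Awaitable[SendResult]],
                            max_retries: int = 4,
                            sleep: Callable[[float], Awaitable[None]] = asyncio.sleep
                            ) -> Tuple[int, str]:
    for attempt in range(max_retries + 1):
        status, retry_after, body = await send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # asyncio.sleep yields to the event loop, so other requests keep flowing.
        await sleep(backoff_delay(attempt, retry_after))
    return 429, "retries exhausted"
```

Adding jitter on top of Retry-After, rather than sleeping for exactly that value, keeps many clients from waking at the same instant.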
Production checklist
- Document provider TPM, RPM, and concurrent limits per model and tier.
- Route all LLM traffic through a central gateway with shared rate state.
- Implement dual buckets (tokens + requests) with pre-flight estimation.
- Define P0/P1/P2 traffic classes with reserved headroom for interactive use.
- Move deferrable workloads to async batch APIs or off-peak windows.
- Return gateway 503 with Retry-After before provider 429 when possible.
- Use exponential backoff + jitter on provider 429; cap max retries.
- Reconcile estimated vs actual tokens from response usage metadata.
- Per-tenant daily budgets and feature-level caps with admin visibility.
- Alert on queue depth, 429 rate, and TPM utilization by class.
- Separate API keys for staging, production, and batch jobs.
Key takeaways
- TPM and RPM are independent constraints — limiters must track both.
- Centralize quota state; per-service semaphores multiply into provider storms.
- Priority lanes keep chat responsive; background work belongs on batch or off-peak.
- Admission control at your gateway beats blind retries against OpenAI 429s.
- Measure 429s and queue depth before buying the next tier upgrade.
Related reading
- LLM retry, fallback and resilience explained — backoff, circuit breakers, and when to switch models
- LLM async batch API explained — separate quota pools for overnight jobs
- LLM cost optimization explained — token budgets, caching, and model cascades
- Exponential backoff and retry patterns explained — jitter, caps, and Retry-After semantics