LLM rate limiting and quota management explained

Harbor Support's triage copilot launched to 400 concurrent agents on a Monday. Within twenty minutes, p95 latency climbed from 1.2 seconds to 38 seconds — not because the model slowed, but because every pod independently retried HTTP 429 responses from OpenAI without a shared budget. Worse, a nightly embedding job on the same API key consumed half the tokens-per-minute (TPM) pool, starving live chat. The fix was not “buy a bigger tier” alone: the team added a central gateway with dual token and request buckets, priority lanes for interactive traffic, and hard caps on background workloads routed to the batch API instead of realtime endpoints.

Rate limiting for production LLM traffic differs from rate limiting for traditional REST APIs. Providers enforce both requests per minute (RPM) and tokens per minute (TPM), often per model, per organization, and sometimes per project. Burst traffic, long contexts, and parallel tool calls all burn TPM faster than RPM. Without a deliberate quota architecture, your app discovers limits through cascading retries and angry users. This guide covers limit semantics, client vs gateway throttling, fair multi-tenant budgets, queueing strategies, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

TPM, RPM, and why both matter

Cloud LLM providers publish tier tables with separate ceilings. Confusing them is the most common quota bug in new deployments:

Limit type	What it counts	Typical pain point
TPM (tokens per minute)	Input + output tokens across requests in a rolling window	Long RAG prompts, 128k contexts, bulk classification
RPM (requests per minute)	HTTP completion calls regardless of size	Many small agent tool loops, high fan-out micro-calls
Concurrent requests	In-flight calls at once (some providers)	Streaming sessions held open; agent swarms
Batch quota	Separate pool for async JSONL jobs	Backfills that should never touch realtime TPM

A request can be under RPM but blow TPM — one 80k-token document summary blocks dozens of short chats if they share a bucket. Conversely, a thousand tiny classification calls can hit RPM while TPM looks fine. Your limiter must track both dimensions and estimate tokens before dispatch when possible (tiktoken, provider tokenizers, or conservative heuristics).
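The two-dimensional check can be sketched as follows. This is a minimal illustration, assuming a rough 4-characters-per-token heuristic in place of a real tokenizer such as tiktoken; the names `estimate_tokens`, `Remaining`, and `fits` are hypothetical and not part of any provider SDK.

```python
import math
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # assumption: rough heuristic for English text, not a provider guarantee


def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    """Conservative estimate: heuristic input tokens plus the full output reservation."""
    return math.ceil(len(prompt) / CHARS_PER_TOKEN) + max_output_tokens


@dataclass
class Remaining:
    """Budget left in the current window (hypothetical gateway state)."""
    requests: int
    tokens: int


def fits(remaining: Remaining, estimated_tokens: int) -> bool:
    """A call is admissible only if it fits both the RPM and the TPM dimension."""
    return remaining.requests >= 1 and remaining.tokens >= estimated_tokens


big_summary = estimate_tokens("x" * 320_000, max_output_tokens=1_000)  # 80_000 + 1_000
short_chat = estimate_tokens("x" * 400, max_output_tokens=200)         # 100 + 200
```

With 50 requests and 60,000 tokens left in the window, `fits` admits the short chat but not the large summary, even though the request budget alone would allow either.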

Throttling algorithms: token bucket vs leaky bucket

Most production gateways implement a token bucket (or dual buckets for RPM and TPM):

Token bucket — capacity refills at a steady rate; bursts consume accumulated tokens up to the cap. Matches provider “per minute with burst” behavior well.
Leaky bucket — requests drip out at fixed rate; smoother output, less burst-friendly. Useful for protecting downstream GPUs on self-hosted stacks.
Sliding window log — exact counts over the last 60 seconds; higher memory cost but precise when provider windows are strict.
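As an illustration of the first option, here is a minimal single-process token bucket. It is a sketch under stated assumptions: the class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical, and a shared deployment would keep this state in a central store rather than in process memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills continuously at `rate_per_sec`, never exceeding `capacity`."""

    def __init__(self, capacity: float, rate_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate_per_sec
        self.clock = clock            # injectable so tests can control time
        self.level = capacity         # start full so an initial burst is allowed
        self.updated = clock()

    def _refill(self) -> None:
        """Add the tokens earned since the last update, clamped to capacity."""
        now = self.clock()
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now

    def try_consume(self, amount: float = 1.0) -> bool:
        """Take `amount` if available and return True; never blocks."""
        self._refill()
        if amount <= self.level:
            self.level -= amount
            return True
        return False
```

A dual-bucket limiter would hold two instances per model, one counting requests and one counting tokens, and admit a call only when both `try_consume` calls would succeed.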

Client libraries often ship naive semaphores (“max 10 parallel”) that ignore TPM entirely. Replace them with a shared limiter service (Redis, in-memory on a single gateway, or embedded in LiteLLM) that decrements estimated tokens on enqueue and reconciles with actual usage from response metadata.

Pre-flight estimation vs post-hoc accounting

Pre-flight: reject or queue a request when estimated_input + max_output exceeds the remaining TPM, which prevents 429 storms.
Post-hoc: adjust the bucket using usage.prompt_tokens + usage.completion_tokens from the response, correcting drift when the estimate was wrong.
Log both the estimate error and the 429 rate in observability dashboards.
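The reserve-then-reconcile flow might look like the sketch below. The `TokenLedger` class and its method names are hypothetical, and window refill is deliberately left out so the accounting stays visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenLedger:
    """Tracks remaining TPM for one window; refill logic is omitted for brevity."""
    remaining: int
    total_abs_error: int = 0   # running sum of |estimate - actual| for dashboards

    def reserve(self, estimated_input: int, max_output: int) -> Optional[int]:
        """Pre-flight: hold the estimated tokens, or return None to signal queue/reject."""
        estimate = estimated_input + max_output
        if estimate > self.remaining:
            return None
        self.remaining -= estimate
        return estimate

    def reconcile(self, estimate: int, prompt_tokens: int, completion_tokens: int) -> None:
        """Post-hoc: replace the held estimate with the usage reported in the response."""
        actual = prompt_tokens + completion_tokens
        self.remaining += estimate - actual   # refunds over-estimates, charges under-estimates
        self.total_abs_error += abs(estimate - actual)
```

Because `max_output` is reserved in full, most calls are over-estimated and the refund in `reconcile` returns the unused headroom to the pool.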

Priority queues and traffic classes

Flat FIFO queues treat a bulk re-embed job the same as a paying user's chat turn. Production systems define traffic classes with weighted fair queuing:

Interactive (P0) — user-visible chat, copilot, live agent turns. Reserved TPM headroom (e.g. 60% of org budget).
Near-realtime (P1) — webhooks, email draft generation, sub-30s SLA. Borrows spare capacity; yields instantly to P0.
Background (P2) — evals, analytics, index rebuilds. Hard cap; overflow spills to async batch or off-peak windows.

Implement with separate Redis streams per class, a scheduler that always drains P0 before P1, and max_wait_ms deadlines — if P1 cannot start within SLA, fail fast with a clear message rather than wedging behind P2. For multi-tenant SaaS, nest per-tenant sub-buckets inside each class so one noisy customer cannot exhaust the org-wide TPM.
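The drain order and deadline behavior can be illustrated with an in-process heap standing in for the per-class Redis streams. This swap is for illustration only, and `PriorityScheduler`, `Job`, and their fields are hypothetical names.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

P0, P1, P2 = 0, 1, 2   # lower number drains first


@dataclass(order=True)
class Job:
    priority: int
    seq: int                                   # FIFO tie-breaker within a class
    name: str = field(compare=False)
    deadline_ms: Optional[float] = field(compare=False, default=None)


class PriorityScheduler:
    def __init__(self) -> None:
        self._heap: List[Job] = []
        self._seq = itertools.count()

    def submit(self, name: str, priority: int, now_ms: float,
               max_wait_ms: Optional[float] = None) -> None:
        """Enqueue a job; max_wait_ms sets the latest acceptable start time."""
        deadline = None if max_wait_ms is None else now_ms + max_wait_ms
        heapq.heappush(self._heap, Job(priority, next(self._seq), name, deadline))

    def next_job(self, now_ms: float) -> Tuple[Optional[str], List[str]]:
        """Return (job to dispatch, jobs failed because their deadline has passed)."""
        expired: List[str] = []
        while self._heap:
            job = heapq.heappop(self._heap)
            if job.deadline_ms is not None and now_ms > job.deadline_ms:
                expired.append(job.name)       # caller reports a clear SLA error
                continue
            return job.name, expired
        return None, expired
```

This sketch only notices an expired job when it reaches the head of the queue; a production scheduler would likely sweep deadlines on a timer so callers hear about the failure sooner.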

Gateway patterns: centralize or suffer

Every microservice holding its own API key and retry loop multiplies 429s. Standard pattern:

Single egress gateway — all LLM calls flow through one service (or LiteLLM proxy) that owns rate state.
Key rotation and model routing — secondary keys or fallback models with independent quotas; see retry and fallback for when to switch vs queue.
Admission control — return HTTP 503 with Retry-After from your gateway before hitting the provider; saves money and preserves provider relationship.
Idempotency keys — dedupe retries at the gateway so a client timeout does not double-charge TPM on duplicate submits.
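Admission control and idempotent retries can be combined in one handler, sketched below without any web framework. The `Gateway` class, its fields, and the `call_provider` callback are hypothetical, and the response cache has no eviction, which a real gateway would need.

```python
import math
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]   # (status, headers, body)


class Gateway:
    def __init__(self, tokens_available: int, refill_per_sec: float,
                 call_provider: Callable[[str], str]) -> None:
        self.tokens_available = tokens_available
        self.refill_per_sec = refill_per_sec
        self.call_provider = call_provider
        self._seen: Dict[str, Response] = {}   # idempotency key -> cached response

    def handle(self, idempotency_key: str, prompt: str, estimated_tokens: int) -> Response:
        # A retried submit replays the cached response instead of spending tokens again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        # Admission control: refuse locally before the provider has to send a 429.
        if estimated_tokens > self.tokens_available:
            deficit = estimated_tokens - self.tokens_available
            retry_after = max(1, math.ceil(deficit / self.refill_per_sec))
            return 503, {"Retry-After": str(retry_after)}, "local token budget exhausted"
        self.tokens_available -= estimated_tokens
        response: Response = (200, {}, self.call_provider(prompt))
        self._seen[idempotency_key] = response
        return response
```

Rejections are not cached, so the same idempotency key can succeed once the budget has refilled.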

Self-hosted inference (vLLM/TGI) shifts the bottleneck to GPU memory and batch size: limit concurrent sequences and max batch tokens instead of TPM. The same priority-queue logic still applies.

Multi-tenant budgets and cost caps

Quota management extends beyond provider limits to your economics:

Per-tenant daily token budget — hard stop or downgrade to cheaper model when exceeded; surface usage in admin UI.
Per-feature caps — free tier vs paid; agent loops vs single-shot summarization.
Cost attribution — tag requests with tenant_id, feature, prompt_version for chargeback and abuse detection.
Alerting — warn at 80% of budget; page when 429 rate spikes or queue depth exceeds SLO.
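A per-tenant daily budget with an 80% warning threshold could be tracked as in this sketch. `TenantBudgets` and its return values are hypothetical, usage is held in memory, and the caller decides whether "exceeded" means a hard stop or a downgrade to a cheaper model.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, Tuple


@dataclass
class TenantBudgets:
    daily_limit: int                 # tokens per tenant per day
    warn_ratio: float = 0.8          # warn at 80% of budget
    used: DefaultDict[Tuple[str, str], int] = field(
        default_factory=lambda: defaultdict(int))

    def record(self, tenant_id: str, day: str, tokens: int) -> str:
        """Charge tokens to (tenant, day) and return 'ok', 'warn', or 'exceeded'."""
        key = (tenant_id, day)
        self.used[key] += tokens
        if self.used[key] > self.daily_limit:
            return "exceeded"
        if self.used[key] >= self.warn_ratio * self.daily_limit:
            return "warn"
        return "ok"
```

Keying usage by day means budgets reset without a cleanup job, and adding `feature` to the key would give the per-feature attribution described above.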

Pair with cost optimization: a tenant burning TPM on repeated identical queries should hit semantic cache before the limiter allows another full completion.

Harbor Support refactor (worked example)

Before the gateway, Harbor ran three services on one OpenAI org key: live triage chat (P0), email draft suggestions (P1), and a cron embedding refresh (P2). P2 fired 200 parallel embedding calls at 02:00 UTC; morning shift P0 latency spiked.

After refactor:

LiteLLM proxy with Redis-backed dual buckets (TPM + RPM) per model.
P0 reserved 70% TPM; P2 capped at 15% and only runs 00:00–05:00 local unless P0 utilization < 30%.
Embedding refresh moved to provider batch API — zero contention with chat TPM.
Gateway returns 503 + Retry-After: 2 when local bucket empty; clients use jittered backoff instead of hammering OpenAI.
Dashboards: queue depth by class, 429 rate (provider vs gateway), estimate error.

Result: p95 chat latency stable under 2s during embedding windows; monthly 429 count dropped 94%. Tier upgrade deferred one quarter.
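The P2 scheduling rule from the refactor is simple enough to express as a predicate. This is a sketch, not Harbor's actual code: the function names are hypothetical, and treating 05:00 as the exclusive end of the window is an assumption.

```python
from datetime import time

P2_TPM_PERCENT = 15           # from the refactor: P2 capped at 15% of TPM
P0_IDLE_THRESHOLD = 0.30      # from the refactor: P2 may run if P0 utilization < 30%
WINDOW_START, WINDOW_END = time(0, 0), time(5, 0)   # 00:00-05:00 local


def p2_may_run(local_time: time, p0_utilization: float) -> bool:
    """True inside the off-peak window, or at any hour while P0 is mostly idle."""
    in_window = WINDOW_START <= local_time < WINDOW_END   # assumption: end is exclusive
    return in_window or p0_utilization < P0_IDLE_THRESHOLD


def p2_token_cap(model_tpm: int) -> int:
    """Tokens per minute P2 may use, given the model's TPM ceiling."""
    return model_tpm * P2_TPM_PERCENT // 100
```

For a hypothetical 200,000 TPM ceiling, `p2_token_cap` yields 30,000 tokens per minute for background work.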

Technique decision table

Approach	Best when	Weak when
Central gateway + token bucket	Multiple services, shared org quota, need priority lanes	Single tiny app with negligible traffic
Client-side semaphore only	Prototypes, one process, known low volume	Any multi-pod or multi-tenant deployment
Provider async batch API	Large offline jobs, separate quota pool, cost discounts	Sub-minute interactive SLAs
Circuit breaker + fallback model	Provider outage or hard 429 wall on primary model	Steady-state quota planning (treat as emergency, not plan A)
Semantic / prompt cache	Repeat queries, stable RAG chunks, high TPM burn on duplicates	Novel per-user long contexts every turn
Tier upgrade / dedicated capacity	Sustained legitimate growth after architecture is sound	Masking unfixed retry storms or missing batch routing

Common pitfalls

Retrying 429 without backoff — amplifies congestion; honor Retry-After and jitter.
Per-pod limiters — ten pods × 10 concurrent = 100 to provider while each pod thinks it is safe.
Ignoring output tokens in estimates — max_tokens reserves headroom; agent loops compound completion TPM.
Sharing one key across prod and staging — load tests take down production chat.
No P2 isolation — cron jobs and backfills belong on batch endpoints or off-peak schedules.
Blocking the event loop — synchronous sleep in retry loops stalls all requests on that worker.
Unbounded queue depth — users wait minutes without feedback; cap queue and fail with actionable errors.
Upgrading tier before measuring — 429s often mean architecture, not capacity ceiling.
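Several of these pitfalls (no backoff, ignoring Retry-After, blocking the event loop, unbounded retries) can be avoided together. The sketch below assumes a hypothetical async `send` callable that returns a `(status, retry_after_seconds, body)` tuple; it is not any SDK's interface.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, Tuple

Result = Tuple[int, Optional[float], str]   # (status, Retry-After seconds or None, body)


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Exponential backoff with full jitter; never shorter than the server's Retry-After."""
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * 2 ** attempt)
    jitter = rng.uniform(0, ceiling)          # spreads retries so pods do not synchronize
    return jitter if retry_after is None else retry_after + jitter


async def call_with_retries(send: Callable[[], Awaitable[Result]],
                            max_retries: int = 4,
                            sleep: Callable[[float], Awaitable[None]] = asyncio.sleep) -> Result:
    """Retry only on 429, at most max_retries times, without blocking the event loop."""
    attempt = 0
    while True:
        result = await send()
        if result[0] != 429 or attempt >= max_retries:
            return result
        await sleep(backoff_delay(attempt, result[1]))   # awaits instead of time.sleep
        attempt += 1
```

After the final allowed retry the 429 is returned to the caller rather than raised, which leaves the decision to queue, fall back, or surface an error with the calling code.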

Production checklist

Document provider TPM, RPM, and concurrent limits per model and tier.
Route all LLM traffic through a central gateway with shared rate state.
Implement dual buckets (tokens + requests) with pre-flight estimation.
Define P0/P1/P2 traffic classes with reserved headroom for interactive use.
Move deferrable workloads to async batch APIs or off-peak windows.
Return gateway 503 with Retry-After before provider 429 when possible.
Use exponential backoff + jitter on provider 429; cap max retries.
Reconcile estimated vs actual tokens from response usage metadata.
Per-tenant daily budgets and feature-level caps with admin visibility.
Alert on queue depth, 429 rate, and TPM utilization by class.
Separate API keys for staging, production, and batch jobs.

Key takeaways

TPM and RPM are independent constraints — limiters must track both.
Centralize quota state; per-service semaphores multiply into provider storms.
Priority lanes keep chat responsive; background work belongs on batch or off-peak.
Admission control at your gateway beats blind retries against OpenAI 429s.
Measure 429s and queue depth before buying the next tier upgrade.