Guide

LLM agent rate limiting and throttling explained

Harbor Commerce launched a conversational shopping agent that could search inventory, compare variants, check shipping zones, and apply promo codes in one turn. The planner often issued six to nine tool calls in parallel — each hitting Shopify Admin API, a geocoding service, and OpenAI for re-ranking. Black Friday traffic pushed the stack over the edge: 22% of runs returned cascading 429 errors, the model retried blindly, and checkout completion dropped 41% → 19% on the worst day. Support tickets blamed “the bot is broken” when the real failure was missing backpressure between an eager agent loop and hard upstream quotas.

Agent rate limiting is the discipline of shaping request volume before external APIs reject you. Unlike a single REST gateway limit, production agents need layered quotas: provider TPM/RPM caps, per-tenant fairness, per-integration tool budgets, and per-user session ceilings. Throttling is the softer cousin — slowing work with queues and adaptive delays instead of hard drops. This guide covers limit layers, token-bucket mechanics, integration with middleware hooks and parallel execution, structured observations when limits fire, the Harbor Commerce refactor, a technique decision table, pitfalls, and a production checklist.

Why agents need limits beyond the LLM provider

A chatbot that calls one completion endpoint inherits the provider’s RPM and TPM quotas. An agent is a fan-out machine: each model turn may spawn multiple tool calls, each tool may call multiple downstream APIs, and retries multiply traffic. Without agent-side shaping:

Retry storms — the model sees a generic error, calls the same tool again, and amplifies a transient 429 into an outage.
Noisy-neighbor tenants — one enterprise customer’s bulk import agent exhausts shared Shopify credentials for everyone.
Cost spikes — unbounded parallel model calls during tool-wait loops burn tokens with no user-visible progress; tie limits to cost attribution so finance sees which tenant blew the budget.
Compliance exposure — CRM and payment APIs treat aggressive automation as abuse; your API key gets revoked.

Rate limits are not pessimism — they are how you keep the agent usable when upstreams say no. The goal is predictable degradation: queue, shed, or return structured “try again in N seconds” observations the model can reason about.

Four layers of agent quota design

Mature deployments stack limits from outer to inner. Each layer has different keys, windows, and failure modes.

1. Provider layer (LLM API)

Track requests-per-minute (RPM), tokens-per-minute (TPM), and concurrent-stream limits per model and API key. Pre-flight estimation — count prompt tokens before send, reserve output budget — prevents mid-stream truncation. When TPM is tight, prefer smaller models for tool-planning steps and reserve the large model for final user-facing answers.

2. Tenant and organization layer

SaaS agents isolate quotas by tenant_id. A typical pattern: base RPM per seat, burst bucket for interactive chat, separate batch pool for overnight jobs. Hard stops return a billing upsell message; soft throttles queue excess runs for up to 30 seconds before failing.

3. Tool and integration layer

Each external integration gets its own limiter keyed by credential + endpoint family. Shopify product search and inventory mutation should not share one bucket — reads are cheap, writes are scarce. Register limits in the middleware pipeline as a pre-tool hook so every adapter inherits policy without copy-paste.

4. Session and user layer

Per-session ceilings stop runaway loops: max tool calls per turn, max turns per minute, max cumulative cost per conversation. When the user asks the agent to “check every SKU in the catalog,” the session limit forces pagination or human escalation instead of ten thousand API calls.

Token bucket, leaky bucket, and sliding window

Most agent gateways implement token buckets: a bucket holds B tokens, refilled at rate R per second. Each tool call or model request consumes one or weighted tokens (a heavy write might cost five). Buckets allow controlled bursts — important for interactive agents that need three quick reads then idle.

Leaky buckets smooth output to a fixed drain rate; use them when downstream cannot tolerate bursts at all (legacy mainframe connectors, partner webhooks with strict SLA wording).

Sliding windows count events in the last N seconds; they avoid the “double burst at window boundary” bug fixed windows suffer. Hybrid designs are common: sliding window for compliance audit (“no more than 100 writes per hour”) plus token bucket for sub-second burst control.

Distributed agents store bucket state in Redis or a dedicated quota service with atomic Lua scripts. Local in-memory buckets are fine for single-process dev; production needs shared state or sticky routing per tenant.

Throttling vs hard rejection

Hard reject returns immediately when the bucket is empty. Fast, simple, but the model may interpret it as tool failure and retry — worsening the storm.

Throttle (queue) holds the tool call until tokens refill, up to a max wait. The user sees a “checking inventory…” spinner; the run stays alive. Cap queue depth: unbounded waits tie up worker threads and cancellation handles.

Adaptive throttling watches upstream Retry-After headers and error rates. When Shopify returns 429, the gateway lowers R globally for that credential for five minutes — proactive shaping before the key is suspended.

Shedding drops low-priority work under pressure: skip re-ranking, return cached catalog slices, defer non-critical audit writes. Document priority tiers so product knows what degrades first.

Structured limit responses the model can use

Never return a bare HTTP 429 string to the model. Use a JSON observation envelope consistent with tool error handling:

{
  "ok": false,
  "error_code": "RATE_LIMITED",
  "retryable": true,
  "retry_after_ms": 2400,
  "limit_scope": "shopify_inventory_read",
  "message": "Inventory API quota exhausted. Retry after 2.4s or narrow SKU list."
}

Include retry_after_ms so the runtime can sleep before re-invoking without another model turn. Mark retryable: false for daily caps (“tenant used 10,000 searches today”) so the model escalates to the user instead of looping. Log limit events with tenant, tool, and remaining bucket tokens for dashboards.

Interaction with parallel tool execution

Parallel calls multiply burst demand. If the planner requests five reads and each costs one token but the bucket has three, you have three choices:

Serialize — run two now, queue three (preserves correctness, adds latency).
Partial execute — run three, return limit errors for two with retry hints (model may replan with fewer calls).
Pre-check batch — reserve N tokens atomically before spawning goroutines; fail fast if insufficient.

The parallel execution scheduler should consult the limiter at batch planning time, not only at each HTTP client. Otherwise you win a race and still trip upstream 429s.

Harbor Commerce refactor walkthrough

Harbor rebuilt the shopping agent around a central QuotaCoordinator:

ProviderGate — TPM/RPM token buckets per model; pre-flight token count rejects oversized prompts before billing.
TenantFairness — weighted fair queuing across merchants; no single store monopolizes shared geocoder quota.
ToolLimiter — per-integration buckets in pre-tool middleware; write tools cost 5× read tokens.
SessionGuard — 40 tool calls per conversation hard cap with user-visible “continue in new chat” prompt.
AdaptiveBackoff — parses partner Retry-After; temporarily halves refill rate on sustained 429s.
StructuredObservations — all limit hits return JSON the planner was fine-tuned to respect; blind retries disabled for RATE_LIMITED codes.

Black Friday replay on production traffic: provider 429 rate 22% → 0.4%, p95 tool latency rose only 180ms from queuing, checkout completion recovered to 36% (still below baseline but 1.9× the unthrottled disaster). Merchant NPS on the agent channel stopped its three-week slide.

Technique decision table

Scenario	Prefer	Avoid
Interactive chat with bursty reads	Token bucket with moderate burst	Fixed window with zero burst
Multi-tenant SaaS agent	Per-tenant buckets + global ceiling	Single shared bucket
Partner API with Retry-After	Adaptive throttle from header	Immediate blind retry
Parallel tool batch	Atomic batch reservation	Per-goroutine unchecked calls
Daily compliance cap	Hard reject with retryable=false	Silent queue past midnight
Cost runaway in agent loop	Session spend cap + model downgrade	Unlimited TPM on flagship model

Common pitfalls

Limiting only the LLM — tools cause most partner bans; ignore them and quotas look fine until Shopify revokes access.
Retry without jitter — synchronized retries from many agents re-trigger 429 exactly when the window opens.
In-memory buckets in Kubernetes — each pod thinks it has full quota; sum exceeds partner limits.
Opaque errors to the model — “Error 429” encourages useless retries; structured codes reduce loops.
No priority under pressure — audit logging competes with checkout writes; define shed order.
Limits without observability — you cannot tune R if bucket depth is not metered.
Bypass for internal tools — admin scripts become the traffic spike that takes down production credentials.

Production checklist

Provider TPM/RPM buckets with pre-flight token estimation.
Per-tenant quotas with fair queuing across customers.
Per-integration tool limiters in pre-tool middleware.
Session caps on tool calls, turns, and cumulative spend.
Structured RATE_LIMITED observations with retry_after_ms.
Batch token reservation before parallel tool execution.
Adaptive backoff on partner Retry-After and error rates.
Exponential backoff with jitter on runtime retries.
Dashboards: bucket depth, reject vs queue ratio, 429 by integration.
Load test with limiter enabled — not only happy-path mocks.
Runbook for credential suspension and fallback read paths.

Key takeaways

Agents fan out traffic — provider quotas alone are never enough; tool and tenant layers are mandatory.
Token buckets balance burst and fairness — pair with sliding windows for compliance caps.
Throttle before you hard-fail — but cap queue depth and return structured errors when waits exceed UX tolerance.
Parallel execution needs batch-aware limits — reserve tokens before spawning concurrent calls.
Harbor Commerce cut 429 cascades from 22% to 0.4% with layered QuotaCoordinator gates and model-readable limit observations.