Explainer · 7 June 2026

How rate limiting algorithms work

A public API, a blockchain RPC node, and a login endpoint all answer the same underlying question dozens of times per second: is this caller allowed to proceed right now? Rate limiting is the family of algorithms that answer it with a number — tokens remaining, queue depth, or requests in the last N milliseconds. The algorithm you pick is not cosmetic. It changes whether clients can burst after idle periods, whether traffic exits at a perfectly smooth drip, and whether two requests one millisecond apart can double your effective quota. This explainer focuses on the mechanics: what each algorithm stores, how it updates state on every request, and how to make limits consistent across a fleet of servers.

The shared abstraction: permits per time

Every limiter tracks state keyed by a dimension — API key, user ID, IP address, route, or a composite like org:123:export. On each incoming request the limiter either decrements available capacity and allows the call, or rejects it (HTTP 429, gRPC RESOURCE_EXHAUSTED, JSON-RPC -32429 rate limited).

Two parameters appear in almost every design:

  • Sustained rate — long-run average requests allowed (e.g. 100 per second).
  • Burst capacity — how far above the average a caller may spike before rejection (e.g. allow 500 in a short window, then throttle).

Algorithms differ in how they represent and refill that capacity — not in the business goal. For HTTP response headers, client backoff, and where to place limits in your stack, see our practical API rate limiting guide. Here we stay at the algorithm layer.

Token bucket

The token bucket is the workhorse of modern gateways. State per key is two numbers: tokens (current balance) and last_refill (timestamp). Parameters are capacity B (max tokens) and refill rate r tokens per second.

On each request:

  1. Compute elapsed time since last_refill.
  2. Add elapsed * r tokens, capped at B.
  3. If tokens >= 1, subtract one and allow; else reject.

The elegance is lazy refill — you never need a background timer ticking every millisecond. Refill happens only when a request arrives, which makes the bucket cheap in single-threaded middleware and easy to port to Redis with one atomic script.

Burst behavior is explicit: a client idle for ten seconds accumulates up to B tokens, then can spend them in a rapid burst before throttling to rate r. That matches how humans use APIs — quiet, then a batch import — and why Stripe, GitHub, and many RPC providers document limits as "requests per second with a burst allowance."

Generic Cell Rate Algorithm (GCRA) is a mathematically equivalent formulation used in telecom and in libraries like Envoy's global rate limit service. Instead of counting tokens, GCRA tracks a theoretical arrival time (TAT) for the next permitted request. Each arrival advances TAT by the inter-arrival interval 1/r, but TAT cannot lag more than B/r behind real time. GCRA produces the same allow/deny decisions as a token bucket with less floating-point drift in distributed implementations.

Leaky bucket

The leaky bucket metaphor is often confused with token bucket, but the semantics differ. Requests enter a FIFO queue; the bucket "leaks" at a fixed rate — one request processed (or one byte forwarded) per tick. If the queue is full, new arrivals are dropped or rejected immediately.

Key property: output is smoothed. Even if 1,000 clients hammer you in the same millisecond, downstream sees a steady drip at the leak rate. That protects fragile backends — legacy billing mainframes, partner webhooks with strict SLA, GPU inference queues — that cannot absorb spikes even if average load is fine.

Implementation variants:

  • Queue + worker — requests wait in line; latency grows under overload instead of failing fast. Users see slow responses, not 429.
  • Drop-tail / drop-head — queue has max depth; excess is rejected. Closer to "fail fast" semantics.

Token bucket shapes admission (can you enter?). Leaky bucket shapes egress (when does work leave?). Many systems combine both: token bucket at the edge for coarse admission, leaky bucket or worker pool inside for smooth processing. Pair egress shaping with load balancing so rejected or delayed work does not pile onto one unhealthy instance.

Fixed window counter

The simplest distributed limit: partition time into non-overlapping windows (calendar minute, hour) and count requests per key per window. In Redis: INCR rl:user:42:2026-06-07T12:30 with a TTL of 60 seconds; reject when count exceeds the max.

Pros: one atomic increment, tiny memory, trivial to explain in docs ("1,000 requests per hour"). Cons: the window boundary problem. A client sending 1,000 requests at 12:29:59 and another 1,000 at 12:30:00 executes 2,000 in two seconds while each window individually looks compliant. For bursty mobile apps or cron jobs aligned to the minute, effective throughput can be nearly double the documented limit.

Mitigations without full sliding windows:

  • Staggered sub-windows — track four 15-second counters inside a minute and require all to pass.
  • Randomized window start per key — spreads boundary effects across the fleet (harder to document clearly).

Sliding window log and sliding window counter

A sliding window log stores timestamps of recent requests (or their hashes). On each arrival, drop entries older than now - W, count what remains, and reject if count >= limit. Accuracy is exact for the definition "no more than N requests in any contiguous W-second interval."

Cost: memory proportional to N per key. At 10,000 req/s per hot key, storing 10,000 timestamps per second is prohibitive. Production systems use approximations.

The sliding window counter (popularized by Cloudflare and Redis rate-limit modules) blends the current fixed window count with the previous window, weighted by how far you are into the current window:

estimated = count_prev * (1 - elapsed/W) + count_curr

Reject when estimated >= limit. Memory stays O(1) per key (two counters), boundary spikes are damped, and behavior is close enough for most SaaS APIs. The trade-off is slight imprecision — you may occasionally allow or deny a request a well-behaved sliding log would not — which is acceptable when the goal is stopping abuse, not billing to the exact request.

Distributed rate limiting

An in-process HashMap<String, TokenBucket> breaks the moment you run two app servers behind a reverse proxy: each instance grants the full quota, so effective limits double with every replica. Centralized stores — Redis, Memcached, DynamoDB — hold authoritative counters.

Critical implementation details:

  • Atomicity — use INCR, Lua scripts, or Redis Cell (GCRA) so read-modify-write races do not over-grant under concurrency.
  • Clock skew — token refill based on wall clocks across nodes can drift. Prefer monotonic time per server for local buckets; for Redis, let the store's clock be authoritative.
  • Latency tax — every request pays a network round trip to Redis. Mitigate with local token lending: each instance holds a small batch of permits synced periodically; hot paths check local state, background tasks reconcile with central counters.
  • Failure mode — if Redis is down, choose fail open (allow traffic, risk overload) or fail closed (reject, protect backend). Payment and auth paths often fail closed; read-heavy CDNs fail open with a local emergency cap.

Eventually consistent limits are acceptable for abuse prevention: allowing 105 requests when the limit is 100 beats melting Postgres. Strict billing quotas need stronger consistency — often a single coordination service or per-tenant ledger, not a best-effort counter.

Choosing an algorithm

Algorithm State per key Burst tolerance Best when
Token bucket / GCRA O(1) High — bounded burst Public REST APIs, RPC providers, per-key SaaS tiers
Leaky bucket Queue depth Low — smooth output Protecting fixed-speed downstream workers
Fixed window O(1) Spikes at boundaries Simple Redis counters, low-traffic internal APIs
Sliding window counter O(1) Moderate High-traffic edge gateways needing fair windows
Sliding window log O(N) Exact Low-volume strict quotas, security-sensitive endpoints

Layer limits: coarse per-IP cap at the CDN, per-API-key token bucket at the gateway, stricter bucket on expensive routes (report export, chain simulation) in application middleware. Limits are admission control; circuit breakers are health control — reject because quota is exhausted vs reject because the dependency is on fire. Use both: rate limits stop the flood; breakers stop retry storms from propagating when something is already sick.

Common pitfalls

  • Retry amplification — clients that instantly retry 429s multiply load. Document backoff; server-side, consider penalizing repeat offenders with temporary ban buckets.
  • Shared NAT keys — IP-only limits punish corporate offices. Combine IP ceilings with authenticated per-key buckets.
  • Ignoring cost heterogeneity — one GET on /health is not equal to one POST that runs a simulation. Cost-based token buckets spend multiple tokens on expensive operations.
  • Limits without observability — chart 429 rate, top limited keys, and limiter Redis latency. Spikes often reveal a deploy bug before users file tickets.
  • Double limits that fight — edge allows, app rejects (or vice versa) with different windows. Document the effective limit as the minimum of all layers.

Practical checklist

  • Pick algorithm based on burst vs smoothness requirements, not familiarity.
  • Centralize counters for multi-instance deployments; use atomic updates.
  • Define dimensions (key, IP, route) and document effective quotas per tier.
  • Return machine-readable rejection metadata (Retry-After, reset time).
  • Pair limits with circuit breakers and timeouts on outbound dependencies.
  • Load-test limits in staging — especially window-boundary and burst-after-idle cases.

Related on Solana Garden: API rate limiting guide (HTTP 429 and client backoff), REST API design guide, Circuit breakers explained, Load balancing explained, Explainers hub.