Explainer · 7 June 2026
How rate limiting algorithms work
A public API, a blockchain RPC node, and a login endpoint all answer the same underlying question dozens of times per second: is this caller allowed to proceed right now? Rate limiting is the family of algorithms that answer it with a number — tokens remaining, queue depth, or requests in the last N milliseconds. The algorithm you pick is not cosmetic. It changes whether clients can burst after idle periods, whether traffic exits at a perfectly smooth drip, and whether two requests one millisecond apart can double your effective quota. This explainer focuses on the mechanics: what each algorithm stores, how it updates state on every request, and how to make limits consistent across a fleet of servers.
The shared abstraction: permits per time
Every limiter tracks state keyed by a dimension — API key,
user ID, IP address, route, or a composite like org:123:export.
On each incoming request the limiter either decrements available capacity and
allows the call, or rejects it (HTTP 429, gRPC
RESOURCE_EXHAUSTED, JSON-RPC -32429 rate limited).
Two parameters appear in almost every design:
- Sustained rate — long-run average requests allowed (e.g. 100 per second).
- Burst capacity — how far above the average a caller may spike before rejection (e.g. allow 500 in a short window, then throttle).
Algorithms differ in how they represent and refill that capacity — not in the business goal. For HTTP response headers, client backoff, and where to place limits in your stack, see our practical API rate limiting guide. Here we stay at the algorithm layer.
Token bucket
The token bucket is the workhorse of modern gateways. State per
key is two numbers: tokens (current balance) and
last_refill (timestamp). Parameters are capacity
B (max tokens) and refill rate r tokens per
second.
On each request:
- Compute elapsed time since
last_refill. - Add
elapsed * rtokens, capped atB. - If
tokens >= 1, subtract one and allow; else reject.
The elegance is lazy refill — you never need a background timer ticking every millisecond. Refill happens only when a request arrives, which makes the bucket cheap in single-threaded middleware and easy to port to Redis with one atomic script.
Burst behavior is explicit: a client idle for ten seconds accumulates up to
B tokens, then can spend them in a rapid burst before throttling to
rate r. That matches how humans use APIs — quiet, then a batch
import — and why Stripe, GitHub, and many RPC providers document limits as
"requests per second with a burst allowance."
Generic Cell Rate Algorithm (GCRA) is a mathematically equivalent
formulation used in telecom and in libraries like Envoy's global rate limit
service. Instead of counting tokens, GCRA tracks a theoretical arrival time
(TAT) for the next permitted request. Each arrival advances TAT by the
inter-arrival interval 1/r, but TAT cannot lag more than
B/r behind real time. GCRA produces the same allow/deny decisions
as a token bucket with less floating-point drift in distributed implementations.
Leaky bucket
The leaky bucket metaphor is often confused with token bucket, but the semantics differ. Requests enter a FIFO queue; the bucket "leaks" at a fixed rate — one request processed (or one byte forwarded) per tick. If the queue is full, new arrivals are dropped or rejected immediately.
Key property: output is smoothed. Even if 1,000 clients hammer you in the same millisecond, downstream sees a steady drip at the leak rate. That protects fragile backends — legacy billing mainframes, partner webhooks with strict SLA, GPU inference queues — that cannot absorb spikes even if average load is fine.
Implementation variants:
- Queue + worker — requests wait in line; latency grows under overload instead of failing fast. Users see slow responses, not 429.
- Drop-tail / drop-head — queue has max depth; excess is rejected. Closer to "fail fast" semantics.
Token bucket shapes admission (can you enter?). Leaky bucket shapes egress (when does work leave?). Many systems combine both: token bucket at the edge for coarse admission, leaky bucket or worker pool inside for smooth processing. Pair egress shaping with load balancing so rejected or delayed work does not pile onto one unhealthy instance.
Fixed window counter
The simplest distributed limit: partition time into non-overlapping windows
(calendar minute, hour) and count requests per key per window. In Redis:
INCR rl:user:42:2026-06-07T12:30 with a TTL of 60 seconds; reject
when count exceeds the max.
Pros: one atomic increment, tiny memory, trivial to explain in docs ("1,000 requests per hour"). Cons: the window boundary problem. A client sending 1,000 requests at 12:29:59 and another 1,000 at 12:30:00 executes 2,000 in two seconds while each window individually looks compliant. For bursty mobile apps or cron jobs aligned to the minute, effective throughput can be nearly double the documented limit.
Mitigations without full sliding windows:
- Staggered sub-windows — track four 15-second counters inside a minute and require all to pass.
- Randomized window start per key — spreads boundary effects across the fleet (harder to document clearly).
Sliding window log and sliding window counter
A sliding window log stores timestamps of recent requests (or
their hashes). On each arrival, drop entries older than now - W,
count what remains, and reject if count >= limit. Accuracy is exact for the
definition "no more than N requests in any contiguous W-second interval."
Cost: memory proportional to N per key. At 10,000 req/s per hot key, storing 10,000 timestamps per second is prohibitive. Production systems use approximations.
The sliding window counter (popularized by Cloudflare and Redis rate-limit modules) blends the current fixed window count with the previous window, weighted by how far you are into the current window:
estimated = count_prev * (1 - elapsed/W) + count_curr
Reject when estimated >= limit. Memory stays O(1) per key (two
counters), boundary spikes are damped, and behavior is close enough for most
SaaS APIs. The trade-off is slight imprecision — you may occasionally allow or
deny a request a well-behaved sliding log would not — which is acceptable when
the goal is stopping abuse, not billing to the exact request.
Distributed rate limiting
An in-process HashMap<String, TokenBucket> breaks the moment
you run two app servers behind a
reverse proxy:
each instance grants the full quota, so effective limits double with every replica.
Centralized stores — Redis, Memcached, DynamoDB — hold authoritative counters.
Critical implementation details:
- Atomicity — use
INCR, Lua scripts, or Redis Cell (GCRA) so read-modify-write races do not over-grant under concurrency. - Clock skew — token refill based on wall clocks across nodes can drift. Prefer monotonic time per server for local buckets; for Redis, let the store's clock be authoritative.
- Latency tax — every request pays a network round trip to Redis. Mitigate with local token lending: each instance holds a small batch of permits synced periodically; hot paths check local state, background tasks reconcile with central counters.
- Failure mode — if Redis is down, choose fail open (allow traffic, risk overload) or fail closed (reject, protect backend). Payment and auth paths often fail closed; read-heavy CDNs fail open with a local emergency cap.
Eventually consistent limits are acceptable for abuse prevention: allowing 105 requests when the limit is 100 beats melting Postgres. Strict billing quotas need stronger consistency — often a single coordination service or per-tenant ledger, not a best-effort counter.
Choosing an algorithm
| Algorithm | State per key | Burst tolerance | Best when |
|---|---|---|---|
| Token bucket / GCRA | O(1) | High — bounded burst | Public REST APIs, RPC providers, per-key SaaS tiers |
| Leaky bucket | Queue depth | Low — smooth output | Protecting fixed-speed downstream workers |
| Fixed window | O(1) | Spikes at boundaries | Simple Redis counters, low-traffic internal APIs |
| Sliding window counter | O(1) | Moderate | High-traffic edge gateways needing fair windows |
| Sliding window log | O(N) | Exact | Low-volume strict quotas, security-sensitive endpoints |
Layer limits: coarse per-IP cap at the CDN, per-API-key token bucket at the gateway, stricter bucket on expensive routes (report export, chain simulation) in application middleware. Limits are admission control; circuit breakers are health control — reject because quota is exhausted vs reject because the dependency is on fire. Use both: rate limits stop the flood; breakers stop retry storms from propagating when something is already sick.
Common pitfalls
- Retry amplification — clients that instantly retry 429s multiply load. Document backoff; server-side, consider penalizing repeat offenders with temporary ban buckets.
- Shared NAT keys — IP-only limits punish corporate offices. Combine IP ceilings with authenticated per-key buckets.
- Ignoring cost heterogeneity — one GET on
/healthis not equal to one POST that runs a simulation. Cost-based token buckets spend multiple tokens on expensive operations. - Limits without observability — chart 429 rate, top limited keys, and limiter Redis latency. Spikes often reveal a deploy bug before users file tickets.
- Double limits that fight — edge allows, app rejects (or vice versa) with different windows. Document the effective limit as the minimum of all layers.
Practical checklist
- Pick algorithm based on burst vs smoothness requirements, not familiarity.
- Centralize counters for multi-instance deployments; use atomic updates.
- Define dimensions (key, IP, route) and document effective quotas per tier.
- Return machine-readable rejection metadata (
Retry-After, reset time). - Pair limits with circuit breakers and timeouts on outbound dependencies.
- Load-test limits in staging — especially window-boundary and burst-after-idle cases.
Related on Solana Garden: API rate limiting guide (HTTP 429 and client backoff), REST API design guide, Circuit breakers explained, Load balancing explained, Explainers hub.