
API rate limiting explained: algorithms, 429 errors, and backoff

A single misconfigured cron job can send ten thousand requests per second to an endpoint that was sized for ten. Rate limiting is how APIs say "slow down" before that traffic melts the database, exhausts an upstream quota, or triggers a cloud bill spike. If you have ever seen 429 Too Many Requests from GitHub, Stripe, or a blockchain RPC, you have already met the client side of this pattern. This guide explains how limits are enforced, what the HTTP response means, and how to build clients that recover gracefully instead of hammering harder.

Why APIs rate-limit at all

Public endpoints share finite resources: CPU on app servers, connection slots in Postgres, egress bandwidth, and third-party API credits. Without limits, one noisy neighbor — a scraper, a bugged retry loop, or a viral launch — can degrade service for everyone else.

Good rate limiting balances three goals:

  • Fairness — no single API key or IP should monopolize capacity.
  • Stability — shed load before cascading failures (timeouts, connection pool exhaustion).
  • Predictability — documented quotas so integrators can size their apps.

Limits are not only about abuse. They also protect you from your own success: a webhook fan-out, a batch importer, or a mobile app update that polls too aggressively can look like an attack from the server's perspective.

Common rate-limiting algorithms

Every implementation approximates the same question: how many requests is this caller allowed in this time window? The algorithm you pick changes burst tolerance and how "fair" resets feel to clients.

Token bucket

Imagine a bucket that holds B tokens. Tokens refill at a steady rate (for example, 100 per second). Each request spends one token. If the bucket is empty, the request is rejected.

Token buckets are popular because they allow controlled bursts: a client that has been idle can send up to B requests immediately, after which it is throttled to the refill rate. Stripe and many RPC providers use variants of this model.
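A minimal single-process sketch of the token bucket described above. The class name, the parameter names (capacity for B, refill_rate for tokens per second), and the injectable clock are illustrative assumptions, not any provider's actual implementation.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled continuously at `refill_rate` per second."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock                 # injectable so tests can control time
        self.tokens = float(capacity)      # start full so an idle caller can burst
        self.last_refill = clock()

    def allow(self, cost=1):
        """Spend `cost` tokens and return True, or return False if too few remain."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Credit tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity=5 and refill_rate=1, a fresh bucket accepts five back-to-back calls and rejects the sixth; two seconds later it accepts two more.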

Leaky bucket

Requests enter a queue and leave at a fixed drip rate. Bursts are smoothed — excess requests wait or are dropped. Leaky buckets produce very steady outbound traffic, which is useful when downstream systems cannot handle spikes (legacy mainframes, strict partner SLAs).
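One way to sketch the queue-based leaky bucket, assuming the caller polls drain() to collect requests that are due. The names LeakyBucket, offer, and drain are hypothetical, and a real gateway would usually drive the drain from a timer or event loop.

```python
import time
from collections import deque


class LeakyBucket:
    """Queue of at most `capacity` requests, released at `leak_rate` per second."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity
        self.interval = 1.0 / leak_rate    # seconds between released requests
        self.clock = clock
        self.queue = deque()
        self.next_release = clock()

    def offer(self, request):
        """Queue a request; return False (dropped) if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # After an idle period, do not let unused drip slots pile up into a burst.
            self.next_release = max(self.next_release, self.clock())
        self.queue.append(request)
        return True

    def drain(self):
        """Return, in FIFO order, the queued requests whose release time has arrived."""
        now = self.clock()
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.interval
        return released
```

Unlike the token bucket, idle time earns no credit here: three requests arriving together at a rate of 2 per second leave at 0 s, 0.5 s, and 1.0 s.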

Fixed window counter

Count requests per calendar minute (or hour). Simple to implement in Redis: INCR user:123:2026-06-07T11:30, give the key an expiry so old windows clean themselves up, and compare the result to a max. The downside is the window edge problem: a client can send 100 requests at 11:29:59 and another 100 at 11:30:00, which is 200 requests in about one second while each window looks fine.
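The window edge problem is easy to reproduce with an in-memory stand-in for the Redis counter. This is a sketch under assumptions: FixedWindowLimiter is a hypothetical name, the dict replaces Redis, and old windows are never evicted (in Redis a key expiry would handle that).

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per caller per fixed window of `window_seconds`."""

    def __init__(self, limit, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = defaultdict(int)     # (caller, window index) -> request count

    def allow(self, caller):
        # All timestamps inside the same window map to the same integer index.
        window_index = int(self.clock() // self.window)
        key = (caller, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


# Demonstrate the edge problem with a controllable clock.
fake_now = [59.0]                          # last second of window 0
limiter = FixedWindowLimiter(limit=100, clock=lambda: fake_now[0])
before_edge = sum(limiter.allow("user:123") for _ in range(100))
fake_now[0] = 60.0                         # first second of window 1
after_edge = sum(limiter.allow("user:123") for _ in range(100))
```

Both batches are fully accepted, so 200 requests pass within about one second even though the limit is 100 per minute.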

Sliding window log or counter

Track timestamps of recent requests and count only those in the last N seconds. More accurate than fixed windows, slightly more memory per key. Hybrid sliding window counter schemes (weighted blend of current and previous window) are a common compromise in high-traffic gateways.
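The weighted-blend variant can be sketched as follows for a single caller. The estimate used here (previous window's count scaled by the share of it still inside the sliding window, plus the current count) is one common formulation; the class and attribute names are assumptions for illustration.

```python
import time


class SlidingWindowCounter:
    """Approximates a sliding window from two fixed-window counters."""

    def __init__(self, limit, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.current_index = None
        self.current_count = 0
        self.previous_count = 0

    def allow(self):
        now = self.clock()
        index = int(now // self.window)
        if index != self.current_index:
            # Roll over; the old count only matters if it was the adjacent window.
            adjacent = self.current_index is not None and index == self.current_index + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_index = index
        # Fraction of the current window that has already elapsed, in [0, 1).
        elapsed_fraction = (now % self.window) / self.window
        estimated = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

This closes the fixed-window loophole: after 100 requests at the end of one minute, a request at the first instant of the next minute is rejected, and only about half the quota is available 30 seconds in.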

What HTTP 429 means

429 Too Many Requests tells the client that it has exceeded a rate limit. Unlike 503 Service Unavailable (server overload) or 403 Forbidden (authorization failure), 429 says explicitly that the request was understood but rejected by quota policy. Well-behaved clients should back off, not retry instantly in a tight loop.

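Useful response headers (not always present; Retry-After is a standard HTTP header, while the X-RateLimit-* names are de facto conventions that vary between providers):

  • Retry-After: seconds or an HTTP-date until a retry is welcome.
  • X-RateLimit-Limit: quota for the window.
  • X-RateLimit-Remaining: requests left before the next reset.
  • X-RateLimit-Reset: when the quota resets. Often a Unix timestamp, but some APIs send the number of seconds remaining instead, so check the provider's documentation.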
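Because Retry-After can carry either a number of seconds or an HTTP-date, a client needs to handle both forms. A small sketch using only the standard library; the function name and the assumption that headers is a dict-like object keyed by "Retry-After" are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, now=None):
    """Return the seconds to wait per Retry-After, or None if absent or unparseable."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "Retry-After: 12".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the caller may retry right away.
    return max(0.0, (when - now).total_seconds())
```

Returning None for a missing or malformed header lets the caller fall back to its own backoff schedule.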

Some blockchain RPC providers signal throttling with a JSON-RPC error (for example, code -32429 and message rate limited) instead of a plain HTTP 429. The exact code is provider-specific, but the idea is the same in a different wire format. Our Solana RPC endpoints guide covers fallback endpoints when primary RPCs throttle you.

Client-side recovery: backoff and jitter

The worst thing a client can do after a 429 is retry immediately — that turns one overloaded service into a retry storm. Standard pattern:

  1. Read Retry-After if present; sleep that long.
  2. Otherwise use exponential backoff: wait 1s, then 2s, 4s, 8s… capped at a max (often 30–60s).
  3. Add jitter — randomize ±20% of the delay so ten thousand clients do not wake up in sync.
  4. Give up after N attempts and surface a user-visible error or queue the job for later.
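The four steps above can be sketched as a delay function plus a retry loop. The send callable, its (status, retry_after, body) return shape, and the default values are assumptions made for illustration; a real client would adapt them to its HTTP library.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.2, rng=random.random):
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped, +/- jitter."""
    delay = min(cap, base * (2 ** attempt))
    # rng() is in [0, 1); map it to a factor in [1 - jitter, 1 + jitter).
    factor = 1 + jitter * (2 * rng() - 1)
    return delay * factor


def call_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call `send()` until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status, retry_after_or_None, body).
    """
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break                          # no point sleeping before giving up
        # Step 1: honor the server's hint; step 2 and 3: otherwise back off with jitter.
        delay = retry_after if retry_after is not None else backoff_delay(attempt)
        sleep(delay)
    # Step 4: surface the failure instead of retrying forever.
    raise RuntimeError("rate limited: gave up after %d attempts" % max_attempts)
```

Passing sleep and rng as parameters keeps the logic testable without real waiting; a job queue could catch the RuntimeError and reschedule the work for later.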

For idempotent reads (GET, status polls), retries are safe. For writes (POST that creates a charge), use idempotency keys so a retried request cannot double-charge. Payment verification flows — like checking whether a Solana transfer landed — should space polls at human-scale intervals (a few seconds), not sub-second loops that burn RPC quota.
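To illustrate why an idempotency key makes retried writes safe, here is a toy model. FakePaymentServer and charge_with_retry are hypothetical stand-ins, not a real payment API; the point is only that the key is generated once per logical operation and reused on every retry.

```python
import uuid


class FakePaymentServer:
    """Stand-in server that remembers results by idempotency key."""

    def __init__(self):
        self.results = {}
        self.charges_created = 0

    def create_charge(self, amount, idempotency_key):
        if idempotency_key in self.results:
            # Replay of a request already processed: return the stored result.
            return self.results[idempotency_key]
        self.charges_created += 1
        result = {"charge_id": self.charges_created, "amount": amount}
        self.results[idempotency_key] = result
        return result


def charge_with_retry(server, amount, attempts=3):
    """Send the same charge `attempts` times, as a client would after lost responses."""
    key = str(uuid.uuid4())                # one key per logical operation
    result = None
    for _ in range(attempts):
        result = server.create_charge(amount, idempotency_key=key)
    return result
```

Three sends with the same key create one charge. Generating a fresh key inside the loop would defeat the purpose and create three.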

Long-lived connections reduce poll pressure. If you need live updates, prefer WebSockets or server-sent events over hammering a REST status endpoint every 200 ms.

Where to enforce limits

Rate limits can live at several layers — often more than one:

  • Edge / CDN — cheap, stops garbage before it hits your origin. Pair with HTTP caching so repeat reads never reach the app.
  • API gateway — central place for per-key quotas, JWT claims, and WAF rules.
  • Application middleware — fine-grained limits per route (login vs search vs export).
  • Database — connection pool caps and query timeouts are implicit rate limits; slow queries are often worse than rejected requests.

Per-API-key limits beat raw IP limits when traffic comes through NAT (corporate offices, mobile carriers). IP limits still help against unauthenticated scraping. Combine both: anonymous IPs get a low ceiling; authenticated keys get higher tiers.
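A sketch of the combined policy: pick which bucket a request counts against and what ceiling applies. The function name, the key_tiers mapping, and the default anonymous limit are hypothetical values chosen for illustration.

```python
def resolve_limit(api_key, client_ip, key_tiers, anonymous_limit=20):
    """Return (bucket_id, requests_per_minute) for a request.

    `key_tiers` maps known API keys to their per-minute quota (assumed shape).
    """
    if api_key is not None and api_key in key_tiers:
        # Authenticated traffic is counted per key, so users behind one NAT
        # address do not eat into each other's quota.
        return "key:" + api_key, key_tiers[api_key]
    # Unknown or missing key: fall back to a low per-IP ceiling.
    return "ip:" + client_ip, anonymous_limit


tiers = {"free_abc": 100, "pro_xyz": 5000}
authenticated = resolve_limit("pro_xyz", "203.0.113.7", tiers)
anonymous = resolve_limit(None, "203.0.113.7", tiers)
```

The returned bucket_id can then be used as the counter key in whichever algorithm enforces the limit.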

Distributed rate limiting pitfalls

A single-server in-memory counter breaks the moment you run two instances behind a load balancer — each box thinks it has the full quota. Shared stores (Redis, Memcached, DynamoDB) centralize counts but add latency and a new failure mode.

Practical tips:

  • Use atomic increment-with-TTL operations; avoid read-modify-write races.
  • Prefer eventual consistency for soft limits — slightly exceeding quota is OK if you stop the flood.
  • Log limit hits with caller identity; spikes often reveal a deploy bug before users complain.
  • Separate costly endpoints (report generation, chain simulation) into stricter buckets than cheap health checks.
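For the first tip, one possible shape of an atomic increment-with-TTL check, assuming a redis-py style client whose pipeline() wraps the queued commands in a MULTI/EXEC transaction. The function name, key format, and TTL choice are assumptions; the client is passed in, so nothing here connects to a real server.

```python
import time


def allow_request(client, caller, limit, window_seconds=60, now=None):
    """Fixed-window check backed by a shared store.

    `client` is assumed to behave like a redis-py client: pipeline() returns an
    object with incr(), expire(), and execute().
    """
    now = time.time() if now is None else now
    window_index = int(now // window_seconds)
    key = "ratelimit:%s:%d" % (caller, window_index)
    pipe = client.pipeline()
    # INCR is atomic on the server, so concurrent app instances cannot lose
    # updates the way a GET followed by a SET could.
    pipe.incr(key)
    # The TTL only garbage-collects old windows; the key name already pins the window.
    pipe.expire(key, window_seconds * 2)
    count, _ = pipe.execute()
    return count <= limit
```

Because the count is incremented before it is compared, rejected requests still consume a slot in this sketch; whether that is desirable depends on how strictly the flood should be penalized.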

Database-heavy endpoints should also be optimized — a rate-limited query that still scans a million rows wastes disk I/O. Indexes and query plans matter as much as request counts; see our database indexing guide for the data-layer side of the same problem.

Designing quotas developers will tolerate

Opaque limits breed angry integrators. Document:

  • Requests per second and per day for each tier.
  • Whether limits are per key, per IP, or per organization.
  • What happens at 80% of quota (warning header?) vs 100% (hard 429).
  • How to request a higher tier and what metrics you use to approve it.

Return structured error bodies: {"error":"rate_limit_exceeded","retry_after":12,"limit":100,"window":"1m"} so SDKs can parse them. Machine-readable beats prose in a plain-text body.
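On the SDK side, consuming such a body might look like the sketch below. The function name is hypothetical and the field names simply mirror the example body above; a real SDK would follow whatever schema the API documents.

```python
import json


def parse_rate_limit_error(body):
    """Extract rate-limit details from a JSON error body, or return None."""
    try:
        data = json.loads(body)
    except ValueError:
        # Not JSON at all (for example an HTML error page from a proxy).
        return None
    if not isinstance(data, dict) or data.get("error") != "rate_limit_exceeded":
        return None
    return {
        "retry_after": data.get("retry_after"),
        "limit": data.get("limit"),
        "window": data.get("window"),
    }


details = parse_rate_limit_error(
    '{"error":"rate_limit_exceeded","retry_after":12,"limit":100,"window":"1m"}'
)
```

Here details["retry_after"] is 12, which the client can feed straight into its backoff logic.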

Test your own limits in staging with load tools before launch. The goal is not to block legitimate traffic — it is to cap the tail risk so the API stays fast for everyone during a traffic spike.

Quick reference

Algorithm      | Burst friendly?           | Typical use
Token bucket   | Yes                       | Public REST APIs, RPC providers
Leaky bucket   | No (smooth output)        | Strict downstream partners
Fixed window   | Edge spikes at boundaries | Simple Redis counters
Sliding window | Moderate                  | High-accuracy SaaS APIs
