API rate limiting explained: algorithms, 429 errors, and backoff
A single misconfigured cron job can send ten thousand requests per second to an
endpoint that was sized for ten. Rate limiting is how APIs say
"slow down" before that traffic melts the database, exhausts an upstream quota, or
triggers a cloud bill spike. If you have ever seen 429 Too Many Requests
from GitHub, Stripe, or a blockchain RPC, you have already met the client side of
this pattern. This guide explains how limits are enforced, what the HTTP response
means, and how to build clients that recover gracefully instead of hammering harder.
Why APIs rate-limit at all
Public endpoints share finite resources: CPU on app servers, connection slots in Postgres, egress bandwidth, and third-party API credits. Without limits, one noisy neighbor — a scraper, a bugged retry loop, or a viral launch — can degrade service for everyone else.
Good rate limiting balances three goals:
- Fairness — no single API key or IP should monopolize capacity.
- Stability — shed load before cascading failures (timeouts, connection pool exhaustion).
- Predictability — documented quotas so integrators can size their apps.
Limits are not only about abuse. They also protect you from your own success: a webhook fan-out, a batch importer, or a mobile app update that polls too aggressively can look like an attack from the server's perspective.
Common rate-limiting algorithms
Every implementation approximates the same question: how many requests is this caller allowed in this time window? The algorithm you pick changes burst tolerance and how "fair" resets feel to clients.
Token bucket
Imagine a bucket that holds B tokens. Tokens refill at a steady rate (for example, 100 per second). Each request spends one token. If the bucket is empty, the request is rejected.
Token buckets are popular because they allow controlled bursts: a client that has been idle can send up to B requests at once (a full bucket), after which it is throttled to the refill rate. Stripe and many RPC providers use variants of this model.
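The refill-and-spend logic can be sketched in a few lines. This is a minimal single-process illustration, not any provider's implementation; the class name, the `cost` parameter, and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.rate = float(rate)
        self.clock = clock              # injectable so the sketch can be tested
        self.tokens = float(capacity)   # an idle client starts with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # bucket empty: the caller would answer with a 429
```

With `capacity=5` and `rate=1`, five back-to-back calls pass, the sixth is rejected, and two more pass after two seconds of idle time.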
Leaky bucket
Requests enter a queue and leave at a fixed drip rate. Bursts are smoothed — excess requests wait or are dropped. Leaky buckets produce very steady outbound traffic, which is useful when downstream systems cannot handle spikes (legacy mainframes, strict partner SLAs).
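A rough sketch of the queue-and-drip behavior, assuming the caller passes the current time explicitly (a real gateway would more likely drive `drain` from a timer or worker loop). The class and method names are hypothetical.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue whose items are released at a fixed drip rate."""

    def __init__(self, capacity, drip_rate, start=0.0):
        self.capacity = capacity
        self.interval = 1.0 / drip_rate   # seconds between releases
        self.queue = deque()
        self.next_release = start

    def offer(self, request, now):
        if len(self.queue) >= self.capacity:
            return False  # bucket full: excess requests are dropped
        if not self.queue:
            # Idle time earns no credit: never schedule a release in the past.
            self.next_release = max(self.next_release, now)
        self.queue.append(request)
        return True

    def drain(self, now):
        """Return the requests whose release time has arrived by `now`."""
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.interval
        return released
```

Unlike the token bucket, a burst that arrives after a quiet period is still released one item per interval.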
Fixed window counter
Count requests per calendar minute (or hour). Simple to implement in Redis:
run INCR user:123:2026-06-07T11:30 and compare the result to a maximum. The
downside is the window edge problem: with a limit of 100 per minute, a client
can send 100 requests at 11:29:59 and another 100 at 11:30:00. That is 200
requests in two seconds, while each window looks fine on its own.
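The edge problem is easy to reproduce. The sketch below uses an in-memory dictionary as a stand-in for the Redis counter (an assumption for the sake of a self-contained example; it never expires old keys, which a real store would do with a TTL).

```python
from collections import defaultdict
from datetime import datetime


class FixedWindowLimiter:
    """One counter per user per calendar minute."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)  # stands in for Redis keys

    def allow(self, user, when):
        # Key format mirrors the Redis example: user:<id>:<minute>.
        key = f"user:{user}:{when.strftime('%Y-%m-%dT%H:%M')}"
        self.counts[key] += 1           # the INCR step
        return self.counts[key] <= self.limit


limiter = FixedWindowLimiter(limit=100)
end_of_window = datetime(2026, 6, 7, 11, 29, 59)
start_of_next = datetime(2026, 6, 7, 11, 30, 0)
accepted = sum(limiter.allow("123", end_of_window) for _ in range(100))
accepted += sum(limiter.allow("123", start_of_next) for _ in range(100))
# `accepted` is 200 even though the nominal limit is 100 per minute.
```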
Sliding window log or counter
Track timestamps of recent requests and count only those in the last N seconds. More accurate than fixed windows, slightly more memory per key. Hybrid sliding window counter schemes (weighted blend of current and previous window) are a common compromise in high-traffic gateways.
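One way to picture the hybrid counter: weight the previous window's count by how much of it still overlaps the sliding window, then add the current count. The sketch below is an approximation under that assumption, with hypothetical names, and it assumes timestamps never go backwards.

```python
class SlidingWindowCounter:
    """Approximates a sliding window by blending two fixed-window counts."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window   # window length in seconds
        self.index = None      # which fixed window the last request fell into
        self.curr = 0
        self.prev = 0

    def allow(self, now):
        index = int(now // self.window)
        if self.index is None:
            self.index = index
        elif index != self.index:
            # Roll over; the old count only matters if the windows are adjacent.
            self.prev = self.curr if index == self.index + 1 else 0
            self.curr = 0
            self.index = index
        elapsed = (now % self.window) / self.window  # fraction of current window used
        estimate = self.prev * (1.0 - elapsed) + self.curr
        if estimate + 1 > self.limit:
            return False
        self.curr += 1
        return True
```

With a limit of 100 per 60 seconds, a client that used its full quota at second 59 is rejected at second 60 and regains roughly half its quota by second 90, which closes the boundary loophole of the fixed window.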
What HTTP 429 means
429 Too Many Requests tells the client that it has exceeded a limit. Unlike
503 Service Unavailable (server overload) or 403 Forbidden
(authorization failure), 429 says the request was understood but
rejected by quota policy. Well-behaved clients should back off, not retry
instantly in a tight loop.
Useful response headers (Retry-After is standardized; the X-RateLimit-* names are de facto conventions; none is guaranteed to be present):
- Retry-After: seconds or HTTP-date until retry is welcome.
- X-RateLimit-Limit: quota for the window.
- X-RateLimit-Remaining: requests left before the next reset.
- X-RateLimit-Reset: Unix timestamp when the bucket refills.
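Because Retry-After can carry either a number of seconds or an HTTP-date, a client needs to handle both forms. A small sketch using only the standard library; the function name and the `now` parameter are illustrative assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return the seconds to wait for a Retry-After value, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)             # delta-seconds form, e.g. "12"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None                     # malformed header: caller falls back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry right away.
    return max(0.0, (when - now).total_seconds())
```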
Some blockchain RPC providers signal throttling inside a JSON-RPC error object
(for example, code -32429 with the message rate limited) instead of, or
alongside, a plain HTTP 429. Codes vary by provider, but the idea is the same
and only the wire format differs. Our Solana RPC endpoints guide covers
fallback endpoints when primary RPCs throttle you.
Client-side recovery: backoff and jitter
The worst thing a client can do after a 429 is retry immediately — that turns one overloaded service into a retry storm. Standard pattern:
- Read Retry-After if present; sleep that long.
- Otherwise use exponential backoff: wait 1s, then 2s, 4s, 8s… capped at a max (often 30–60s).
- Add jitter — randomize ±20% of the delay so ten thousand clients do not wake up in sync.
- Give up after N attempts and surface a user-visible error or queue the job for later.
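The four steps above can be sketched as a small retry helper. `send` is a hypothetical callable standing in for whatever HTTP client you use; it is assumed to return a `(status, retry_after_seconds_or_None, body)` tuple. The defaults (1s base, 60s cap, ±20% jitter, 5 attempts) are example values, not recommendations from any specific API.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.2, rng=random.random):
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped, +/- jitter."""
    delay = min(cap, base * (2 ** attempt))
    # rng() is in [0, 1); map it to a factor in [1 - jitter, 1 + jitter).
    factor = 1.0 + jitter * (2.0 * rng() - 1.0)
    return delay * factor


def call_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out."""
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status != 429:
            return body
        if attempt == max_attempts - 1:
            break  # no point sleeping before giving up
        # Prefer the server's hint; otherwise fall back to exponential backoff.
        sleep(retry_after if retry_after is not None else backoff_delay(attempt))
    raise RuntimeError(f"rate limited: gave up after {max_attempts} attempts")
```

Passing `sleep` and `rng` as parameters keeps the helper testable without real waiting.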
For idempotent reads (GET, status polls), retries are safe. For writes (POST that creates a charge), use idempotency keys so a retried request cannot double-charge. Payment verification flows — like checking whether a Solana transfer landed — should space polls at human-scale intervals (a few seconds), not sub-second loops that burn RPC quota.
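To illustrate why an idempotency key makes a retried write safe, here is a toy server-side sketch. The `ChargeService` class and its in-memory storage are hypothetical; real payment APIs persist keys and typically expire them after some period.

```python
import uuid


class ChargeService:
    """Toy service that deduplicates writes by idempotency key."""

    def __init__(self):
        self.results = {}   # idempotency key -> stored response
        self.charges = []

    def create_charge(self, idempotency_key, amount):
        # A retry with the same key gets the original result, not a new charge.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        charge = {"id": len(self.charges) + 1, "amount": amount}
        self.charges.append(charge)
        self.results[idempotency_key] = charge
        return charge


service = ChargeService()
key = str(uuid.uuid4())          # the client generates one key per logical operation
first = service.create_charge(key, 500)
retry = service.create_charge(key, 500)   # e.g. resent after a 429 or a timeout
```

Both calls return the same charge, and only one charge is recorded.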
Long-lived connections reduce poll pressure. If you need live updates, prefer WebSockets or server-sent events over hammering a REST status endpoint every 200 ms.
Where to enforce limits
Rate limits can live at several layers — often more than one:
- Edge / CDN — cheap, stops garbage before it hits your origin. Pair with HTTP caching so repeat reads never reach the app.
- API gateway — central place for per-key quotas, JWT claims, and WAF rules.
- Application middleware — fine-grained limits per route (login vs search vs export).
- Database — connection pool caps and query timeouts are implicit rate limits; slow queries are often worse than rejected requests.
Per-API-key limits beat raw IP limits when traffic comes through NAT (corporate offices, mobile carriers). IP limits still help against unauthenticated scraping. Combine both: anonymous IPs get a low ceiling; authenticated keys get higher tiers.
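A sketch of how a limiter might pick the bucket and ceiling for a request under that combined policy. The tier names, the numbers, and the `bucket_for` helper are made up for illustration.

```python
ANONYMOUS_LIMIT = 20                        # hypothetical per-IP ceiling
TIER_LIMITS = {"free": 100, "pro": 1000}    # hypothetical per-key tiers


def bucket_for(ip, api_key=None, key_tiers=None):
    """Return (bucket_id, limit): per-key when the key is known, else per-IP."""
    key_tiers = key_tiers or {}
    tier = key_tiers.get(api_key) if api_key else None
    if tier in TIER_LIMITS:
        return f"key:{api_key}", TIER_LIMITS[tier]
    # Unknown or missing key: treat the caller as anonymous traffic.
    return f"ip:{ip}", ANONYMOUS_LIMIT
```

Two authenticated users behind the same NAT address land in separate `key:` buckets, while unauthenticated traffic from that address shares one low `ip:` bucket.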
Distributed rate limiting pitfalls
A single-server in-memory counter breaks the moment you run two instances behind a load balancer — each box thinks it has the full quota. Shared stores (Redis, Memcached, DynamoDB) centralize counts but add latency and a new failure mode.
Practical tips:
- Use atomic increment-with-TTL operations; avoid read-modify-write races.
- Prefer eventual consistency for soft limits — slightly exceeding quota is OK if you stop the flood.
- Log limit hits with caller identity; spikes often reveal a deploy bug before users complain.
- Separate costly endpoints (report generation, chain simulation) into stricter buckets than cheap health checks.
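The first tip is the one most often gotten wrong. The sketch below uses an in-memory class with a lock as a stand-in for a shared store (an assumption to keep the example runnable; with Redis the same guarantee usually comes from a single atomic command or script rather than a client-side lock). The increment and the expiry check happen as one step, so concurrent callers cannot overwrite each other's counts.

```python
import threading
import time


class CounterStore:
    """In-memory stand-in for a store offering atomic increment-with-TTL."""

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._data = {}          # key -> (count, expires_at)
        self._clock = clock

    def incr_with_ttl(self, key, ttl):
        # Read, reset-if-expired, increment, and write all under one lock.
        with self._lock:
            now = self._clock()
            entry = self._data.get(key)
            if entry is None or now >= entry[1]:
                count, expires_at = 0, now + ttl   # start a fresh window
            else:
                count, expires_at = entry
            count += 1
            self._data[key] = (count, expires_at)
            return count
```

A racy version would read the count, add one in application code, and write it back, which loses updates whenever two instances interleave.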
Database-heavy endpoints should also be optimized — a rate-limited query that still scans a million rows wastes disk I/O. Indexes and query plans matter as much as request counts; see our database indexing guide for the data-layer side of the same problem.
Designing quotas developers will tolerate
Opaque limits breed angry integrators. Document:
- Requests per second and per day for each tier.
- Whether limits are per key, per IP, or per organization.
- What happens at 80% of quota (warning header?) vs 100% (hard 429).
- How to request a higher tier and what metrics you use to approve it.
Return structured error bodies:
{"error":"rate_limit_exceeded","retry_after":12,"limit":100,"window":"1m"}
so SDKs can parse them. Machine-readable beats prose in a plain-text body.
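A short sketch of both sides of that contract: a server helper that emits the body shown above and a client helper that reads it. The function names are hypothetical, and the field names simply follow the example body.

```python
import json


def rate_limit_body(retry_after, limit, window):
    """Serialize the structured 429 body in compact form."""
    payload = {
        "error": "rate_limit_exceeded",
        "retry_after": retry_after,
        "limit": limit,
        "window": window,
    }
    return json.dumps(payload, separators=(",", ":"))


def parse_retry_after(body):
    """Return retry_after from a structured 429 body, or None if it is not one."""
    try:
        data = json.loads(body)
    except ValueError:
        return None  # plain-text or malformed body: nothing to parse
    if not isinstance(data, dict) or data.get("error") != "rate_limit_exceeded":
        return None
    return data.get("retry_after")
```

A client that gets `None` back has to fall back to headers or blind backoff, which is the cost of a prose-only error body.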
Test your own limits in staging with load tools before launch. The goal is not to block legitimate traffic — it is to cap the tail risk so the API stays fast for everyone during a traffic spike.
Quick reference
| Algorithm | Burst friendly? | Typical use |
|---|---|---|
| Token bucket | Yes | Public REST APIs, RPC providers |
| Leaky bucket | No — smooth output | Strict downstream partners |
| Fixed window | Edge spikes at boundaries | Simple Redis counters |
| Sliding window | Moderate | High-accuracy SaaS APIs |
Related reading
- Solana RPC endpoints explained — 429 errors, fallbacks, and health checks
- Database indexing explained — reduce per-request DB cost under load
- HTTP caching explained — serve repeat reads without hitting origin