Load testing explained: baseline, stress, soak, and spike tests

Load testing simulates real user traffic against your application to answer a question unit tests cannot: what happens when hundreds or thousands of people hit this at once? A checkout API that passes every functional test can still melt down under concurrent writes, exhaust database connections, or queue payouts until latency crosses your SLO and users abandon the flow. This guide explains the four main test types and the metrics that actually matter (requests per second, error rate, and latency percentiles). It also covers how to design realistic scenarios with tools like k6 and Locust, how to interpret saturation curves, and how to pair load tests with observability so you fix bottlenecks instead of guessing.

Load testing vs other performance tests

Performance testing is an umbrella term. Teams often conflate the types and draw wrong conclusions: a system that passes a spike test will not necessarily survive an eight-hour Black Friday. Know which question each test answers:

Load (baseline) test — sustain expected peak traffic for a fixed window. Validates capacity at the traffic level you plan for today.
Stress test — ramp load beyond expected peak until something breaks. Finds the knee of the curve: where latency explodes or errors spike.
Soak (endurance) test — moderate load for hours or days. Catches memory leaks, connection pool exhaustion, log disk fill, and GC pause creep that short bursts hide.
Spike test — sudden burst from low to very high concurrency. Models viral links, flash sales, or bot swarms hitting one endpoint.
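
For the soak case in particular, one simple way to turn "memory creep" into a number is to fit a straight line through memory samples taken during the run and look at the slope. The sketch below is a minimal illustration in plain Python. The function name `growth_per_hour` and the sample format (elapsed hours, resident memory in MB) are assumptions made for this example, not the output of any particular tool.

```python
def growth_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of memory over time, in MB per hour.

    `samples` is a list of (elapsed_hours, memory_mb) pairs. The format is
    hypothetical; adapt it to whatever your monitoring system exports.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")

    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n

    # Slope = covariance(time, memory) / variance(time).
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    return covariance / variance
```

A slope near zero after warm-up suggests a stable process. A steady positive slope over many hours is a reason to take a heap profile; how steep is too steep depends on your memory budget and the length of the event you are preparing for.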

Functional software testing proves correctness for one user at a time. Load testing proves the system still behaves acceptably when many users compete for the same CPU, locks, and I/O. Both belong in a mature release pipeline — neither replaces the other.

Metrics that matter

Vanity dashboards showing average latency are misleading. Under load, a long tail of slow requests ruins user experience even when the mean looks fine. Track these service-level indicators consistently:

Throughput and errors

Requests per second (RPS) or transactions per second measures how much work the system completes. Pair it with error rate — HTTP 5xx, timeout fractions, and business-level failures (payment declined because the ledger lock timed out). A system that maintains RPS only by returning errors is not healthy.
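
To make the pairing of throughput and errors concrete, here is a minimal sketch that reports attempted and successful requests per second side by side. The `RequestResult` record, the convention of status `0` for a timeout, and the `business_ok` flag are all assumptions for this example; real load tools export their own result formats.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    status: int        # HTTP status code; 0 means the request timed out (assumed convention)
    business_ok: bool  # False for business-level failures, even if the HTTP status was 2xx


def summarize(results: list[RequestResult], window_seconds: float) -> dict[str, float]:
    """Return attempted RPS, successful RPS, and error rate for one time window."""
    total = len(results)
    failed = sum(
        1
        for r in results
        if r.status == 0 or r.status >= 500 or not r.business_ok
    )
    return {
        "attempted_rps": total / window_seconds,
        "successful_rps": (total - failed) / window_seconds,
        "error_rate": failed / total if total else 0.0,
    }
```

If attempted RPS keeps rising across load steps while successful RPS flattens, the system is holding its throughput number only by failing requests.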

Latency percentiles

Report p50 (median), p95, and p99, not just averages. If p99 checkout latency is 8 seconds while p50 is 200 ms, one request in a hundred takes at least 8 seconds, long enough for that user to churn. SLO targets are usually written against p95 or p99 (e.g. “99% of API calls complete under 300 ms”). Compare percentiles across load steps: a flat p50 with a climbing p99 signals queueing or lock contention.
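
A short sketch shows why the mean hides the tail. It uses the nearest-rank definition of a percentile, which is one of several common definitions; load tools may interpolate differently, so treat small differences from your tool's report as expected. The function name `percentile` and the sample data are illustrative.

```python
import math


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative data: 98 fast requests and 2 very slow ones.
latencies_ms = [200] * 98 + [8000] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)  # 356.0
p50_ms = percentile(latencies_ms, 50)            # 200
p99_ms = percentile(latencies_ms, 99)            # 8000
```

The mean of 356 ms looks tolerable, while p99 shows that the slowest requests take 8 seconds.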

Saturation signals

Watch resource utilization on the server side while the test runs: CPU, memory, database connections in use, disk I/O wait, thread pool queue depth, and RPC rate-limit counters. The bottleneck is rarely where you first guess — a “slow API” often traces to an N+1 query or an unindexed lookup under concurrency, not insufficient application replicas.

Designing realistic scenarios

Synthetic tests that hammer one GET endpoint with identical payloads teach you how fast nginx returns static JSON — not how your product behaves. Model user journeys: browse, authenticate, add to cart, pay, poll status. Mix read and write paths in realistic ratios (often 80/20 or 90/10 depending on the product).
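
One way to express a read/write mix in a script is weighted random selection over journey steps. The sketch below is tool-agnostic Python. The step names and weights in `JOURNEY_WEIGHTS` are hypothetical and add up to an 80/20 read/write split; derive your own ratios from production analytics.

```python
import random

# Hypothetical weights: browse and search are reads (80%), the rest are writes (20%).
JOURNEY_WEIGHTS = {
    "browse": 60,
    "search": 20,
    "add_to_cart": 12,
    "checkout": 8,
}


def pick_actions(rng: random.Random, count: int) -> list[str]:
    """Draw `count` actions so that each appears in proportion to its weight."""
    names = list(JOURNEY_WEIGHTS)
    weights = list(JOURNEY_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=count)
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when you compare results between builds.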

Think time and pacing

Real users pause between clicks. Scripts should include think time (random sleeps between steps) so you do not accidentally simulate a botnet that fires requests with zero gap; that stress-tests the load generator as much as the server. Ramp up gradually: start at 10% of target RPS, hold, then increase in steps. A sudden jump to full load mixes warm-up effects into the measurements and triggers false alarms on cold JVMs or empty caches.
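
The ramp and the think time can both be written down as small functions before you translate them into your tool's own stage or pacing configuration. This is a sketch under assumed defaults: five linear steps, a two-minute hold per step, and a uniform 1 to 5 second pause. None of these numbers come from a standard; pick values that match your users.

```python
import random


def ramp_steps(
    target_rps: float,
    start_fraction: float = 0.1,
    steps: int = 5,
    hold_seconds: int = 120,
) -> list[tuple[float, int]]:
    """Return (rps, hold_seconds) stages rising linearly from start_fraction * target to target."""
    if steps < 2:
        raise ValueError("need at least two steps to ramp")
    start = target_rps * start_fraction
    increment = (target_rps - start) / (steps - 1)
    return [(round(start + i * increment, 2), hold_seconds) for i in range(steps)]


def think_time(rng: random.Random, low: float = 1.0, high: float = 5.0) -> float:
    """Seconds a virtual user should sleep before the next step (uniform distribution assumed)."""
    return rng.uniform(low, high)
```

For a target of 1000 RPS, `ramp_steps(1000)` yields stages at 100, 325, 550, 775, and 1000 RPS, each held for 120 seconds.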

Test data and auth

Shared test accounts create artificial lock contention. Generate distinct user IDs, session tokens, or wallet addresses per virtual user. Seed databases with enough catalog rows that queries do not always hit the same hot partition. For authenticated APIs, rotate tokens the way production would — not one bearer token reused by ten thousand workers unless that is literally your threat model.
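
As a minimal sketch of per-user test data, the function below builds one distinct identity for each virtual user. The ID format, the `run_id` parameter, and the random placeholder token are assumptions for illustration; in a real test the token would come from your actual login or token-issuing flow.

```python
import uuid


def make_virtual_users(count: int, run_id: str) -> list[dict[str, str]]:
    """Create `count` distinct identities so virtual users do not contend on one account."""
    users = []
    for index in range(count):
        users.append(
            {
                # Including the run ID keeps data from separate runs easy to tell apart.
                "user_id": f"loadtest-{run_id}-{index:06d}",
                # Placeholder only: replace with a token obtained from your auth flow.
                "session_token": uuid.uuid4().hex,
            }
        )
    return users
```

Assign each virtual user its own entry from this list rather than letting workers pick entries at random, so two workers never share an account by accident.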

Environment fidelity

Run against a production-like staging environment: similar instance sizes, same database engine and major version, TLS termination path, and feature flags. Testing a single-container dev stack proves little about load balancer behavior or replica count. Never run destructive stress tests against production without explicit guardrails, traffic shadowing, and executive sign-off — a load test that takes down prod is an incident, not a metric.

Tools: k6, Locust, JMeter, and Gatling

Tool choice matters less than scenario quality and observability integration. Common options:

k6 — JavaScript DSL, excellent CLI and CI integration, built-in thresholds (fail the build if p95 > 500 ms). Strong choice for API and microservice teams.
Locust — Python-based, distributed workers, good for complex custom logic in test scripts. Popular when QA already lives in Python.
JMeter — GUI-heavy, vast plugin ecosystem, steeper CI story. Still common in enterprise QA departments.
Gatling — Scala DSL, strong reporting, good for high-concurrency HTTP scenarios with detailed HTML reports.

Whichever you pick, run generators from multiple machines or cloud regions if you test global traffic — a single laptop caps out on outbound connections and skews results. Coordinate with rate limiting and WAF rules so the test measures your app, not the edge blocking your own IPs.

Reading results: the saturation curve

Plot RPS (or concurrent users) on the x-axis and p95 latency on the y-axis. Healthy systems show a gentle slope until they approach capacity, then a sharp knee where latency and errors accelerate — the saturation point. Your goal is to know where that knee sits relative to planned peak plus headroom (often 30–50% above expected peak for unexpected bursts).
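
A rough way to locate the knee from step-test results is to find the highest load step that still meets your latency target, then express it as headroom over expected peak. This is a simplification: it uses an SLO threshold as a proxy for the knee instead of detecting curvature. The function names and the sample curve are illustrative, not measured data.

```python
from typing import Optional


def last_step_within_slo(
    curve: list[tuple[float, float]], max_p95_ms: float
) -> Optional[float]:
    """Return the highest RPS step before p95 first exceeds the limit, or None if none qualifies.

    `curve` is a list of (rps, p95_ms) points, one per load step.
    """
    last_ok = None
    for rps, p95_ms in sorted(curve):
        if p95_ms > max_p95_ms:
            break  # stop at the first violation; later steps are past the knee
        last_ok = rps
    return last_ok


def headroom(safe_rps: float, expected_peak_rps: float) -> float:
    """Fraction by which safe throughput exceeds expected peak (0.3 means 30%)."""
    return safe_rps / expected_peak_rps - 1


# Illustrative step-test results: (rps, p95 latency in ms).
curve = [(100, 120), (200, 125), (400, 140), (800, 210), (1000, 480), (1200, 2300)]
safe_rps = last_step_within_slo(curve, max_p95_ms=300)  # 800
```

With an expected peak of 600 RPS, `headroom(800, 600)` is about 0.33, which sits inside the 30–50% range mentioned above.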

If latency climbs but CPU stays low, suspect I/O: database, external APIs, or disk. If CPU pegs but latency is fine, you may simply need more replicas — horizontal scale works until shared state (one primary database) becomes the choke point. If errors spike with “connection pool timeout” or “too many open files,” fix pool sizing and OS limits before buying larger machines.

Correlate the test timeline with server metrics and traces. A spike in p99 at minute twelve that aligns with GC pauses or a migration cron job calls for a different fix than adding API servers.

Common mistakes

Testing the cache, not the app — first-run cold caches look terrible; cached-only runs look artificially fast. Warm up, then measure.
Ignoring downstream contracts — hammering Stripe or Solana RPC in a load test burns rate limits and produces errors unrelated to your code.
Single-generator bottleneck — when the load tool itself hits CPU limits, reported RPS plateaus at the generator's ceiling rather than the server's.
No pass/fail thresholds in CI — manual “looks OK” reviews drift. Encode p95 and error-rate gates in the pipeline.
One-shot hero tests — running load only before a big launch without regression baselines makes comparisons meaningless. Store historical results.
Confusing stress with soak — passing a ten-minute stress test does not prove overnight stability.
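
The pass/fail gate from the list above can be as small as one function that a CI step calls with the run's summary numbers. This is a sketch, not any tool's built-in threshold feature. The default of 500 ms for p95 echoes the example threshold used elsewhere in this guide, and the 1% error-rate limit is an assumption; set both from your SLOs.

```python
def evaluate_gates(
    p95_ms: float,
    error_rate: float,
    max_p95_ms: float = 500.0,
    max_error_rate: float = 0.01,
) -> list[str]:
    """Return a list of human-readable gate failures; an empty list means the run passed."""
    failures = []
    if p95_ms > max_p95_ms:
        failures.append(f"p95 latency {p95_ms:.0f} ms exceeds limit {max_p95_ms:.0f} ms")
    if error_rate > max_error_rate:
        failures.append(f"error rate {error_rate:.2%} exceeds limit {max_error_rate:.2%}")
    return failures
```

In a pipeline, print the returned messages and exit with a non-zero status when the list is not empty, so a regression fails the build instead of waiting for a manual review.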

Decision table: which test when

Goal	Test type	Typical duration
Validate Black Friday capacity plan	Baseline load	30–60 minutes at expected peak RPS
Find maximum safe throughput	Stress / breakpoint	Ramp until error budget would burn
Catch memory leaks before prod	Soak	8–72 hours at 60–80% of peak
Simulate Hacker News front page	Spike	0 to max VUs in seconds, hold briefly
Verify deploy did not regress perf	Baseline smoke load	5–10 minutes in CI/staging

Production checklist

Define target RPS and concurrent users from analytics or business forecasts.
Write SLO-aligned thresholds (p95 latency, max error rate) before running tests.
Build user-journey scripts with think time, auth variety, and read/write mix.
Provision staging that mirrors prod topology (replicas, DB size, CDN path).
Warm caches and connection pools; discard warm-up data from reports.
Run generators from distributed nodes; monitor generator CPU and network.
Capture metrics, logs, and traces on the system under test during the run.
Store results over time; fail CI on regression beyond agreed thresholds.
Schedule soak tests before major events; stress tests after large architecture changes.
Document findings and capacity headroom; update autoscaling and pool limits.

Key takeaways

Load, stress, soak, and spike tests answer different questions — use the right one.
p95 and p99 latency under load matter more than average response time.
Realistic scenarios (journeys, think time, varied data) beat single-endpoint hammering.
Find the saturation knee and compare it to peak traffic plus headroom.
Pair tests with observability and SLOs so results drive concrete fixes.