
A/B testing explained

Product teams argue about button colors, headline copy, and checkout flows because everyone has an opinion and nobody has data. A/B testing (split testing) replaces opinions with evidence: randomly assign users to a control (A) or treatment (B), measure a pre-defined outcome, and decide whether the difference is real or noise. Done rigorously, experiments compound into better conversion, retention, and revenue. Done casually — peeking at dashboards daily, changing the hypothesis mid-flight, ignoring guardrail metrics — they produce false winners that regress in production or quietly damage trust. This guide covers experiment design from hypothesis to ship decision, how sample size and statistical significance interact, when A/B differs from feature flags and canary deploys, and what to verify before you declare B the winner.

What A/B testing actually measures

An A/B test answers one causal question: if we change X, does metric Y move by at least Z? The change X might be a new pricing page, a shorter signup form, a different onboarding tooltip, or a game tutorial skip button. Metric Y is almost always a rate (conversion, click-through, retention at day 7) or a continuous value (average order value, session length). Z is the minimum detectable effect (MDE) — the smallest lift worth detecting given engineering and opportunity cost.

Random assignment is the engine. If users self-select into B because they are power users, you measure selection bias, not treatment effect. Stable randomization — usually hashing a user ID with a salt — keeps the same user in the same bucket across sessions. Server-side assignment is preferred when the experiment touches pricing, paywalls, or anything a client could manipulate.
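As a minimal sketch of stable, salted randomization: the function name, the salt string, and the 50/50 default below are illustrative assumptions, not any particular platform's API. The idea is only that the same user ID and salt always hash to the same bucket.

```python
import hashlib


def assign_variant(user_id: str, experiment_salt: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment).

    The salt is per experiment, so a user's bucket in one experiment
    does not predict their bucket in another.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex characters (32 bits) as a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "B" if bucket < treatment_share else "A"


# Hypothetical usage: the same inputs always give the same variant.
variant = assign_variant("user-12345", "checkout-redesign-2024")
```

Because assignment depends only on the ID and the salt, it can run server-side without storing a lookup table, which suits the pricing and paywall cases mentioned above.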

A/B is not the only experiment shape. Multivariate tests vary several factors at once (headline AND image AND CTA) but need far more traffic because combinations multiply. Holdout groups keep a small percentage on control long after launch to measure cumulative lift. Switchback tests alternate treatments by time window — useful for marketplace or logistics when user-level randomization is impractical.
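To see why multivariate tests need so much more traffic, it can help to count the cells. The factor values and weekly traffic figure below are made up for illustration.

```python
from itertools import product

# Hypothetical factors for a multivariate test.
headlines = ["headline_1", "headline_2", "headline_3"]
images = ["image_1", "image_2"]
ctas = ["cta_1", "cta_2"]

# Every combination of factor levels is its own variant ("cell").
cells = list(product(headlines, images, ctas))

weekly_traffic = 60_000  # assumed eligible users per week
users_per_cell = weekly_traffic // len(cells)
users_per_arm_ab = weekly_traffic // 2

print(f"{len(cells)} cells, about {users_per_cell} users per cell per week")
print(f"A plain A/B test would get about {users_per_arm_ab} users per arm per week")
```

With three factors the same traffic is split twelve ways instead of two, so each cell accumulates users six times more slowly than an A/B arm would.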

Hypothesis design: write it before you code

A usable hypothesis has four parts: audience, change, expected direction, and primary metric. Example: "For new mobile visitors, replacing the hero video with a static image will increase signup conversion because autoplay video increases bounce on slow networks." Vague goals like "make checkout better" fail because nobody agrees what success means when results arrive.

Metric tiers

  • Primary metric — the one metric the experiment is powered to detect. Pick exactly one; changing it after seeing data invalidates p-values.
  • Guardrail metrics — must not regress. Revenue per user, error rate, page load time, support ticket volume. A signup lift that doubles churn is not a win.
  • Secondary / diagnostic metrics — explain why the primary moved (scroll depth, time on step 2). Inform the next iteration; do not use them to claim victory without correction for multiple comparisons.
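One way to make the hypothesis and metric tiers hard to change after launch is to write them down as an immutable record before any code ships. The sketch below is one possible shape; the class name, field names, and metric names are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ExperimentSpec:
    audience: str
    change: str
    expected_direction: str  # "increase" or "decrease"
    primary_metric: str      # exactly one
    guardrail_metrics: tuple[str, ...] = ()
    secondary_metrics: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.expected_direction not in ("increase", "decrease"):
            raise ValueError("expected_direction must be 'increase' or 'decrease'")
        if self.primary_metric in self.guardrail_metrics + self.secondary_metrics:
            raise ValueError("primary metric must not also be a guardrail or secondary metric")


# Hypothetical spec matching the example hypothesis above.
spec = ExperimentSpec(
    audience="new mobile visitors",
    change="replace hero video with static image",
    expected_direction="increase",
    primary_metric="signup_conversion",
    guardrail_metrics=("revenue_per_user", "error_rate", "page_load_time"),
    secondary_metrics=("scroll_depth", "time_on_step_2"),
)
```

Storing this record alongside the experiment ID gives reviewers something concrete to compare the final analysis against.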

Pair quantitative metrics with observability dashboards so guardrails surface automatically. For content and SEO experiments, tie primary metrics to business outcomes (qualified leads, purchases) not vanity clicks — see SEO fundamentals for why ranking alone is a weak primary metric.

Sample size, power, and how long to run

Before launch, estimate how many users each variant needs. Inputs: baseline conversion rate, MDE (e.g. a relative 5% lift from 4.0% to 4.2%), significance level (usually alpha = 0.05, i.e. 95% confidence), and statistical power (usually 80%, the probability of detecting a real effect at least as large as the MDE). Online calculators and statistical libraries (statsmodels in Python, pwr in R) produce per-arm sample sizes; double the per-arm figure to get the total for a 50/50 A/B test.
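The calculation can be sketched with the standard normal-approximation formula for comparing two proportions, using only the Python standard library. The function name is an assumption, and dedicated libraries may return slightly different numbers because they use other effect-size definitions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH arm for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# The example from the text: 4.0% baseline, 5% relative lift (to 4.2%).
n_per_arm = sample_size_per_arm(baseline=0.04, relative_lift=0.05)
n_total = 2 * n_per_arm  # 50/50 split
print(f"per arm: {n_per_arm}, total: {n_total}")
```

Under these assumptions the example works out to roughly 150,000 users per arm, which shows how quickly small relative lifts on low baselines become expensive to detect.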

Low-traffic products face hard tradeoffs. Detecting a 1% relative lift on a 2% baseline (2.00% to 2.02%) takes millions of users per arm at alpha = 0.05 and 80% power. Either accept a larger MDE, run longer (watch seasonality), or use Bayesian methods with explicit priors. Do not pretend a 200-user test can detect a 0.1 point conversion bump.

Runtime rules

  • Pre-register duration — end at N users or calendar date chosen before peeking.
  • Full weeks — include weekday/weekend cycles; B2B and consumer patterns differ.
  • No mid-flight edits — changing B's copy after day 3 mixes treatments and voids analysis.
  • Exclude contaminated users — bots, employees, users who saw both variants due to cookie loss.
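The pre-registered duration can be derived from the sample size and rounded up to whole weeks. This is a small sketch with assumed traffic numbers; the function name is hypothetical.

```python
from math import ceil


def weeks_to_run(per_arm_n: int, arms: int, weekly_eligible_users: int) -> int:
    """Whole weeks needed to reach the target sample, never less than one."""
    total_needed = per_arm_n * arms
    return max(1, ceil(total_needed / weekly_eligible_users))


# Assumed inputs: 150,000 users per arm, two arms, 80,000 eligible users per week.
planned_weeks = weeks_to_run(per_arm_n=150_000, arms=2, weekly_eligible_users=80_000)
print(f"Pre-register a run of {planned_weeks} full weeks")
```

Rounding up rather than stopping mid-week keeps every weekday and weekend represented equally in both arms.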

Statistical significance without fooling yourself

After the pre-registered stop point, compare variants. For conversion rates, a two-proportion z-test or chi-square test is standard. The resulting p-value answers: if A and B were truly identical, how often would we see a gap at least this large by chance? p < 0.05 is convention, not magic: it means that when there is truly no difference, about 1 in 20 tests will still come out "significant". It is not the probability that B is better.
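A two-proportion z-test is short enough to write out by hand, which makes the assumptions visible. The sketch below uses the pooled standard error and the normal approximation; the conversion counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both arms share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical result: A converts 400 of 10,000, B converts 460 of 10,000.
z, p_value = two_proportion_z_test(400, 10_000, 460_000 // 1000, 10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The normal approximation is reasonable when each arm has many conversions and non-conversions; for very small counts an exact test is a safer choice.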

Report confidence intervals on the lift, not just p-values. "B improved signup by 0.8 percentage points (95% CI: 0.2 to 1.4)" communicates effect size and uncertainty. A statistically significant 0.05 point lift on a low-margin funnel may not justify maintenance of two code paths.
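A confidence interval on the absolute difference can be sketched with the unpooled (Wald) standard error. The function name and the counts are illustrative assumptions; other interval methods exist and may behave better at small sample sizes.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Wald interval for (rate_B - rate_A), as proportions rather than percent."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Unpooled standard error: each arm keeps its own variance.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical result: A converts 400 of 10,000, B converts 460 of 10,000.
low, high = lift_confidence_interval(400, 10_000, 460, 10_000)
print(f"lift: {100 * (0.046 - 0.040):.1f} points "
      f"(95% CI: {100 * low:.2f} to {100 * high:.2f} points)")
```

An interval whose lower end sits just above zero, as in this example, is statistically significant but leaves open that the true lift is too small to matter.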

The peeking problem

Checking results every morning and stopping when p < 0.05 inflates false positives dramatically. If you must monitor for harm (guardrails), use sequential testing methods or fixed peeking schedules with alpha spending corrections — or treat early looks as operational only, not ship decisions. Many teams use experimentation platforms (Optimizely, Statsig, GrowthBook, Eppo) that implement valid sequential or Bayesian stopping rules.
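A quick simulation can show how much peeking inflates false positives. The sketch below simulates A/A tests (no true difference) under a normal approximation: with equal-sized daily batches, the cumulative z-statistic after k looks behaves like a sum of k independent standard normal increments divided by sqrt(k). The number of looks, simulations, and the seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(n_sims: int = 2000, looks: int = 20,
                         alpha: float = 0.05, seed: int = 7) -> tuple[float, float]:
    """Return (rate when stopping at the first significant look, rate at final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one batch of null data
            z = running_sum / sqrt(k)           # cumulative z-statistic at look k
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        final_hits += abs(z) > z_crit           # decision made once, at the end
    return peeking_hits / n_sims, final_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first p < 0.05: {peeking_rate:.1%} false positives")
print(f"single look at the end: {fixed_rate:.1%} false positives")
```

The single-look rate stays near the nominal 5%, while the stop-when-significant rate is several times higher, even though nothing differs between the arms.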

Multiple comparisons and Simpson's paradox

Testing five independent metrics at alpha 0.05 gives roughly a 23% chance of at least one false positive (1 - 0.95^5 ≈ 0.23). Pre-specify primaries; apply Bonferroni or false-discovery-rate control on exploratory cuts. Segment analysis (mobile only, US only) multiplies the problem: segment only if powered, or treat segments as hypothesis-generating.
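The arithmetic and the two corrections can be sketched as follows. The p-values are invented, the function names are assumptions, and the family-wise formula assumes the tests are independent.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values: list[float], fdr: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure for false-discovery-rate control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * fdr.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            max_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= max_rank
    return reject


# Hypothetical p-values for five exploratory metrics.
p_values = [0.004, 0.03, 0.02, 0.40, 0.012]
print(f"FWER for 5 tests: {family_wise_error_rate(5):.3f}")
print("Bonferroni:", bonferroni_reject(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg_reject(p_values))
```

Bonferroni is the stricter of the two and suits a short list of confirmatory metrics; false-discovery-rate control trades some false positives for more power on long exploratory lists.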

Simpson's paradox: B wins overall but loses in every segment because traffic mix shifted. Always check segment consistency and use stratified randomization when key cohorts are small.
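A small numeric example makes the paradox concrete. The counts below are invented so that B's users skew toward the high-converting desktop segment, which is the kind of traffic-mix imbalance described above.

```python
# (conversions, users) per variant in each segment; all numbers are hypothetical.
segments = {
    "mobile": {"A": (80, 4000), "B": (18, 1000)},
    "desktop": {"A": (100, 1000), "B": (380, 4000)},
}


def rate(conversions: int, users: int) -> float:
    return conversions / users


# Per-segment comparison: A beats B in both segments.
segment_rates = {
    name: {variant: rate(*counts) for variant, counts in arms.items()}
    for name, arms in segments.items()
}

# Aggregate comparison: sum conversions and users across segments.
overall_rates = {}
for variant in ("A", "B"):
    total_conv = sum(arms[variant][0] for arms in segments.values())
    total_users = sum(arms[variant][1] for arms in segments.values())
    overall_rates[variant] = rate(total_conv, total_users)

for name, rates in segment_rates.items():
    print(f"{name}: A = {rates['A']:.1%}, B = {rates['B']:.1%}")
print(f"overall: A = {overall_rates['A']:.2%}, B = {overall_rates['B']:.2%}")
```

B looks far better in aggregate only because most of its users came from the segment that converts well regardless of variant, so a segment-mix check should precede any rollout decision.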

Implementation: flags, events, and analysis

Most products implement A/B via feature flags plus an analytics pipeline. Flow: assign the variant at first exposure; log experiment_id, variant, and user_id to your warehouse; then join exposure events to outcome events (purchase, signup) with clear attribution windows.

  • Exposure logging — record when the user actually saw B, not when the flag evaluated true server-side on an API they never hit.
  • Intent-to-treat vs per-protocol — ITT analyzes all assigned users (conservative, recommended); per-protocol drops non-exposed users (optimistic, easy to bias).
  • Sticky assignment — same variant across devices requires logged-in ID, not ephemeral cookies alone.
  • Mutual exclusion — users in checkout experiment should not simultaneously land in unrelated pricing test without layer configuration.
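The exposure-to-outcome join can be sketched in plain Python. In practice this usually runs as a warehouse query; the event field names, the 7-day attribution window, and the sample events below are all assumptions for illustration.

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(days=7)  # assumed; choose and document your own

# Hypothetical exposure and outcome events.
exposures = [
    {"experiment_id": "exp_1", "user_id": "u1", "variant": "A", "exposed_at": datetime(2024, 1, 1, 9)},
    {"experiment_id": "exp_1", "user_id": "u2", "variant": "B", "exposed_at": datetime(2024, 1, 1, 10)},
    {"experiment_id": "exp_1", "user_id": "u3", "variant": "B", "exposed_at": datetime(2024, 1, 2, 8)},
    {"experiment_id": "exp_1", "user_id": "u4", "variant": "A", "exposed_at": datetime(2024, 1, 2, 9)},
]
outcomes = [
    {"user_id": "u1", "event": "purchase", "occurred_at": datetime(2024, 1, 3, 12)},
    {"user_id": "u2", "event": "purchase", "occurred_at": datetime(2024, 1, 2, 12)},
    {"user_id": "u3", "event": "purchase", "occurred_at": datetime(2024, 1, 20, 12)},  # outside window
    {"user_id": "u5", "event": "purchase", "occurred_at": datetime(2024, 1, 2, 12)},   # never exposed
]


def conversions_by_variant(exposures, outcomes, window=ATTRIBUTION_WINDOW):
    """Count users and in-window conversions per variant, keyed on first exposure."""
    first_exposure = {}
    for event in sorted(exposures, key=lambda e: e["exposed_at"]):
        first_exposure.setdefault(event["user_id"], event)

    converted = set()
    for outcome in outcomes:
        exposure = first_exposure.get(outcome["user_id"])
        if exposure is None:
            continue  # outcome from a user who was never in the experiment
        elapsed = outcome["occurred_at"] - exposure["exposed_at"]
        if timedelta(0) <= elapsed <= window:
            converted.add(outcome["user_id"])

    summary = {}
    for user_id, exposure in first_exposure.items():
        stats = summary.setdefault(exposure["variant"], {"users": 0, "conversions": 0})
        stats["users"] += 1  # every exposed user stays in the denominator
        stats["conversions"] += user_id in converted
    return summary


summary = conversions_by_variant(exposures, outcomes)
print(summary)
```

Every exposed user is counted in the denominator whether or not they converted, and outcomes before exposure or after the window are ignored, which keeps the comparison between arms symmetric.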

For infrastructure changes (new API version, caching layer), pair canary deploys with error-rate guardrails rather than user-facing conversion alone. Game studios often run live ops experiments on economy knobs — see game analytics and retention for cohort metrics that matter more than single-session spikes.

A/B vs other "testing" you already do

| Practice | Question it answers | Typical owner |
| --- | --- | --- |
| Unit / integration tests | Does code behave correctly under known inputs? | Engineering |
| Canary / blue-green deploy | Does new build crash or raise error rates? | Platform / SRE |
| Feature flag rollout | Can we limit blast radius and kill quickly? | Engineering / ops |
| A/B test | Does change improve a business metric causally? | Product / growth / data |
| User research / usability | Why do users struggle? What do they intend? | Design / research |

These layers stack. Ship behind a flag, canary for stability, A/B for lift proof, usability sessions for qualitative why. None replaces the others.

When to run an A/B test (decision table)

| Situation | A/B test? | Alternative or note |
| --- | --- | --- |
| High-traffic funnel change with reversible implementation | Yes | Pre-register metrics and MDE |
| < 1k weekly exposed users, small expected lift | Rarely (underpowered) | Qualitative research, bigger swing, or pooled holdout |
| Legal, accessibility, or security fix | No, ship | Monitor guardrails post-launch |
| Network effects (marketplace liquidity) | Careful: user-level randomization may bias | Geo or switchback designs |
| Long lag outcome (annual retention) | Proxy metrics + long holdout | Survival analysis, pre-registered surrogates |
| Many simultaneous UI tweaks | Multivariate or sequential A/Bs | Factorial design if traffic allows |

Common mistakes

  • Peeking and early stop — the most common source of false winners; pre-register or use sequential methods.
  • Changing primary metric post hoc — "signup didn't move but clicks did" is p-hacking unless clicks were pre-registered.
  • Ignoring novelty and seasonality — B wins week one because it is new, then reverts; run through a full cycle.
  • Underpowered tests declared "no effect" — absence of significance is not evidence of absence; check CI width.
  • Simpson's paradox in segments — aggregate win hides segment losses; investigate before global rollout.
  • Contaminated assignment — CDN cache serves A HTML with B API; test end-to-end consistency.
  • No guardrails — conversion up, revenue down because discount visibility changed.
  • Forever experiments — ship winner, remove dead code, document learnings; flag debt hurts velocity.

Production checklist

  • Written hypothesis with audience, change, direction, and single primary metric.
  • Guardrail and secondary metrics defined before launch.
  • Sample size calculated for target MDE, alpha, and power.
  • Randomization unit and stickiness documented (user ID, device, account).
  • Exposure and outcome events logged to warehouse with experiment metadata.
  • Pre-registered end date or sample count; peeking policy documented.
  • Segment analysis plan limited to pre-specified cohorts.
  • Ship/kill decision includes confidence interval on business impact, not p-value alone.
  • Winner rolled out; loser code removed; results archived for institutional memory.

Key takeaways

  • A/B testing estimates causal lift by randomly assigning users to control and treatment and comparing a pre-defined primary metric.
  • Design beats tooling — hypothesis, MDE, sample size, and guardrails matter more than which platform you buy.
  • Peeking inflates false positives — pre-register stopping rules or use sequential/Bayesian methods honestly.
  • Feature flags deploy variants; A/B analysis proves whether the variant should become default.
  • Ship decisions need effect size — statistical significance without practical significance wastes engineering time.
