Statistical significance and hypothesis testing explained

A headline reads: "New drug shows statistically significant improvement." A product manager declares: "Variant B won — p = 0.03." Both statements use the same vocabulary but often mean different things, and both are frequently misinterpreted. Hypothesis testing is the formal framework for deciding whether observed data is compatible with a default claim of no effect, and statistical significance is the threshold at which you reject that default. Neither term tells you effect size, practical importance, or probability that your hypothesis is true. This guide covers null and alternative hypotheses, p-values and alpha, Type I and Type II errors, power and sample size, confidence intervals, the tests you will actually run (proportions, means, contingency tables), the multiple-comparisons trap, how this framework differs from Bayesian inference, how it underpins A/B testing and bandit experiments, a worked conversion-rate example, a decision table, common pitfalls, and a practitioner checklist.

What hypothesis testing answers

Classical (frequentist) hypothesis testing answers a narrow question: if there were truly no difference between groups, how often would we see data at least this extreme purely by chance? It does not directly answer "what is the probability my hypothesis is correct?" or "how big is the effect?" Those require additional tools — effect-size estimates, confidence intervals, and often Bayesian posteriors.

Every test starts with two competing statements. The null hypothesis (H₀) is the skeptical default — usually "no difference," "no association," or "the treatment has zero effect." The alternative hypothesis (H₁ or H_a) is what you seek evidence for: a non-zero difference, a correlation, or an improvement above a benchmark. You collect data, compute a test statistic, and ask how surprising that statistic would be if H₀ were true.

Significance is a decision rule, not a discovery. You pre-specify a significance level α (conventionally 0.05). If the computed p-value is at or below α, you reject H₀ and call the result statistically significant. If p > α, you fail to reject H₀, which is not the same as proving the null is true; you may simply lack the power to detect a real but small effect.

p-Values: what they mean and what they do not

The p-value is the probability of observing a test statistic as extreme as (or more extreme than) what you measured, assuming H₀ is true and your model is correct. A p-value of 0.04 means: if there were genuinely no effect, results this lopsided would occur about 4% of the time by random sampling variation alone.
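
For a test statistic that is approximately standard normal under H₀ (an assumption that fits large-sample z-tests, not every test), the two-sided version of this definition can be written as the sketch below. The symbols z_obs (the observed statistic) and Φ (the standard normal CDF) are notation introduced here for illustration.

```latex
p \;=\; \Pr\bigl(|Z| \ge |z_{\mathrm{obs}}| \,\big|\, H_0\bigr)
  \;=\; 2\,\bigl(1 - \Phi(|z_{\mathrm{obs}}|)\bigr)
```

Under that assumption, an observed |z_obs| of about 2.05 corresponds to p ≈ 0.04.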

Common misreadings to avoid:

Not "there is a 4% chance H₀ is true."
Not "there is a 96% chance the treatment works."
Not proof of practical importance — a significant result can describe a tiny effect with a huge sample.
Not a substitute for pre-registration — p-hacking and optional stopping inflate false positives dramatically.

Report p-values alongside confidence intervals and effect sizes. A conversion lift from 10.0% to 10.3% can reach p < 0.001 with a million users and still be economically meaningless; a lift from 2% to 5% might be p = 0.08 with five hundred users but worth pursuing with more data.

Type I and Type II errors, alpha, beta, and power

Hypothesis tests trade two kinds of mistakes:

Type I error (false positive): rejecting H₀ when it is actually true. Controlled by α — if you use α = 0.05, you accept up to a 5% false-positive rate when H₀ is true and assumptions hold.
Type II error (false negative): failing to reject H₀ when a real effect exists. Its probability is β. Power = 1 − β is the chance of detecting an effect of a given size when it is really there.

Power depends on four levers: sample size (larger n increases power), effect size (bigger true differences are easier to detect), significance level (raising α from 0.05 to 0.10 increases power at the cost of more false positives), and measurement variance (noisier metrics need more data). Before running an A/B test, fix your minimum detectable effect (MDE) and compute the required sample size so that power reaches the conventional 80% or higher.
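
The sample-size step can be sketched with the standard normal-approximation formula for a two-sided two-proportion test. The function name and parameters below are hypothetical, and experiment platforms may use slightly different variants (pooled variance, continuity corrections), so treat the result as a planning estimate rather than an exact requirement.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate, e.g. 0.024
    mde_abs:  minimum detectable effect as an absolute difference, e.g. 0.003
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm variances
    n = (z_alpha + z_power) ** 2 * variance / mde_abs ** 2
    return ceil(n)


if __name__ == "__main__":
    # Baseline 2.4%, MDE of 0.3 percentage points, at 80% and 90% power.
    print(sample_size_per_arm(0.024, 0.003))
    print(sample_size_per_arm(0.024, 0.003, power=0.90))
```

With a 2.4% baseline and a 0.3 percentage point MDE, this approximation asks for roughly 43,000 users per arm at 80% power, and more at 90%.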

Decision	H₀ true (no effect)	H₀ false (real effect)
Reject H₀ (significant)	Type I error (rate = α)	Correct (power = 1 − β)
Fail to reject H₀	Correct	Type II error (rate = β)
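
The two columns of this table can be checked empirically. The sketch below simulates many A/B experiments with a pooled two-proportion z-test and counts how often H₀ is rejected. The conversion rates, arm size, and number of runs are arbitrary illustrative choices, not values from this guide.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or only conversions) in both arms
    z = (x_b / n_b - x_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(p_a, p_b, n=400, alpha=0.05, runs=1000, seed=0):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        # Each user converts independently with the arm's true rate.
        x_a = sum(rng.random() < p_a for _ in range(n))
        x_b = sum(rng.random() < p_b for _ in range(n))
        if z_test_p_value(x_a, n, x_b, n) <= alpha:
            rejections += 1
    return rejections / runs


if __name__ == "__main__":
    print("H0 true  (Type I rate):", rejection_rate(0.10, 0.10))
    print("H0 false (power):      ", rejection_rate(0.10, 0.16))
```

When both arms share the same true rate, the rejection rate should land near α; when the rates differ, the same number estimates power for that effect size and sample size.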

Confidence intervals

A 95% confidence interval is a range of parameter values compatible with your data at the chosen confidence level. If you repeated the experiment many times, 95% of such intervals would contain the true parameter — but for any single interval, the true value is either inside or outside; the interval does not have a 95% probability of containing it (that phrasing is Bayesian).

For a difference in conversion rates, the interval might be [+0.2, +1.8] percentage points. If the entire 95% interval lies above zero, the matching two-sided test at α = 0.05 rejects H₀ of no difference. Intervals communicate magnitude and uncertainty in a single range, which is far more actionable than a bare p-value. Pair them with business thresholds: "We need at least +1% absolute lift to justify the engineering cost."
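
As a sketch, assuming large independent samples and the normal (Wald) approximation, the interval for a difference in two conversion rates takes the form below. The hatted symbols are the observed rates in arms A and B, and n_A and n_B are the arm sizes; this notation is introduced here for illustration.

```latex
(\hat{p}_B - \hat{p}_A) \;\pm\; z_{1-\alpha/2}\,
\sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}},
\qquad z_{0.975} \approx 1.96
```

This interval uses the unpooled standard error, while the z-test usually pools the two arms, so the interval and the test can disagree slightly for results right at the boundary.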

Common tests and when to use them

Choose a test based on your outcome type and assumptions:

Scenario	Typical test	Key assumptions
Compare two proportions (click-through, conversion)	Two-proportion z-test or chi-square on 2×2 table	Independent samples, np and n(1−p) roughly ≥ 5 per cell
Compare two means (revenue per user, latency)	Two-sample t-test (Welch if unequal variance)	Approximately normal or large n; watch heavy tails and outliers
Paired before/after on same users	Paired t-test or sign test	Differences (not raw values) are the unit of analysis
Association in contingency table (segment × outcome)	Chi-square test of independence or Fisher exact (small counts)	Independent observations; expected counts not too small
Compare >2 group means	One-way ANOVA, then post-hoc tests with correction	Normality and homogeneity of variance (or robust alternatives)
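
For the 2×2 case, a Pearson chi-square test can be sketched with the standard library alone. The version below applies no continuity correction, assumes every row and column total is positive, and also returns the smallest expected count so the cell-count assumption can be checked. The function name and example counts are hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) on the table

        [[a, b],
         [c, d]]

    Returns (statistic, p_value, smallest expected count).
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, expected)
            stat += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value, min_expected


if __name__ == "__main__":
    # Rows: arm A, arm B. Columns: converted, did not convert.
    print(chi_square_2x2(30, 970, 50, 950))
```

Without a continuity correction, this statistic equals the square of the pooled two-proportion z statistic, which is why the first row of the table lists the two tests as alternatives.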

For skewed metrics like revenue, consider log transforms, trimmed means, or nonparametric tests (Mann-Whitney). For model comparison on held-out labels, use metrics from precision-recall evaluation with bootstrap confidence intervals rather than treating accuracy as a simple proportion without class imbalance context.

Multiple comparisons and p-hacking

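Test twenty metrics at α = 0.05 and you should expect about one false positive (20 × 0.05 = 1) even if nothing changed; if the tests are independent, the chance of at least one is about 64%. That is the multiple comparisons problem. Mitigations include: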

Pre-register one primary metric before the experiment starts — guardrails are monitored but not used to claim victory.
Bonferroni correction: use α/m for m tests (conservative).
Benjamini-Hochberg (FDR): control expected fraction of false discoveries among rejected hypotheses — common in genomics and large dashboards.
Hierarchical testing: test secondary metrics only if the primary is significant.
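
The two corrections in this list can be sketched as follows. The function names and the example p-values are hypothetical; the Benjamini-Hochberg version is the basic step-up procedure, which assumes independent or positively dependent tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("Bonferroni:", bonferroni(pvals))
    print("BH (FDR):  ", benjamini_hochberg(pvals))
```

On these ten illustrative p-values, Bonferroni rejects only the smallest one, while Benjamini-Hochberg rejects the two smallest, which shows why Bonferroni is described as the conservative choice.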

p-hacking — stopping early when p < 0.05, redefining the metric, excluding "bad" days, or running until significance — destroys the meaning of α. Use fixed-horizon analysis or sequential methods with adjusted boundaries. The peeking traps in our A/B testing guide apply directly here.

Frequentist vs Bayesian framing

Frequentist tests treat parameters as fixed unknowns and data as random. Bayesian methods (see Bayesian inference) put distributions on parameters and update them with data, yielding statements like "there is a 92% posterior probability the treatment lift exceeds 1%." Neither camp is universally superior:

Frequentist strengths: standardized regulatory acceptance, clear error-rate control in repeated sampling, simple tooling in experiment platforms.
Bayesian strengths: natural incorporation of prior knowledge, intuitive probability statements, flexible stopping rules via posterior monitoring, Thompson sampling in bandits.

Many product teams use frequentist significance for ship/no-ship gates and Bayesian posteriors for internal prioritization — as long as the team knows which language each decision uses.
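
To make the Bayesian style of statement concrete, here is a minimal sketch that estimates the posterior probability that variant B converts better than A. It assumes independent uniform Beta(1, 1) priors on each rate, which is a modeling choice made for this illustration rather than something the guide prescribes; the counts are borrowed from the checkout example in the next section.

```python
import random


def prob_b_beats_a(x_a, n_a, x_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data).

    Assumes independent Beta(1, 1) (uniform) priors on each conversion rate,
    so each posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + x_a, 1 + n_a - x_a)
        rate_b = rng.betavariate(1 + x_b, 1 + n_b - x_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    print(prob_b_beats_a(1200, 50_000, 1350, 50_000))
```

The output is a direct probability statement about the parameters given the data and the prior, which is a different quantity from a p-value and should not be compared against α.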

Worked example: checkout button color

An e-commerce site tests a green checkout button (B) against the blue control (A). After 50,000 users per arm: A converts 1,200/50,000 = 2.40%; B converts 1,350/50,000 = 2.70%. Absolute lift = +0.30 percentage points; relative lift = +12.5%.

A two-proportion z-test yields z ≈ 3.0 and p ≈ 0.003, significant at α = 0.05. The 95% CI for the difference is roughly [+0.10, +0.50] percentage points. Interpretation:

Reject H₀ of equal conversion: the data are unlikely under no effect.
The interval is still wide: its low end is about a third of the point estimate, so a lift of only +0.1 pp remains plausible; check that the business case holds at the low end, not just at +0.3 pp.
Annualized revenue impact = lift × traffic × AOV — a 0.3 pp lift on 10M annual visitors at $80 AOV is ~$2.4M if sustained; verify on holdout and watch guardrail metrics (refunds, support tickets).

If the same lift appeared with only 2,000 users per arm, p would be far above 0.05 — not because the effect vanished, but because power was insufficient. That is why sample-size planning precedes launch, not post-hoc rationalization.
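
The figures in this example can be rechecked from the raw counts. The sketch below uses a pooled two-proportion z-test and a Wald interval, both normal approximations; the function name is hypothetical, and the 2,000-user scenario uses 48 and 54 conversions to reproduce the same 2.40% and 2.70% rates.

```python
from math import sqrt
from statistics import NormalDist


def compare_conversion(x_a, n_a, x_b, n_b, confidence=0.95):
    """Pooled two-proportion z-test plus a Wald interval for the difference."""
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_b - p_a
    # Test statistic: pooled standard error under H0 of equal rates.
    pooled = (x_a + x_b) / (n_a + n_b)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_test
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Interval: unpooled standard error around the observed difference.
    se_ci = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se_ci, diff + z_crit * se_ci),
    }


full = compare_conversion(1200, 50_000, 1350, 50_000)
small = compare_conversion(48, 2_000, 54, 2_000)  # same rates, 2,000 per arm

# Revenue formula from the text: lift * traffic * AOV.
annual_revenue = full["diff"] * 10_000_000 * 80

if __name__ == "__main__":
    print(full)
    print(small)
    print(round(annual_revenue))
```

Running both scenarios side by side shows the point of the last paragraph: the observed lift is identical, and only the sample size changes the p-value and the width of the interval.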

Decision table: which tool when

Goal	Tool	Watch out for
Ship/no-ship on one pre-registered metric	Fixed-horizon test + CI at α = 0.05	Peeking, changing MDE mid-flight
Estimate how big the effect might be	Confidence interval, Bayesian credible interval	Confusing interval width with probability of truth
Screen many variants continuously	Multi-armed bandit, sequential testing	Comparing bandit winners to fixed-horizon p-values
Compare model classifiers on rare positives	PR-AUC, bootstrap CIs, McNemar on paired errors	Accuracy and ROC alone on imbalanced data
Monitor many dashboard metrics weekly	FDR control, anomaly detection baselines	Declaring victory on any single spike

Common pitfalls

Significance obsession: ignoring effect size and cost of implementation.
NHST ritual: running a test because "that is what we always do" without checking assumptions.
Underpowered studies: "not significant" interpreted as "no effect" when n was too small.
Multiple metrics fishing: celebrating whichever metric crossed p < 0.05.
Ignoring clustering: users within accounts or repeated measures violate independence, which inflates false positives; analyze at the cluster level, use cluster-robust standard errors, or fit mixed models.
Confusing statistical and practical significance: a 0.01% lift can be p < 0.001 at scale but not worth a six-month rewrite.

Practitioner checklist

State H₀ and H₁ in plain language before collecting data.
Pre-register primary metric, MDE, α, target power, and planned sample size.
Verify independence, adequate cell counts, and metric distribution assumptions.
Report effect size, 95% CI, and p-value together — not p alone.
Apply multiple-comparison correction when testing more than one hypothesis.
Replicate surprising wins on holdout traffic or a follow-up experiment.
If you must peek, document the stopping rule in advance and use sequential methods.

Key takeaways

Hypothesis testing asks how surprising your data would be if there were no real effect — not whether your hypothesis is probably true.
Rejecting H₀ when p ≤ α holds the false-positive rate at α only under correct assumptions and an honest analysis plan.
Power and sample size determine whether you can detect effects that matter; underpowered tests waste traffic.
Confidence intervals communicate magnitude and uncertainty better than binary significant/not labels.
Pair frequentist gates with business thresholds; consider Bayesian posteriors when priors and continuous monitoring help.