
Causal inference explained

Users who receive your promotional email purchase twice as often as users who do not. Does the email cause the lift, or did you target customers who were already likely to buy? That distinction is the heart of causal inference: estimating what would happen if you intervened (sent the email, changed a price, shipped a feature) rather than merely observing correlations. Causal thinking sits beneath A/B testing, marketing attribution, policy evaluation, and modern machine learning systems that optimize for uplift rather than clicks.

This guide covers: correlation versus causation; directed acyclic graphs (DAGs), confounding, and collider bias; the average treatment effect (ATE); randomized experiments as the gold standard; observational methods (regression adjustment, matching, propensity scores); quasi-experiments (difference-in-differences, regression discontinuity, instrumental variables); causal ML meta-learners; a worked email-campaign example; a method-selection decision table; common pitfalls; and a practitioner checklist. For the statistical machinery underneath hypothesis tests, see our hypothesis testing guide.

Correlation is not causation — what causality means

A causal effect answers a counterfactual question: if we had changed treatment status for the same unit, holding everything else fixed, how would the outcome differ? Formally, for unit i, the individual treatment effect is Yi(1) − Yi(0) — outcome under treated minus outcome under control. You never observe both for the same person at the same time; that is the fundamental problem of causal inference.

Correlation measures association in observed data. Causation requires an identification strategy — a credible reason to believe that adjusting for the right variables (or randomizing) isolates the effect of treatment on outcome. Without that, predictive models may fit beautifully yet recommend actions that fail in production because they learned spurious patterns.

The estimand most teams report is the average treatment effect (ATE): the mean difference in outcomes if everyone received treatment versus if everyone received control. Related quantities include the conditional ATE (CATE) — effect for a subgroup — which powers personalized targeting and uplift models.
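To make the notation concrete, here is a minimal sketch that assumes we could see both potential outcomes for every unit, which real data never allows. The four units and their values are hypothetical.

```python
# Hypothetical potential outcomes for four units, as (Y(0), Y(1)).
# In real data only one of the two is ever observed per unit.
potential_outcomes = [(0, 1), (1, 1), (0, 0), (0, 1)]

def individual_effects(po):
    """Individual treatment effect Y_i(1) - Y_i(0) for each unit."""
    return [y1 - y0 for y0, y1 in po]

def average_treatment_effect(po):
    """ATE: mean of the individual effects."""
    effects = individual_effects(po)
    return sum(effects) / len(effects)

def observed_outcomes(po, treatment):
    """Reveal only the outcome that matches each unit's treatment status."""
    return [y1 if t == 1 else y0 for (y0, y1), t in zip(po, treatment)]

ate = average_treatment_effect(potential_outcomes)            # 0.5
observed = observed_outcomes(potential_outcomes, [1, 0, 1, 0])  # [1, 1, 0, 0]
```

The `observed_outcomes` helper is the fundamental problem in miniature: once treatment is assigned, half of each pair disappears, so the ATE must be estimated rather than computed.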

Causal diagrams: DAGs, confounders, and colliders

A directed acyclic graph (DAG) sketches variables as nodes and causal arrows as edges. Drawing the DAG before analyzing data forces you to name what influences what — and which paths create bias.

Confounders

A confounder affects both treatment and outcome. In the email example, past purchase frequency may cause both receiving the email (marketers target loyal buyers) and future purchases. A naive comparison of email recipients vs non-recipients overstates the email effect because it conflates targeting with impact. Fix: block the confounder — condition on it, stratify, match, or include it in a regression model.
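A small sketch of stratifying on a confounder, using hypothetical counts in which loyal buyers are both more likely to be emailed and more likely to purchase. The segment names and numbers are illustrative assumptions, not data from a real campaign.

```python
# Hypothetical counts per segment: (n_users, n_purchases) by treatment arm.
strata = {
    "loyal":      {"treated": (800, 160), "control": (200, 36)},
    "occasional": {"treated": (200, 10),  "control": (800, 32)},
}

def naive_difference(strata):
    """Pooled purchase rate of treated minus control, ignoring the confounder."""
    t_n = sum(s["treated"][0] for s in strata.values())
    t_y = sum(s["treated"][1] for s in strata.values())
    c_n = sum(s["control"][0] for s in strata.values())
    c_y = sum(s["control"][1] for s in strata.values())
    return t_y / t_n - c_y / c_n

def stratified_difference(strata):
    """Within-segment differences, averaged with segment-size weights."""
    total = sum(s["treated"][0] + s["control"][0] for s in strata.values())
    estimate = 0.0
    for s in strata.values():
        (t_n, t_y), (c_n, c_y) = s["treated"], s["control"]
        weight = (t_n + c_n) / total
        estimate += weight * (t_y / t_n - c_y / c_n)
    return estimate

naive = naive_difference(strata)          # about 0.102 (10.2 pp)
adjusted = stratified_difference(strata)  # about 0.015 (1.5 pp)
```

In this toy data most of the naive gap comes from who was targeted, not from the email itself. Stratification only removes bias from confounders you have measured and included.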

Colliders and selection bias

A collider is a common effect of two variables, here one on the treatment side and one on the outcome side. Conditioning on a collider opens a spurious path between its causes. Example: restricting analysis to “users who completed onboarding” when both email exposure and purchase intent affect completion. Collider bias explains why post-hoc slices in dashboards can distort or even invert true effects.
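The onboarding example can be sketched with a hypothetical population in which the email has no effect at all and purchases depend only on intent. The counts and the deterministic completion rule are assumptions chosen to keep the arithmetic visible.

```python
# Hypothetical cells: (email, intent, n_users, n_purchases).
# Email and intent are independent; purchase rate is 40% with intent, 10% without.
cells = [
    (1, 1, 250, 100),
    (1, 0, 250, 25),
    (0, 1, 250, 100),
    (0, 0, 250, 25),
]

def completed_onboarding(email, intent):
    """Assumed collider: either the email or prior intent leads to completion."""
    return email == 1 or intent == 1

def purchase_rate(rows):
    users = sum(n for _, _, n, _ in rows)
    purchases = sum(p for _, _, _, p in rows)
    return purchases / users

def email_difference(rows):
    """Purchase rate of emailed users minus non-emailed users."""
    treated = [r for r in rows if r[0] == 1]
    control = [r for r in rows if r[0] == 0]
    return purchase_rate(treated) - purchase_rate(control)

overall = email_difference(cells)  # 0.0: no effect in the full population
completers = [r for r in cells if completed_onboarding(r[0], r[1])]
among_completers = email_difference(completers)  # about -0.15: spurious harm
```

Filtering to completers removes the low-intent, non-emailed users, so the remaining control group is all high intent and the email looks harmful even though it does nothing.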

Mediators and total vs direct effects

A mediator lies on the causal path from treatment to outcome (email → site visit → purchase). Controlling for mediators blocks part of the effect you may want to measure. Decide upfront whether you need the total effect (policy relevance) or direct effect (mechanism study).
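A sketch of over-controlling, assuming a hypothetical world where the email works only by driving site visits. Counts are illustrative, and the example assumes nothing else confounds the visit-to-purchase link.

```python
# Hypothetical cells: (email, visited_site, n_users, n_purchases).
# Email raises the visit rate from 20% to 50%; purchases depend only on visiting.
cells = [
    (1, 1, 500, 150),
    (1, 0, 500, 10),
    (0, 1, 200, 60),
    (0, 0, 800, 16),
]

def rate(rows):
    return sum(r[3] for r in rows) / sum(r[2] for r in rows)

def total_effect(cells):
    """Emailed minus non-emailed purchase rate, mediator left alone."""
    return rate([r for r in cells if r[0] == 1]) - rate([r for r in cells if r[0] == 0])

def effect_holding_visit_fixed(cells):
    """Email difference within each visit stratum, weighted by stratum size."""
    total = sum(r[2] for r in cells)
    estimate = 0.0
    for visit in (0, 1):
        stratum = [r for r in cells if r[1] == visit]
        weight = sum(r[2] for r in stratum) / total
        estimate += weight * total_effect(stratum)
    return estimate

total = total_effect(cells)                 # about 0.084 (8.4 pp)
direct = effect_holding_visit_fixed(cells)  # 0.0: the visit path is blocked
```

A launch decision would care about the 8.4 pp total effect; adjusting for visits would wrongly suggest the email does nothing.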

Gold standard: randomized controlled experiments

Random assignment makes treatment independent of potential outcomes and of all confounders, observed or unobserved. That is why A/B tests work: if allocation is truly random and exposure is measured correctly, the difference in mean outcomes estimates the ATE without modeling covariates.

  • Intent-to-treat (ITT) compares groups as randomized, even if some treated users never saw the email. ITT is policy-relevant (“what happens if we launch this campaign?”).
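  • Complier average causal effect (CACE) estimates the effect among users who take the treatment when assigned to it. Estimating it requires careful handling of non-compliance (typically by using assignment as an instrument); a naive per-protocol comparison of users who actually received treatment breaks randomization and reintroduces selection bias.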
  • Power and duration follow the same sample-size logic as hypothesis testing — underpowered tests yield inconclusive results, not “no effect.”
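An intent-to-treat comparison can be sketched as a difference in conversion rates with a normal-approximation confidence interval. The counts are hypothetical, and the Wald interval is one common choice among several.

```python
from math import sqrt
from statistics import NormalDist

def difference_in_proportions(treated_n, treated_conv, control_n, control_conv, level=0.95):
    """ITT estimate: difference in conversion rates with a Wald interval."""
    p_t = treated_conv / treated_n
    p_c = control_conv / control_n
    diff = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / treated_n + p_c * (1 - p_c) / control_n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return diff, se, (diff - z * se, diff + z * se)

# Hypothetical experiment: 1,000 users per arm, compared as randomized.
diff, se, (lower, upper) = difference_in_proportions(1000, 120, 1000, 100)
inconclusive = lower < 0 < upper
```

Here the point estimate is +2 pp but the interval spans zero, which illustrates the last bullet: with this sample size the test is inconclusive rather than evidence of no effect.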

When you can randomize, do. Observational methods are for when ethics, cost, or physics forbid it — not because regression is easier.

Observational methods when you cannot randomize

Regression adjustment

Include confounders in a model: Y = β₀ + β₁T + β₂X + ε. If the model is correctly specified and all confounders are measured, β₁ estimates the ATE. In practice the functional form matters: a misspecified outcome model can bias β₁, so analysts often add a propensity model and combine the two in a doubly robust estimator.
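A minimal sketch of regression adjustment using only the standard library, with ordinary least squares solved through the normal equations. The six rows are hypothetical and noise-free, generated from Y = 1 + 2T + 3X so the recovered coefficients can be checked by eye.

```python
def ols(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Hypothetical data: treated units have higher X, so X confounds the comparison.
treatment = [1, 1, 1, 0, 0, 0]
covariate = [2, 3, 4, 0, 1, 2]
outcome = [1 + 2 * t + 3 * x for t, x in zip(treatment, covariate)]

design = [(1, t, x) for t, x in zip(treatment, covariate)]  # intercept, T, X
b0, b1, b2 = ols(design, outcome)

treated_mean = sum(y for y, t in zip(outcome, treatment) if t == 1) / 3
control_mean = sum(y for y, t in zip(outcome, treatment) if t == 0) / 3
naive = treated_mean - control_mean  # 8.0, inflated by the difference in X
```

The adjusted coefficient `b1` recovers the true effect of 2, while the naive gap is 8. This works here because the linear model is exactly right; with real data the answer is only as good as the specification and the set of measured confounders.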

Matching and propensity scores

The propensity score is the probability of treatment given covariates: e(X) = P(T=1 | X). Match treated and control units with similar scores, or weight units inversely by propensity. This balances observed confounders but cannot fix unmeasured ones — sensitivity analysis is mandatory.
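Nearest-neighbor matching on the propensity score can be sketched as follows, assuming the scores have already been estimated. The scores and spend values are hypothetical, and matching is done with replacement for simplicity. Note that matching treated units to controls estimates the effect on the treated (ATT), which can differ from the ATE.

```python
def nearest_neighbor_att(treated, controls):
    """treated, controls: lists of (propensity_score, outcome).

    Each treated unit is paired with the control whose score is closest
    (with replacement); the result is the mean matched difference.
    """
    diffs = []
    for score, outcome in treated:
        _, match_outcome = min(controls, key=lambda c: abs(c[0] - score))
        diffs.append(outcome - match_outcome)
    return sum(diffs) / len(diffs)

# Hypothetical units: (estimated propensity score, spend).
treated = [(0.80, 12.0), (0.60, 9.0), (0.70, 10.0)]
controls = [(0.75, 10.0), (0.55, 8.0), (0.30, 4.0), (0.20, 3.0)]

att = nearest_neighbor_att(treated, controls)  # 1.0

naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in controls) / len(controls))  # about 4.08
```

The low-score controls never get matched, which is the point: they are not comparable to anyone who was treated. A production version would usually add a caliper (a maximum allowed score distance) and check balance after matching.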

Inverse probability weighting (IPW)

Weight each unit by 1/e(X) for treated and 1/(1−e(X)) for controls. IPW reweights the sample toward a pseudo-randomized population. Combine with outcome regression for doubly robust estimation — consistent if either the propensity or outcome model is correct.
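The weighting scheme can be sketched with a single discrete covariate, where the propensity score is simply the treated share within each segment. The counts are hypothetical; with continuous covariates you would fit a model such as logistic regression instead.

```python
# Hypothetical cells: (segment, treated, n_users, n_purchases).
cells = [
    ("loyal", 1, 800, 160),
    ("loyal", 0, 200, 36),
    ("occasional", 1, 200, 10),
    ("occasional", 0, 800, 32),
]

def propensity_by_segment(cells):
    """e(X): share of users treated within each segment."""
    treated, total = {}, {}
    for segment, t, n, _ in cells:
        total[segment] = total.get(segment, 0) + n
        treated[segment] = treated.get(segment, 0) + (n if t == 1 else 0)
    return {s: treated[s] / total[s] for s in total}

def ipw_ate(cells):
    """Horvitz-Thompson style estimate: weight 1/e for treated, 1/(1-e) for controls."""
    e = propensity_by_segment(cells)
    n_total = sum(n for _, _, n, _ in cells)
    treated_sum = sum(y / e[s] for s, t, _, y in cells if t == 1)
    control_sum = sum(y / (1 - e[s]) for s, t, _, y in cells if t == 0)
    return treated_sum / n_total - control_sum / n_total

scores = propensity_by_segment(cells)  # {"loyal": 0.8, "occasional": 0.2}
ate = ipw_ate(cells)                   # about 0.015 (1.5 pp)
```

Weights grow large as e(X) approaches 0 or 1, so in practice analysts inspect the score distribution and often trim or stabilize extreme weights.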

Difference-in-differences (DiD)

Compare the change over time in a treated group with the change in a control group whose outcome would have followed a parallel trend absent treatment. Common in policy and pricing rollouts. The parallel-trends assumption cannot be verified directly: test pre-period trends and document shocks that hit only one group.
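The basic two-group, two-period estimate is a subtraction of subtractions. The group means below are hypothetical.

```python
def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical average weekly orders per store, before and after a price change.
did = difference_in_differences(
    treated_pre=100, treated_post=130,
    control_pre=90, control_post=105,
)  # (130 - 100) - (105 - 90) = 15
```

The control group's change of 15 stands in for what the treated group would have done anyway, which is exactly what the parallel-trends assumption asserts.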

Regression discontinuity (RDD)

Treatment assigned by a cutoff (credit score ≥ 700 gets a loan). Compare units just above vs just below the threshold. Local randomization near the cutoff can identify a causal effect for marginal cases — not necessarily for the full population.
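A deliberately simple sketch of the idea: compare mean outcomes in a window on each side of the cutoff. The scores and outcomes are hypothetical, and a plain difference in means is a simplification; applied work more often fits local linear regressions on each side.

```python
from statistics import mean

def rdd_local_difference(rows, cutoff, bandwidth):
    """rows: (running_variable, outcome).

    Mean outcome just at or above the cutoff minus mean outcome just below it.
    """
    above = [y for x, y in rows if cutoff <= x <= cutoff + bandwidth]
    below = [y for x, y in rows if cutoff - bandwidth <= x < cutoff]
    if not above or not below:
        raise ValueError("no observations on one side of the cutoff within the bandwidth")
    return mean(above) - mean(below)

# Hypothetical (credit score, monthly spend) pairs; a loan is granted at 700 or above.
rows = [(650, 40), (690, 50), (695, 52), (699, 54),
        (700, 60), (705, 62), (710, 64), (760, 80)]

wide = rdd_local_difference(rows, cutoff=700, bandwidth=10)   # 62 - 52 = 10
narrow = rdd_local_difference(rows, cutoff=700, bandwidth=1)  # 60 - 54 = 6
```

The two bandwidths disagree because spend rises with score on both sides of the cutoff, so a wide window mixes the underlying slope into the jump. Reporting estimates across several bandwidths is a common robustness check.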

Instrumental variables (IV)

An instrument affects treatment but influences outcome only through treatment (e.g., random encouragement to open an app). IV estimates a local average treatment effect (LATE) for compliers — not the ATE for everyone. Weak instruments produce unstable estimates.
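With a binary instrument, the IV estimate reduces to the Wald ratio: the instrument's effect on the outcome divided by its effect on treatment uptake. The counts and the `min_first_stage` threshold below are illustrative assumptions.

```python
def wald_late(z1_n, z1_treated, z1_outcomes, z0_n, z0_treated, z0_outcomes,
              min_first_stage=0.05):
    """Wald estimator for a binary instrument Z.

    z1_* describe units with Z=1 (encouraged), z0_* units with Z=0.
    Returns (late, first_stage, reduced_form).
    """
    first_stage = z1_treated / z1_n - z0_treated / z0_n     # effect of Z on uptake
    reduced_form = z1_outcomes / z1_n - z0_outcomes / z0_n  # effect of Z on outcome
    if abs(first_stage) < min_first_stage:
        # Hypothetical guard: dividing by a tiny first stage gives unstable estimates.
        raise ValueError("weak instrument: first stage too close to zero")
    return reduced_form / first_stage, first_stage, reduced_form

# Hypothetical randomized encouragement to open the app, 1,000 users per arm.
late, first_stage, reduced_form = wald_late(
    z1_n=1000, z1_treated=600, z1_outcomes=140,
    z0_n=1000, z0_treated=200, z0_outcomes=100,
)  # 0.04 / 0.40, about 0.10
```

The result applies to compliers, the users whose app opening was changed by the encouragement. It says nothing about users who would always or never open the app.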

Causal machine learning

Standard supervised learning predicts Y; causal ML estimates heterogeneous treatment effects. Common meta-learners:

  • T-learner — separate models for treated and control; difference in predictions is CATE estimate.
  • S-learner — one model with treatment as a feature; compare predictions with T toggled.
  • X-learner — two-stage approach robust when treatment is imbalanced.
  • Causal forests / DR-learner: causal forests grow trees with honest splitting to estimate heterogeneous effects; the DR-learner regresses a doubly robust pseudo-outcome on covariates.

Use causal ML when the business question is who to treat, not whether treatment works on average. Validate with held-out randomized data when possible; observational uplift models drift when targeting policies change. Pair with Bayesian updating when effect sizes are small and prior knowledge is strong.
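A T-learner can be sketched in a few lines. The `SegmentMeanModel` class is a hypothetical stand-in for any regressor (it predicts the mean outcome per segment), and the rows are made-up data assumed to come from a randomized experiment.

```python
from collections import defaultdict
from statistics import mean

class SegmentMeanModel:
    """Stand-in for any regressor: predicts the mean outcome per segment."""

    def fit(self, segments, outcomes):
        grouped = defaultdict(list)
        for s, y in zip(segments, outcomes):
            grouped[s].append(y)
        self.means_ = {s: mean(ys) for s, ys in grouped.items()}
        return self

    def predict(self, segment):
        return self.means_[segment]

def t_learner(rows):
    """rows: (segment, treated, outcome). Returns a function segment -> CATE estimate."""
    treated = [(s, y) for s, t, y in rows if t == 1]
    control = [(s, y) for s, t, y in rows if t == 0]
    model_1 = SegmentMeanModel().fit(*zip(*treated))  # outcome model for treated
    model_0 = SegmentMeanModel().fit(*zip(*control))  # outcome model for control
    return lambda segment: model_1.predict(segment) - model_0.predict(segment)

# Hypothetical purchase indicators by segment and arm.
rows = (
    [("high_value", 1, y) for y in (1, 1, 0, 1)]
    + [("high_value", 0, y) for y in (1, 0, 0, 1)]
    + [("low_value", 1, y) for y in (0, 0, 1, 0)]
    + [("low_value", 0, y) for y in (0, 1, 0, 0)]
)
cate = t_learner(rows)
high = cate("high_value")  # 0.75 - 0.50 = 0.25
low = cate("low_value")    # 0.25 - 0.25 = 0.0
```

In this toy data the treatment only helps the high-value segment, which is the kind of signal a targeting policy would act on. With samples this small the estimates would be far too noisy to trust; the sketch shows the mechanics only.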

Worked example: promotional email and purchases

Setting: A retailer emails 20% of its customer base. Raw data shows a 12% purchase rate among email recipients vs 6% among non-recipients — a naive +6 percentage-point lift. Marketing wants credit for the campaign.

Step 1 — Draw the DAG. Past 90-day spend and loyalty tier influence both email assignment and future purchases. Email may also cause site visits (mediator) and purchases.

Step 2 — Check balance. Recipients have 3× higher historical spend. Confounding is present; naive comparison is biased upward.

Step 3 — Estimate propensity scores from spend, tenure, and category preferences. Match each treated user to the nearest control user on score. After matching, covariates balance within 0.1 standardized mean difference.
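The balance check in this step relies on the standardized mean difference (SMD). A minimal sketch, using hypothetical historical-spend values and the common pooled-standard-deviation form:

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical historical spend for treated users and two candidate control groups.
treated_spend = [10, 12, 14]
unmatched_controls = [8, 10, 12]
matched_controls = [10, 12, 14.3]

smd_before = standardized_mean_difference(treated_spend, unmatched_controls)  # 1.0
smd_after = standardized_mean_difference(treated_spend, matched_controls)
balanced = abs(smd_after) < 0.1
```

The 0.1 threshold is the rule of thumb used in the step above; the check should be repeated for every covariate in the propensity model, not just one.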

Step 4 — Estimate ATE. Matched sample shows 8.5% vs 7.8% purchase rates — adjusted lift ≈ +0.7 pp, far below the naive +6 pp. A doubly robust estimator with logistic propensity and gradient-boosted outcome model gives +0.9 pp (95% CI: 0.2 to 1.6 pp).
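The arithmetic behind these figures is worth spelling out. The inputs below are the numbers quoted in the example; the `targeting_share` quantity is a derived illustration, not a standard estimand.

```python
# Purchase rates from the worked example, in percent.
naive_lift_pp = 12.0 - 6.0      # recipients vs non-recipients
matched_lift_pp = 8.5 - 7.8     # after propensity matching

# Doubly robust estimate and 95% confidence interval, in percentage points.
dr_estimate_pp, dr_ci_pp = 0.9, (0.2, 1.6)

# Share of the naive lift that disappears once targeting is accounted for.
targeting_share = (naive_lift_pp - matched_lift_pp) / naive_lift_pp  # about 0.88

ci_excludes_zero = dr_ci_pp[0] > 0
matched_inside_ci = dr_ci_pp[0] <= matched_lift_pp <= dr_ci_pp[1]
```

Roughly 88% of the naive lift is attributable to who was targeted. The matched estimate falls inside the doubly robust interval, so the two adjusted estimates agree with each other, and the interval excludes zero.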

Step 5 — Decision. Email has a small but positive causal effect after accounting for targeting. ROI depends on email cost and margin per order. Run a randomized holdout next quarter to confirm and measure CATE for high-value segments.

Method selection decision table

Scenario | Preferred method | Key assumption
Can randomize treatment | A/B test (ITT) | Valid randomization, stable exposure
Observational, rich covariates | Propensity matching / IPW / doubly robust | No unmeasured confounders (sensitivity check)
Policy change at known date, control region | Difference-in-differences | Parallel pre-trends
Score cutoff assigns treatment | Regression discontinuity | No manipulation at cutoff
Non-compliance, valid instrument | Instrumental variables (2SLS) | Exclusion restriction holds
Who should receive treatment? | Causal ML (T-/X-learner, causal forest) | Same identification as above + honest validation

Common pitfalls

  • Trusting naive before/after — seasonality and maturation mimic treatment effects.
  • Ignoring targeting bias — marketing and product teams rarely assign treatment at random.
  • Conditioning on colliders — slicing on engagement or conversion funnels opens false paths.
  • Over-controlling mediators — blocks the effect you want to measure for go/no-go decisions.
  • Unmeasured confounding — no statistical test proves absence; run sensitivity analysis (e.g., Rosenbaum bounds).
  • Extrapolating IV or RDD estimates — LATE and local RDD effects may not generalize to all users.
  • Peeking and optional stopping — same problem as in frequentist A/B tests; pre-register analysis plans.
  • Uplift model drift — retrain when assignment policy or population shifts.

Practitioner checklist

  • State the causal question and estimand (ATE, CATE, LATE) before touching data.
  • Draw a DAG with treatment, outcome, confounders, mediators, and colliders.
  • Prefer randomization when feasible; document why if not.
  • Check covariate balance between treated and control groups.
  • Report adjusted estimates with confidence intervals, not only p-values.
  • Run sensitivity analysis for unmeasured confounding on observational claims.
  • Separate total effect (launch decision) from direct effect (mechanism).
  • Validate uplift models on randomized holdouts before production targeting.
  • Pre-register analysis plans for high-stakes decisions.

Key takeaways

  • Causal inference estimates effects of interventions; prediction alone cannot answer “what if we change X?”
  • DAGs clarify which variables to adjust for — and which slices to avoid.
  • Randomized experiments are the cleanest identification strategy; observational methods require explicit assumptions.
  • Propensity scores, DiD, RDD, and IV each fit different real-world constraints — choose by design, not convenience.
  • Pair causal estimates with rigorous experimentation and honest reporting of uncertainty.
