Guide
Markov chain Monte Carlo (MCMC) explained
Harbor Analytics ran a checkout redesign test across 12 geographic regions. Each region had thin traffic — 200–800 sessions per week — so independent Beta-Binomial estimates swung wildly. A hierarchical Bayesian model that pooled information across regions (partial pooling) produced sensible regional posteriors, but the joint posterior over 24 conversion rates plus hyperparameters had no closed form. Analysts switched from grid approximations to Markov chain Monte Carlo (MCMC), drawing 40,000 correlated samples in under 90 seconds on a laptop. False-positive “winner” calls dropped 34% versus per-region frequentist tests. MCMC is a family of algorithms that construct a Markov chain whose long-run distribution equals a target you cannot sample from directly — typically a Bayesian posterior. This guide covers the core intuition, Metropolis-Hastings and Gibbs sampling, Hamiltonian Monte Carlo (HMC) at a high level, burn-in and convergence diagnostics, a Harbor Analytics hierarchical A/B refactor, a method decision table, pitfalls, and a production checklist.
Why posteriors become intractable
Bayes’ theorem is simple on paper:
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta).
For conjugate pairs (Beta-Binomial, Normal-Normal) you integrate analytically
and get a known distribution. Real models break that convenience quickly:
- Hierarchical structure — region-level rates drawn from a shared hyperprior; the joint posterior couples every region.
- Nonlinear likelihoods — logistic regression, survival models, or neural network weights have no closed-form posterior.
- Latent variables — mixture models, topic models, and item-response models integrate over hidden assignments.
- Constraints and combinatorics — order statistics, change-point models, or spatial random fields create awkward normalizing constants.
The normalizing constant
Z = \int p(D \mid \theta)\, p(\theta)\, d\theta is often the
hard part: you may evaluate the unnormalized posterior
\tilde{p}(\theta) = p(D \mid \theta)\, p(\theta) pointwise,
but not integrate it. MCMC sidesteps explicit integration: it generates
samples whose empirical distribution approximates
p(\theta \mid D).
MCMC intuition: ergodic sampling
Imagine you want to know the shape of a mountain lake (the posterior) but cannot measure depth everywhere. MCMC sends a boat that:
- Starts at a random point on the water.
- Proposes small moves according to simple rules.
- Accepts moves that visit high-probability regions more often.
- After many steps, the boat’s path visits each depth in proportion to how common that depth is.
Formally, you build a Markov chain
\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots with
transition kernel K(\theta' \mid \theta) such that the
stationary distribution is the target posterior. Under
ergodicity conditions (irreducibility, aperiodicity, detailed balance or
appropriate Metropolis balance), time averages converge to posterior
expectations:
\mathbb{E}[f(\theta) \mid D] \approx \frac{1}{N}\sum_{t=1}^{N} f(\theta^{(t)})
Unlike independent Monte Carlo (draw i.i.d. from the posterior), MCMC
samples are correlated. That correlation is the price of not
knowing Z. Diagnostics like effective sample size (ESS) and
\hat{R} (Gelman-Rubin) quantify how much information you
actually collected.
Metropolis-Hastings
The workhorse MCMC algorithm. At step t:
- Propose a candidate
\theta' \sim q(\theta' \mid \theta^{(t)})from a simple distribution (random walk Gaussian, uniform perturbation). - Compute acceptance ratio
\alpha = \min\!\left(1,\; \frac{\tilde{p}(\theta')\, q(\theta^{(t)} \mid \theta')}{\tilde{p}(\theta^{(t)})\, q(\theta' \mid \theta^{(t)})}\right). NoticeZcancels — you only need ratios of unnormalized posteriors. - Accept or reject: with probability
\alpha, set\theta^{(t+1)} = \theta'; otherwise\theta^{(t+1)} = \theta^{(t)}.
When the proposal is symmetric (q(\theta' \mid \theta) = q(\theta \mid \theta')),
this reduces to the original Metropolis algorithm. Tuning the proposal
scale is critical: too small and the chain moves slowly (high autocorrelation);
too large and almost every proposal is rejected. Adaptive samplers during
burn-in can auto-tune step size toward an acceptance rate near 0.2–0.4
in moderate dimensions.
Metropolis-Hastings is general — it works on any model where you can
evaluate \log \tilde{p}(\theta) — but high-dimensional
random-walk proposals explore space inefficiently. That is where Gibbs
sampling and HMC help.
Gibbs sampling
When the full posterior is hard but conditional posteriors are tractable, Gibbs sampling cycles through coordinates or blocks:
- Sample
\theta_1 \sim p(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, D) - Sample
\theta_2 \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, D) - Sample
\theta_3 \sim p(\theta_3 \mid \theta_1^{(t+1)}, \theta_2^{(t+1)}, D) - Repeat.
Each step is an exact draw from a conditional — acceptance is always 1. Gibbs shines in conjugate hierarchical models: Normal-Normal hierarchies, linear regression with known variance, some mixture models with data augmentation. When a conditional lacks a closed form, you embed a Metropolis-Hastings step inside the cycle (Metropolis-within-Gibbs).
The downside: strong posterior correlation between parameters can make
Gibbs crawl. If \theta_1 and \theta_2 are
nearly deterministically linked given data, alternating updates move
along a narrow ridge. Reparameterization (non-centered parameterizations
in hierarchical models) or joint block updates via HMC often fix this.
Hamiltonian Monte Carlo and modern samplers
Hamiltonian Monte Carlo (HMC) treats the log-posterior as a potential energy landscape and simulates Hamiltonian dynamics with momentum variables. Gradient information guides long, informed leaps instead of blind random walks. No-U-Turn Sampler (NUTS), the default in Stan, PyMC, and NumPyro, adaptively chooses trajectory length and step size.
HMC/NUTS dominates Bayesian workflow tools today because it scales better to hundreds of parameters and handles the funnel geometries that break Gibbs. You need differentiable log-posteriors (automatic differentiation through PyTorch/JAX) and should watch for divergences — they signal regions the sampler cannot explore faithfully, often from bad parameterization.
Other variants matter in specialized settings: slice sampling for univariate adaptive stepping; elliptical slice sampling for Gaussian priors with Gaussian likelihoods; sequential Monte Carlo when the target is a filtering distribution over time rather than a static posterior; and parallel tempering for multimodal posteriors with well-separated modes.
Burn-in, thinning, and convergence diagnostics
Discard early samples (burn-in)
The chain starts from an arbitrary point — often a prior draw or MAP
estimate — and needs time to reach the stationary region. Early
samples are biased toward the starting point. Discard the first
B iterations (warmup/burn-in). Modern HMC implementations
adapt step size during warmup, so “burn-in” doubles as tuning.
Thinning is usually unnecessary
Keeping every kth sample to reduce autocorrelation wastes
compute: you can compute ESS on the full chain and run longer instead.
Thin only when storage is genuinely constrained.
Diagnostic toolkit
- Trace plots — multiple chains should look like stationary, overlapping hairy caterpillars, not trends or stuck regions.
\hat{R}(R-hat) — compare within-chain and between-chain variance across 4+ chains started from dispersed points. Values near 1.0 (rule of thumb: < 1.01) suggest convergence.- Effective sample size (ESS) — how many approximately independent samples your correlated chain represents. Low ESS per parameter means noisy posterior mean/interval estimates.
- Divergences (HMC) — count and investigate; reparameterize or tighten priors.
- Posterior predictive checks — simulate data from posterior draws; systematic misfit means model or sampler problems.
Never report MCMC results from a single chain without diagnostics. At minimum:
4 chains, 1,000+ post-warmup iterations each (more for hard targets), and
check \hat{R} and ESS for every quantity you will quote.
Harbor Analytics hierarchical A/B refactor
Problem. Twelve regions, control vs treatment conversion
counts (s_i, n_i). Independent Beta(1,1) posteriors per region
were too wide; a fully pooled single rate ignored real regional differences.
Model. Hierarchical Beta-Binomial with partial pooling:
- Hyperprior:
\mu \sim \mathrm{Beta}(2, 2),\kappa \sim \mathrm{HalfNormal}(5) - Region rates:
\theta_i \sim \mathrm{Beta}(\mu\kappa, (1-\mu)\kappa) - Likelihood:
s_i \sim \mathrm{Binomial}(n_i, \theta_i)for treatment; separate\phi_ifor control with the same hierarchy - Target estimand: posterior of
\delta_i = \theta_i - \phi_iper region and global\bar{\delta}
Sampler. PyMC with NUTS, non-centered parameterization for
\theta_i, four chains, 2,000 tuning + 8,000 draws each.
Runtime: 87 seconds on Apple M2 / 42 seconds on a 16-core server.
Outcome. Three regions that looked like +4% lifts under
independent tests had 95% credible intervals crossing zero after pooling.
Two underpowered regions gained precision from shrinkage toward the global
mean without being forced to identical rates. Product shipped the redesign
globally with region-specific rollout gates tied to
P(\delta_i > 0.005 \mid D).
Method decision table
| Situation | Prefer | Why |
|---|---|---|
| Conjugate prior-likelihood pair | Closed-form posterior | Exact, instant; no sampling variance. |
| Low-dimensional, black-box log-posterior | Metropolis-Hastings or slice sampling | Simple to implement; tune proposal carefully. |
| Conjugate conditionals available | Gibbs sampling | High acceptance; easy blocks in hierarchical models. |
| 10–10,000 continuous parameters, differentiable | HMC / NUTS (Stan, PyMC, NumPyro) | State-of-the-art efficiency; built-in diagnostics. |
| Very large data, neural network weights | Stochastic gradient MCMC or variational inference | Full-data likelihood per step is too expensive. |
| Sequential state filtering online | Particle filter or Kalman filter | MCMC is batch; filters update beliefs per timestep. |
| Need fast approximate intervals at scale | Variational inference (ADVI, Laplace) | Optimization not sampling; can underestimate uncertainty. |
| Multimodal posterior, isolated modes | Parallel tempering or tempered transitions | Single-chain HMC may miss modes. |
Common pitfalls
- Stopping when the chain “looks fine”: visual
inspection alone misses slow mixing; always compute
\hat{R}and ESS. - One chain: cannot detect non-convergence or label switching; run at least four dispersed chains.
- Improper priors: flat priors on variances or unbounded scales can produce non-normalizable posteriors; use weakly informative priors.
- Centered hierarchical parameterization: Neal’s funnel
causes divergences; switch to non-centered when
\kappais weakly identified. - Confusing MCMC error with model error: a converged sampler from a misspecified model still gives wrong answers; do posterior predictive checks.
- Reporting only posterior means: full distributions matter for decisions; quote credible intervals and probabilities of practical significance.
- Using MCMC for conjugate models: wastes compute; use
analytic posteriors or
scipy.stats/ conjugate updates first.
Practitioner checklist
- Verify no conjugate closed form exists before reaching for MCMC.
- Write the generative model explicitly (priors, likelihood, estimands) before coding.
- Choose a mature sampler (PyMC, Stan, NumPyro, Turing.jl) rather than hand-rolling Metropolis for production models.
- Use non-centered parameterizations for hierarchical models with group-level variation.
- Run ≥ 4 chains from overdispersed starting points; set adequate warmup (1,000+ for HMC).
- Check
\hat{R} < 1.01and ESS > 400 per key parameter before quoting results. - Inspect trace plots, rank plots, and energy Bayesian fraction for HMC divergences.
- Perform posterior predictive checks against held-out or simulated data.
- Document priors, sampler settings, seed, and software versions for reproducibility.
- Compare runtime and uncertainty calibration against variational or grid baselines on a toy subset.
Key takeaways
- MCMC turns integration into sampling by building Markov chains whose stationary distribution is the target posterior.
- Metropolis-Hastings is general; Gibbs exploits tractable conditionals; HMC/NUTS scales to modern Bayesian models.
- Correlated samples demand convergence diagnostics — never trust a single short chain.
- Hierarchical and nonlinear models are the sweet spot where MCMC earns its keep over conjugate shortcuts.
- Pair MCMC with model criticism; a converged chain from the wrong model is confidently wrong.
Related reading
- Bayesian inference explained — priors, posteriors, and when full distributions beat point estimates
- Markov chains explained — transition matrices, stationarity, and ergodicity foundations
- Particle filter explained — sequential Monte Carlo for online state estimation
- A/B testing explained — frequentist experiment design that hierarchical Bayes can complement