Guide

Markov chain Monte Carlo (MCMC) explained

Harbor Analytics ran a checkout redesign test across 12 geographic regions. Each region had thin traffic — 200–800 sessions per week — so independent Beta-Binomial estimates swung wildly. A hierarchical Bayesian model that pooled information across regions (partial pooling) produced sensible regional posteriors, but the joint posterior over 24 conversion rates plus hyperparameters had no closed form. Analysts switched from grid approximations to Markov chain Monte Carlo (MCMC), drawing 40,000 correlated samples in under 90 seconds on a laptop. False-positive “winner” calls dropped 34% versus per-region frequentist tests. MCMC is a family of algorithms that construct a Markov chain whose long-run distribution equals a target you cannot sample from directly — typically a Bayesian posterior. This guide covers the core intuition, Metropolis-Hastings and Gibbs sampling, Hamiltonian Monte Carlo (HMC) at a high level, burn-in and convergence diagnostics, a Harbor Analytics hierarchical A/B refactor, a method decision table, pitfalls, and a production checklist.

Why posteriors become intractable

Bayes’ theorem is simple on paper: p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta). For conjugate pairs (Beta-Binomial, Normal-Normal) you integrate analytically and get a known distribution. Real models break that convenience quickly:

Hierarchical structure — region-level rates drawn from a shared hyperprior; the joint posterior couples every region.
Nonlinear likelihoods — logistic regression, survival models, or neural network weights have no closed-form posterior.
Latent variables — mixture models, topic models, and item-response models integrate over hidden assignments.
Constraints and combinatorics — order statistics, change-point models, or spatial random fields create awkward normalizing constants.

The normalizing constant Z = \int p(D \mid \theta)\, p(\theta)\, d\theta is often the hard part: you may evaluate the unnormalized posterior \tilde{p}(\theta) = p(D \mid \theta)\, p(\theta) pointwise, but not integrate it. MCMC sidesteps explicit integration: it generates samples whose empirical distribution approximates p(\theta \mid D).

MCMC intuition: ergodic sampling

Imagine you want to know the shape of a mountain lake (the posterior) but cannot measure depth everywhere. MCMC sends a boat that:

Starts at a random point on the water.
Proposes small moves according to simple rules.
Accepts moves that visit high-probability regions more often.
After many steps, the boat’s path visits each depth in proportion to how common that depth is.

Formally, you build a Markov chain \theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots with transition kernel K(\theta' \mid \theta) such that the stationary distribution is the target posterior. Under ergodicity conditions (irreducibility, aperiodicity, detailed balance or appropriate Metropolis balance), time averages converge to posterior expectations:

\mathbb{E}[f(\theta) \mid D] \approx \frac{1}{N}\sum_{t=1}^{N} f(\theta^{(t)})

Unlike independent Monte Carlo (draw i.i.d. from the posterior), MCMC samples are correlated. That correlation is the price of not knowing Z. Diagnostics like effective sample size (ESS) and \hat{R} (Gelman-Rubin) quantify how much information you actually collected.

Metropolis-Hastings

The workhorse MCMC algorithm. At step t:

Propose a candidate \theta' \sim q(\theta' \mid \theta^{(t)}) from a simple distribution (random walk Gaussian, uniform perturbation).
Compute acceptance ratio \alpha = \min\!\left(1,\; \frac{\tilde{p}(\theta')\, q(\theta^{(t)} \mid \theta')}{\tilde{p}(\theta^{(t)})\, q(\theta' \mid \theta^{(t)})}\right). Notice Z cancels — you only need ratios of unnormalized posteriors.
Accept or reject: with probability \alpha, set \theta^{(t+1)} = \theta'; otherwise \theta^{(t+1)} = \theta^{(t)}.

When the proposal is symmetric (q(\theta' \mid \theta) = q(\theta \mid \theta')), this reduces to the original Metropolis algorithm. Tuning the proposal scale is critical: too small and the chain moves slowly (high autocorrelation); too large and almost every proposal is rejected. Adaptive samplers during burn-in can auto-tune step size toward an acceptance rate near 0.2–0.4 in moderate dimensions.

Metropolis-Hastings is general — it works on any model where you can evaluate \log \tilde{p}(\theta) — but high-dimensional random-walk proposals explore space inefficiently. That is where Gibbs sampling and HMC help.

Gibbs sampling

When the full posterior is hard but conditional posteriors are tractable, Gibbs sampling cycles through coordinates or blocks:

Sample \theta_1 \sim p(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, D)
Sample \theta_2 \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, D)
Sample \theta_3 \sim p(\theta_3 \mid \theta_1^{(t+1)}, \theta_2^{(t+1)}, D)
Repeat.

Each step is an exact draw from a conditional — acceptance is always 1. Gibbs shines in conjugate hierarchical models: Normal-Normal hierarchies, linear regression with known variance, some mixture models with data augmentation. When a conditional lacks a closed form, you embed a Metropolis-Hastings step inside the cycle (Metropolis-within-Gibbs).

The downside: strong posterior correlation between parameters can make Gibbs crawl. If \theta_1 and \theta_2 are nearly deterministically linked given data, alternating updates move along a narrow ridge. Reparameterization (non-centered parameterizations in hierarchical models) or joint block updates via HMC often fix this.

Hamiltonian Monte Carlo and modern samplers

Hamiltonian Monte Carlo (HMC) treats the log-posterior as a potential energy landscape and simulates Hamiltonian dynamics with momentum variables. Gradient information guides long, informed leaps instead of blind random walks. No-U-Turn Sampler (NUTS), the default in Stan, PyMC, and NumPyro, adaptively chooses trajectory length and step size.

HMC/NUTS dominates Bayesian workflow tools today because it scales better to hundreds of parameters and handles the funnel geometries that break Gibbs. You need differentiable log-posteriors (automatic differentiation through PyTorch/JAX) and should watch for divergences — they signal regions the sampler cannot explore faithfully, often from bad parameterization.

Other variants matter in specialized settings: slice sampling for univariate adaptive stepping; elliptical slice sampling for Gaussian priors with Gaussian likelihoods; sequential Monte Carlo when the target is a filtering distribution over time rather than a static posterior; and parallel tempering for multimodal posteriors with well-separated modes.

Burn-in, thinning, and convergence diagnostics

Discard early samples (burn-in)

The chain starts from an arbitrary point — often a prior draw or MAP estimate — and needs time to reach the stationary region. Early samples are biased toward the starting point. Discard the first B iterations (warmup/burn-in). Modern HMC implementations adapt step size during warmup, so “burn-in” doubles as tuning.

Thinning is usually unnecessary

Keeping every kth sample to reduce autocorrelation wastes compute: you can compute ESS on the full chain and run longer instead. Thin only when storage is genuinely constrained.

Diagnostic toolkit

Trace plots — multiple chains should look like stationary, overlapping hairy caterpillars, not trends or stuck regions.
\hat{R} (R-hat) — compare within-chain and between-chain variance across 4+ chains started from dispersed points. Values near 1.0 (rule of thumb: < 1.01) suggest convergence.
Effective sample size (ESS) — how many approximately independent samples your correlated chain represents. Low ESS per parameter means noisy posterior mean/interval estimates.
Divergences (HMC) — count and investigate; reparameterize or tighten priors.
Posterior predictive checks — simulate data from posterior draws; systematic misfit means model or sampler problems.

Never report MCMC results from a single chain without diagnostics. At minimum: 4 chains, 1,000+ post-warmup iterations each (more for hard targets), and check \hat{R} and ESS for every quantity you will quote.

Harbor Analytics hierarchical A/B refactor

Problem. Twelve regions, control vs treatment conversion counts (s_i, n_i). Independent Beta(1,1) posteriors per region were too wide; a fully pooled single rate ignored real regional differences.

Model. Hierarchical Beta-Binomial with partial pooling:

Hyperprior: \mu \sim \mathrm{Beta}(2, 2), \kappa \sim \mathrm{HalfNormal}(5)
Region rates: \theta_i \sim \mathrm{Beta}(\mu\kappa, (1-\mu)\kappa)
Likelihood: s_i \sim \mathrm{Binomial}(n_i, \theta_i) for treatment; separate \phi_i for control with the same hierarchy
Target estimand: posterior of \delta_i = \theta_i - \phi_i per region and global \bar{\delta}

Sampler. PyMC with NUTS, non-centered parameterization for \theta_i, four chains, 2,000 tuning + 8,000 draws each. Runtime: 87 seconds on Apple M2 / 42 seconds on a 16-core server.

Outcome. Three regions that looked like +4% lifts under independent tests had 95% credible intervals crossing zero after pooling. Two underpowered regions gained precision from shrinkage toward the global mean without being forced to identical rates. Product shipped the redesign globally with region-specific rollout gates tied to P(\delta_i > 0.005 \mid D).

Method decision table

Situation	Prefer	Why
Conjugate prior-likelihood pair	Closed-form posterior	Exact, instant; no sampling variance.
Low-dimensional, black-box log-posterior	Metropolis-Hastings or slice sampling	Simple to implement; tune proposal carefully.
Conjugate conditionals available	Gibbs sampling	High acceptance; easy blocks in hierarchical models.
10–10,000 continuous parameters, differentiable	HMC / NUTS (Stan, PyMC, NumPyro)	State-of-the-art efficiency; built-in diagnostics.
Very large data, neural network weights	Stochastic gradient MCMC or variational inference	Full-data likelihood per step is too expensive.
Sequential state filtering online	Particle filter or Kalman filter	MCMC is batch; filters update beliefs per timestep.
Need fast approximate intervals at scale	Variational inference (ADVI, Laplace)	Optimization not sampling; can underestimate uncertainty.
Multimodal posterior, isolated modes	Parallel tempering or tempered transitions	Single-chain HMC may miss modes.

Common pitfalls

Stopping when the chain “looks fine”: visual inspection alone misses slow mixing; always compute \hat{R} and ESS.
One chain: cannot detect non-convergence or label switching; run at least four dispersed chains.
Improper priors: flat priors on variances or unbounded scales can produce non-normalizable posteriors; use weakly informative priors.
Centered hierarchical parameterization: Neal’s funnel causes divergences; switch to non-centered when \kappa is weakly identified.
Confusing MCMC error with model error: a converged sampler from a misspecified model still gives wrong answers; do posterior predictive checks.
Reporting only posterior means: full distributions matter for decisions; quote credible intervals and probabilities of practical significance.
Using MCMC for conjugate models: wastes compute; use analytic posteriors or scipy.stats / conjugate updates first.

Practitioner checklist

Verify no conjugate closed form exists before reaching for MCMC.
Write the generative model explicitly (priors, likelihood, estimands) before coding.
Choose a mature sampler (PyMC, Stan, NumPyro, Turing.jl) rather than hand-rolling Metropolis for production models.
Use non-centered parameterizations for hierarchical models with group-level variation.
Run ≥ 4 chains from overdispersed starting points; set adequate warmup (1,000+ for HMC).
Check \hat{R} < 1.01 and ESS > 400 per key parameter before quoting results.
Inspect trace plots, rank plots, and energy Bayesian fraction for HMC divergences.
Perform posterior predictive checks against held-out or simulated data.
Document priors, sampler settings, seed, and software versions for reproducibility.
Compare runtime and uncertainty calibration against variational or grid baselines on a toy subset.

Key takeaways

MCMC turns integration into sampling by building Markov chains whose stationary distribution is the target posterior.
Metropolis-Hastings is general; Gibbs exploits tractable conditionals; HMC/NUTS scales to modern Bayesian models.
Correlated samples demand convergence diagnostics — never trust a single short chain.
Hierarchical and nonlinear models are the sweet spot where MCMC earns its keep over conjugate shortcuts.
Pair MCMC with model criticism; a converged chain from the wrong model is confidently wrong.