
Variational inference explained

Harbor Analytics monitors topic drift in 2.4 million customer-support tickets per quarter. A latent Dirichlet allocation (LDA) model with 40 topics and hierarchical priors captures emerging product complaints, but the joint posterior over topic proportions and word distributions has no closed form. An MCMC refresh on a 500K-ticket subset took 18 hours on a 32-core box, which is too slow for hourly dashboards. Variational inference (VI) reframes posterior approximation as optimization: pick a tractable family of distributions q(\theta), measure its divergence from the true posterior p(\theta \mid D), and minimize that divergence by maximizing a quantity called the evidence lower bound (ELBO). Harbor’s VI pipeline converges in 11 minutes with topic rankings that match MCMC on 94% of top-10 terms (Kendall’s τ = 0.91). This guide explains why VI trades exactness for speed, builds intuition for the ELBO, covers mean-field coordinate-ascent updates and black-box gradients via the reparameterization trick, walks through the Harbor LDA refactor, contrasts VI with MCMC and conjugate methods, lists pitfalls, and ends with a practitioner checklist.

The optimization view of approximate Bayes

Exact Bayesian inference computes p(\theta \mid D) = p(D \mid \theta)\, p(\theta) / p(D). When the integral p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta is intractable, you have two broad strategies:

  • Sampling (MCMC) — generate correlated draws whose empirical distribution approaches the posterior. Accurate but can be slow, especially in high dimensions or with tight coupling between parameters.
  • Optimization (VI) — posit a simpler distribution q_\lambda(\theta) parameterized by \lambda, and find \lambda that makes q closest to p(\theta \mid D) under a chosen divergence.

VI does not produce exact posterior samples. It produces a surrogate you can evaluate, differentiate through, and refresh cheaply. That makes it the default engine inside large-scale topic models, variational autoencoders, and many probabilistic programming pipelines where MCMC latency would block production.

ELBO: the objective you actually maximize

VI typically minimizes Kullback-Leibler (KL) divergence from q to the posterior: \mathrm{KL}(q \,\|\, p(\cdot \mid D)). Direct minimization still requires p(D), but algebra rearranges the problem into maximizing the ELBO:

\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}[\log p(D \mid \theta)] - \mathrm{KL}(q_\lambda(\theta) \,\|\, p(\theta))
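
The rearrangement takes only Bayes' rule. The steps below are the standard derivation, written with the symbols already defined in this guide:

```latex
\begin{aligned}
\mathrm{KL}(q_\lambda \,\|\, p(\cdot \mid D))
  &= \mathbb{E}_{q_\lambda}[\log q_\lambda(\theta) - \log p(\theta \mid D)] \\
  &= \mathbb{E}_{q_\lambda}[\log q_\lambda(\theta) - \log p(D \mid \theta) - \log p(\theta)] + \log p(D) \\
  &= \mathrm{KL}(q_\lambda(\theta) \,\|\, p(\theta)) - \mathbb{E}_{q_\lambda}[\log p(D \mid \theta)] + \log p(D) \\
  &= \log p(D) - \mathcal{L}(\lambda)
\end{aligned}
```

Since \log p(D) does not depend on \lambda, minimizing the KL divergence on the left is the same problem as maximizing \mathcal{L}(\lambda).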

The ELBO decomposes into two interpretable terms:

  • Expected log-likelihood \mathbb{E}_q[\log p(D \mid \theta)] — how well parameters drawn from q explain the data.
  • Prior penalty \mathrm{KL}(q \,\|\, p(\theta)) — how far the approximate posterior drifts from the prior; acts like regularization.

Because \log p(D) = \mathcal{L}(\lambda) + \mathrm{KL}(q \,\|\, p(\cdot \mid D)) and KL is non-negative, the ELBO is a lower bound on the log marginal likelihood \log p(D). Since \log p(D) does not depend on \lambda, every increase in the ELBO is an equal decrease in the KL divergence, so maximizing the ELBO tightens the bound while pulling q toward the true posterior. You never compute p(D) explicitly; you only need the unnormalized terms \log p(D \mid \theta) + \log p(\theta) inside expectations over q.
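
The identity can be checked numerically on a model where everything is available in closed form. The sketch below assumes a Normal likelihood with known noise variance and a Normal prior on the mean (not Harbor's model); the data values and function names are illustrative. It shows that the gap between \log p(D) and the ELBO equals the KL divergence from q to the exact posterior, and that the gap vanishes when q is the exact posterior.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def exact_posterior(data, prior_mean, prior_var, noise_var):
    """Conjugate Normal-Normal posterior over theta: returns (mean, variance)."""
    precision = 1.0 / prior_var + len(data) / noise_var
    mean = (prior_mean / prior_var + sum(data) / noise_var) / precision
    return mean, 1.0 / precision

def log_evidence(data, prior_mean, prior_var, noise_var):
    """log p(D) from Bayes' rule: log p(D|t) + log p(t) - log p(t|D), valid for any t."""
    post_mean, post_var = exact_posterior(data, prior_mean, prior_var, noise_var)
    t = post_mean
    log_lik = sum(log_normal_pdf(x, t, noise_var) for x in data)
    return (log_lik + log_normal_pdf(t, prior_mean, prior_var)
            - log_normal_pdf(t, post_mean, post_var))

def kl_normal(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for univariate Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(data, q_mean, q_var, prior_mean, prior_var, noise_var):
    """ELBO for q = N(q_mean, q_var): expected log-likelihood minus KL(q || prior)."""
    expected_log_lik = sum(
        -0.5 * (math.log(2.0 * math.pi * noise_var)
                + ((x - q_mean) ** 2 + q_var) / noise_var)
        for x in data
    )
    return expected_log_lik - kl_normal(q_mean, q_var, prior_mean, prior_var)

# Illustrative data and hyperparameters (assumptions, not from the Harbor case).
data = [1.2, 0.7, 1.9, 1.4, 0.8]
prior_mean, prior_var, noise_var = 0.0, 4.0, 1.0
post_mean, post_var = exact_posterior(data, prior_mean, prior_var, noise_var)
evidence = log_evidence(data, prior_mean, prior_var, noise_var)

for q_mean, q_var in [(0.5, 0.3), (1.0, 0.2), (post_mean, post_var)]:
    bound = elbo(data, q_mean, q_var, prior_mean, prior_var, noise_var)
    gap = evidence - bound
    kl_to_post = kl_normal(q_mean, q_var, post_mean, post_var)
    print(f"q=N({q_mean:.3f}, {q_var:.3f})  ELBO={bound:.4f}  "
          f"gap={gap:.4f}  KL(q||posterior)={kl_to_post:.4f}")
```

In real VI problems the exact posterior and \log p(D) are unavailable, so this check is only possible on toy conjugate models, which is what makes them useful for testing an ELBO implementation.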

Mean-field variational inference

The simplest variational family assumes factorization (mean-field):

q(\theta) = \prod_{j=1}^{J} q_j(\theta_j)

Each factor q_j is chosen from a convenient exponential family (Gaussian, Gamma, Dirichlet) so expectations and entropies are analytic. Mean-field VI ignores posterior correlations between parameters. That is a crude approximation when parameters interact strongly (e.g. hierarchical shrinkage), but it is often an acceptable price for inference that scales to large data.

Coordinate ascent VI (CAVI) optimizes one factor at a time while holding the others fixed. For conjugate-exponential models each update has a closed form, so the algorithm simply cycles through the factors until the ELBO plateaus. Classic examples: Bayesian mixture models, LDA (where CAVI is the original scalable training algorithm), and naive Bayes with unobserved class labels.
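
A small conjugate example makes the alternating updates concrete. The sketch below is the textbook case of a Gaussian with unknown mean \mu and precision \tau under a Normal-Gamma prior, with a factorized q(\mu) q(\tau); it is not Harbor's model, and the function name and hyperparameter defaults are assumptions chosen for illustration.

```python
import random

def cavi_normal_gamma(data, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, iters=50, tol=1e-12):
    """CAVI for x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).

    Returns (mu_n, lam_n, a_n, b_n) with q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
    """
    n = len(data)
    xbar = sum(data) / n
    # These two quantities do not depend on the other factor, so they are set once.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = 1.0  # initial guess for E_q[tau]
    lam_n = b_n = None
    for _ in range(iters):
        # Update q(mu): its precision uses the current E_q[tau].
        lam_n = (lam0 + n) * e_tau
        # Update q(tau): expectations under q(mu) add 1/lam_n to each squared term.
        sq = sum((x - mu_n) ** 2 + 1.0 / lam_n for x in data)
        b_n = b0 + 0.5 * (sq + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        new_e_tau = a_n / b_n
        converged = abs(new_e_tau - e_tau) < tol
        e_tau = new_e_tau
        if converged:
            break
    return mu_n, lam_n, a_n, b_n

# Synthetic data: true mean 2.0, true standard deviation 0.5 (precision 4.0).
rng = random.Random(0)
data = [rng.gauss(2.0, 0.5) for _ in range(500)]
mu_n, lam_n, a_n, b_n = cavi_normal_gamma(data)
print("E[mu] =", round(mu_n, 4), " E[tau] =", round(a_n / b_n, 4))
```

Each pass updates one factor using expectations under the other, which is the pattern that LDA's closed-form updates follow at a larger scale. Here convergence is monitored through the change in E[\tau] rather than the ELBO, purely to keep the sketch short.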

When mean-field breaks

Strong coupling, such as funnel-shaped hierarchical posteriors or multimodal posteriors, defeats mean-field assumptions. A unimodal Gaussian q fit by minimizing \mathrm{KL}(q \,\|\, p) typically locks onto a single posterior mode, ignores the others, and underestimates uncertainty. In those cases, richer families (full-rank Gaussian, normalizing flows) or MCMC validation become necessary.
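
The cost of ignoring correlations can be quantified on a Gaussian target, where the optimal mean-field factors are known in closed form: each factor's variance is the reciprocal of the corresponding diagonal entry of the target's precision matrix. The sketch below assumes a two-dimensional target with unit marginal variances; the function name is illustrative.

```python
def mean_field_variances(cov):
    """Variances of the optimal factorized Gaussian under KL(q || p) for a 2x2 Gaussian target.

    For a Gaussian target with precision matrix P (the inverse covariance),
    the optimal mean-field factor j has variance 1 / P[j][j].
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Diagonal of the inverse of [[a, b], [c, d]] is (d / det, a / det).
    prec_11, prec_22 = d / det, a / det
    return 1.0 / prec_11, 1.0 / prec_22

for rho in (0.0, 0.5, 0.9, 0.99):
    cov = [[1.0, rho], [rho, 1.0]]
    var_1, _ = mean_field_variances(cov)
    print(f"correlation={rho:.2f}  true marginal variance=1.00  mean-field variance={var_1:.4f}")
```

With unit marginal variances the mean-field variance works out to 1 - \rho^2, so the stronger the posterior correlation, the more the factorized approximation shrinks the credible intervals.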

Black-box VI and the reparameterization trick

When closed-form CAVI updates do not exist — neural network weights, deep generative models, non-conjugate likelihoods — black-box VI estimates gradients of the ELBO with Monte Carlo samples from q_\lambda.

The reparameterization trick writes \theta = g(\lambda, \epsilon) where \epsilon \sim p(\epsilon) is noise independent of \lambda. Then:

\nabla_\lambda \mathbb{E}_{q_\lambda}[f(\theta)] = \mathbb{E}_\epsilon[\nabla_\lambda f(g(\lambda, \epsilon))]

Because the expectation is now over noise that does not depend on \lambda, the gradient moves inside it, and the resulting Monte Carlo estimates usually have low variance. That enables stochastic gradient ascent on \lambda with minibatches, the same machinery behind variational autoencoders (VAEs), ADVI in Stan and PyMC, and the variational inference tooling in TensorFlow Probability. When reparameterization is impossible (discrete latents), score-function (REINFORCE) gradient estimators still work but carry higher variance.
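
The variance difference between the two estimators is easy to see on a toy problem. The sketch below assumes f(\theta) = \theta^2 and q = N(\mu, \sigma^2), for which the exact gradient with respect to \mu is 2\mu; the function name and the specific values of \mu and \sigma are illustrative choices.

```python
import random
import statistics

def gradient_samples(mu, sigma, n, seed=0):
    """Per-sample estimates of d/dmu E_q[theta^2] for q = N(mu, sigma^2).

    Returns (reparam, score): two lists of single-sample gradient estimates.
    """
    rng = random.Random(seed)
    reparam, score = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        theta = mu + sigma * eps  # theta = g(lambda, eps)
        # Reparameterization: differentiate f(mu + sigma*eps) with respect to mu.
        reparam.append(2.0 * theta)
        # Score function: f(theta) * d/dmu log q(theta), with d/dmu log q = (theta - mu) / sigma^2.
        score.append(theta ** 2 * (theta - mu) / sigma ** 2)
    return reparam, score

mu, sigma = 1.5, 0.8
reparam, score = gradient_samples(mu, sigma, n=20000)
print("exact gradient     :", 2.0 * mu)
print("reparam  mean / var:", statistics.fmean(reparam), statistics.variance(reparam))
print("score-fn mean / var:", statistics.fmean(score), statistics.variance(score))
```

Both estimators are unbiased, so their means agree with the exact gradient; the difference shows up in the per-sample variance, which determines how many samples a stochastic optimizer needs per step.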

Harbor Analytics LDA refactor

Harbor’s support corpus: 2.4M tickets, vocabulary size 28,000 after pruning, 40 topics, symmetric Dirichlet priors \alpha = 0.1 (document-topic) and \eta = 0.01 (topic-word). The generative story: each topic draws a word distribution \phi_k \sim \mathrm{Dir}(\eta); each document draws a topic mixture \theta_d \sim \mathrm{Dir}(\alpha); each word position draws a topic assignment z_{dn} \sim \mathrm{Cat}(\theta_d) and then a word w_{dn} \sim \mathrm{Cat}(\phi_{z_{dn}}).
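
The generative story can be read as a sampling procedure. The sketch below simulates it at toy scale (4 topics and a 10-word vocabulary instead of 40 and 28,000) with the same \alpha and \eta; the function names and sizes are assumptions for illustration and have nothing to do with Harbor's production code.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Symmetric Dirichlet draw: normalize independent Gamma(concentration, 1) samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topics, alpha, n_words):
    """Sample one document: returns (theta, [(z, w), ...]) with topic and word ids."""
    n_topics, vocab_size = len(topics), len(topics[0])
    theta = sample_dirichlet(rng, alpha, n_topics)  # document-topic mixture theta_d
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]       # topic assignment z_dn
        w = rng.choices(range(vocab_size), weights=topics[z])[0]  # word drawn from phi_z
        words.append((z, w))
    return theta, words

rng = random.Random(42)
n_topics, vocab_size = 4, 10  # toy sizes, far smaller than the real corpus
topics = [sample_dirichlet(rng, 0.01, vocab_size) for _ in range(n_topics)]  # phi_k ~ Dir(eta)
theta, words = generate_document(rng, topics, alpha=0.1, n_words=20)
print("theta_d:", [round(t, 3) for t in theta])
print("(topic, word) pairs:", words)
```

With concentrations this small, most of the mass in each \theta_d and \phi_k lands on a few entries, which is the sparsity the priors are meant to encode: tickets about few topics, topics about few words.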

MCMC baseline: collapsed Gibbs sampling on a 500K-ticket subset ran 18 hours for 2,000 post-burn-in iterations. Full-corpus runs were infeasible for hourly refresh.

VI refactor: mean-field CAVI with Dirichlet factors for \theta_d and \phi_k, plus stochastic subsampling of documents in each epoch, the combination usually called stochastic variational inference (SVI). The implementation lives in a probabilistic programming stack with automatic ELBO tracking. Results on held-out 200K tickets:

  • Runtime: 11 minutes on the same hardware (about 98x faster than the 18-hour MCMC run on the 500K-ticket subset).
  • Perplexity: 412 (VI) vs 398 (MCMC) — 3.5% gap, acceptable for monitoring dashboards.
  • Top-10 topic words: Kendall’s τ = 0.91 vs MCMC reference.
  • Alert precision on emerging “billing dispute” spike: 0.87 (VI) vs 0.89 (MCMC) — operational decisions unchanged.
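
For readers unfamiliar with the rank-agreement metric, Kendall's τ counts how many pairs of items two rankings order the same way versus the opposite way. The sketch below is a minimal tau-a implementation that assumes both lists contain the same terms with no ties; the term list is hypothetical, and how Harbor handled top-10 lists with differing terms is not stated in the source.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two orderings of the same items (no ties)."""
    position_in_b = {item: i for i, item in enumerate(rank_b)}
    order = [position_in_b[item] for item in rank_a]
    concordant = discordant = 0
    for i, j in combinations(range(len(order)), 2):
        if order[i] < order[j]:
            concordant += 1  # the pair is in the same relative order in both lists
        else:
            discordant += 1  # the pair is flipped
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical top-10 terms for one topic; the second list swaps two adjacent pairs.
mcmc_terms = ["refund", "charge", "invoice", "billing", "dispute",
              "card", "payment", "statement", "fee", "overcharge"]
vi_terms = ["charge", "refund", "invoice", "billing", "dispute",
            "card", "statement", "payment", "fee", "overcharge"]
print(round(kendall_tau(mcmc_terms, vi_terms), 4))
```

For ten items there are 45 pairs, so two adjacent swaps give (43 - 2) / 45, roughly 0.91. That gives a feel for how much reordering a τ in that range corresponds to.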

Harbor runs VI hourly for drift detection and retrains MCMC weekly on a stratified sample for audit calibration — a common production pattern when speed and fidelity trade off.
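
The per-document step of mean-field LDA can be sketched in a few lines. The code below assumes the standard variational updates for LDA (word-level responsibilities proportional to exp(E[\log \theta_{dk}] + E[\log \phi_{kw}]), followed by a Dirichlet update for the document factor), with the topic-word Dirichlet parameters held fixed. All names, the hand-rolled digamma, and the toy numbers are illustrative; this is not Harbor's implementation.

```python
import math

def digamma(x):
    """Digamma for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))

def expected_log_dirichlet(params):
    """E[log p_i] under Dirichlet(params)."""
    total = digamma(sum(params))
    return [digamma(p) - total for p in params]

def update_document(word_ids, topic_params, alpha, iters=50):
    """Local coordinate-ascent step: returns the Dirichlet parameters gamma for one document."""
    n_topics = len(topic_params)
    e_log_phi = [expected_log_dirichlet(row) for row in topic_params]
    gamma = [alpha + len(word_ids) / n_topics] * n_topics
    for _ in range(iters):
        # The shared -digamma(sum(gamma)) term cancels when responsibilities are normalized.
        e_log_theta = [digamma(g) for g in gamma]
        new_gamma = [alpha] * n_topics
        for w in word_ids:
            logits = [e_log_theta[k] + e_log_phi[k][w] for k in range(n_topics)]
            peak = max(logits)
            weights = [math.exp(v - peak) for v in logits]  # stabilized softmax
            norm = sum(weights)
            for k in range(n_topics):
                new_gamma[k] += weights[k] / norm
        gamma = new_gamma
    return gamma

# Toy setup: 2 topics over a 4-word vocabulary (hypothetical numbers).
topic_params = [[50.0, 50.0, 1.0, 1.0],   # topic 0 favors words 0 and 1
                [1.0, 1.0, 50.0, 50.0]]   # topic 1 favors words 2 and 3
gamma = update_document([0, 1, 0, 1, 0], topic_params, alpha=0.1)
print([round(g, 3) for g in gamma])
```

In a stochastic variant, this local step runs on a sampled batch of documents, and the resulting responsibilities feed a noisy update of the topic-word parameters.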

Method decision table

| Method | When to prefer it | Cost / fidelity |
| --- | --- | --- |
| Conjugate closed form | Beta-Binomial, Normal-Normal, small discrete state | Exact, instant; model scope limited |
| Mean-field / CAVI VI | Large data, exponential-family models, production refresh | Fast; underestimates variance, misses multimodality |
| Black-box VI (ADVI, VAE) | Neural generative models, non-conjugate likelihoods | Scales with SGD; local optima, mode-seeking KL |
| MCMC (HMC/NUTS) | Low-dimensional, high-stakes inference, audit trail | Gold-standard asymptotics; slow, tuning burden |
| Particle filters | Sequential state estimation, multimodal tracking | Online; particle degeneracy in high dimensions |
| Gaussian processes | Small-n regression with calibrated uncertainty | Exact posterior over functions under a Gaussian likelihood; cost cubic in n |

A practical workflow: start with VI for speed, spot-check critical parameters with short MCMC runs or simulation-based calibration, and escalate to full MCMC when decisions are irreversible or regulatory.

Common pitfalls

  • Mean-field correlation blindness: posterior uncertainty on one parameter depends on another; factorized q shrinks credible intervals. Use full-rank Gaussian or structured VI when coupling is known.
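  • KL direction matters: minimizing \mathrm{KL}(q \,\|\, p) is mode-seeking (q locks onto one peak and ignores the rest); minimizing \mathrm{KL}(p \,\|\, q) is mass-covering (q spreads over all peaks) but harder to optimize because it needs expectations under p. Know which your library uses.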
  • Local ELBO optima: random restarts and learning-rate schedules help; compare ELBO across runs and inspect posterior predictive checks.
  • Underestimated tails: VI with Gaussian q on heavy-tailed posteriors produces overconfident predictions. Student-t factors or MCMC spot checks catch this.
  • Ignoring prior sensitivity: the KL term pulls q toward the prior, so prior choices shape the fit; with very weak priors and a restrictive mean-field family, the approximation can behave almost as if the prior were improper. Always run prior predictive simulations.
  • Equating ELBO with model quality: higher ELBO means better variational fit, not necessarily better out-of-sample prediction. Hold out data for perplexity, log-score, or decision metrics.
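
The KL-direction pitfall is easy to reproduce. The sketch below assumes a hypothetical bimodal target (an equal mixture of N(-3, 1) and N(3, 1)) and fits a single Gaussian two ways: a coarse grid search that minimizes a numerical estimate of \mathrm{KL}(q \,\|\, p), and moment matching, which is the exact minimizer of \mathrm{KL}(p \,\|\, q) over Gaussians. The grid ranges and names are illustrative.

```python
import math

MODES = [(-3.0, 1.0, 0.5), (3.0, 1.0, 0.5)]  # (mean, sd, weight) of the hypothetical target

def normal_log_pdf(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def target_pdf(x):
    return sum(w * math.exp(normal_log_pdf(x, m, s)) for m, s, w in MODES)

GRID_STEP = 0.05
GRID = [-15.0 + GRID_STEP * i for i in range(601)]  # integration grid on [-15, 15]
LOG_TARGET = [math.log(target_pdf(x)) for x in GRID]

def reverse_kl(mean, sd):
    """Riemann-sum estimate of KL(q || p) for q = N(mean, sd^2)."""
    total = 0.0
    for x, log_p in zip(GRID, LOG_TARGET):
        log_q = normal_log_pdf(x, mean, sd)
        q = math.exp(log_q)
        if q > 0.0:  # skip points where q underflows to zero
            total += q * (log_q - log_p) * GRID_STEP
    return total

def fit_reverse_kl():
    """Grid search over candidate (mean, sd) pairs; returns the pair with the lowest KL(q || p)."""
    candidates = [(-4.0 + 0.5 * i, 0.5 + 0.25 * j) for i in range(17) for j in range(15)]
    return min(candidates, key=lambda c: reverse_kl(*c))

def fit_forward_kl():
    """Moment matching: the Gaussian minimizing KL(p || q) copies p's mean and variance."""
    mean = sum(w * m for m, s, w in MODES)
    second_moment = sum(w * (s * s + m * m) for m, s, w in MODES)
    return mean, math.sqrt(second_moment - mean * mean)

print("mode-seeking  KL(q||p) fit (mean, sd):", fit_reverse_kl())
print("mass-covering KL(p||q) fit (mean, sd):", fit_forward_kl())
```

The mode-seeking fit sits on one peak with roughly that peak's width, while the mass-covering fit is centered between the peaks with a much larger spread. Most VI libraries optimize the first objective, which is why VI intervals tend to be too narrow on multimodal posteriors.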

Practitioner checklist

  • Write the generative model and identify which parameters need approximate posteriors.
  • Check for conjugate structure before reaching for VI or MCMC.
  • Choose variational family: mean-field, full-rank Gaussian, or flow-based.
  • Track the ELBO each iteration; a plateau usually signals convergence, but it can also mean the learning rate is too small or decayed too early.
  • Compare VI posteriors to short MCMC on a subset for calibration.
  • Run posterior predictive checks on held-out data, not just in-sample ELBO.
  • Report uncertainty honestly: VI intervals are often too narrow.
  • For discrete latents, plan score-function gradients or continuous relaxations.
  • Document refresh cadence and when full MCMC audit runs occur.
  • Version model priors and VI seeds so dashboards are reproducible.
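
The ELBO-tracking item can be turned into a simple stopping rule. The sketch below is one possible convention, assuming a relative-change threshold held over several consecutive iterations; the function name, tolerance, and window are illustrative defaults, not recommendations from any particular library.

```python
def elbo_converged(history, rel_tol=1e-4, window=5):
    """True when the relative ELBO change stayed below rel_tol for `window` consecutive steps."""
    if len(history) <= window:
        return False  # not enough iterations to judge
    recent = history[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        if abs(curr - prev) > rel_tol * max(abs(prev), 1e-12):
            return False
    return True

# Hypothetical ELBO traces: one still climbing, one that has flattened out.
climbing = [-1000.0, -900.0, -800.0, -700.0, -600.0, -500.0, -400.0]
flat = [-1000.0, -500.0, -400.0] + [-390.0] * 6
print(elbo_converged(climbing), elbo_converged(flat))
```

With stochastic gradients the ELBO estimate is noisy, so in practice the history is often smoothed (for example with a moving average) before applying a rule like this, and a passing check still needs the held-out validation listed above.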

Key takeaways

  • VI turns Bayesian inference into optimization over a tractable surrogate distribution.
  • The ELBO balances data fit (expected log-likelihood) against prior regularization (KL term).
  • Mean-field CAVI scales to millions of observations when posterior coupling is mild.
  • Black-box VI with reparameterization gradients powers VAEs and modern probabilistic deep learning.
  • Validate VI with MCMC spot checks when decisions depend on calibrated uncertainty.

Related reading