Variational inference explained
Harbor Analytics monitors topic drift in 2.4 million customer-support tickets
per quarter. A latent Dirichlet allocation (LDA) model with 40 topics and
hierarchical priors captures emerging product complaints, but the joint
posterior over topic proportions and word distributions has no closed form.
A full MCMC refresh took 18 hours on a 32-core box, too slow for hourly dashboards.
Variational inference (VI) reframes posterior approximation as
optimization: pick a tractable family of distributions q(\theta),
measure its divergence from the true posterior p(\theta \mid D), and
minimize that divergence by maximizing a lower bound on the log evidence
called the evidence lower bound (ELBO). Harbor’s VI pipeline
converges in 11 minutes with topic rankings that match MCMC on 94% of
top-10 terms (Kendall’s τ = 0.91). This guide explains why VI
trades exactness for speed, builds intuition for the ELBO, covers mean-field
coordinate-ascent updates and black-box gradients via the reparameterization
trick, walks through the Harbor LDA refactor, contrasts VI with MCMC and
conjugate methods, lists pitfalls, and ends with a practitioner checklist.
The optimization view of approximate Bayes
Exact Bayesian inference computes
p(\theta \mid D) = p(D \mid \theta)\, p(\theta) / p(D).
When the integral p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
is intractable, you have two broad strategies:
- Sampling (MCMC): generate correlated draws whose empirical distribution approaches the posterior. Accurate but can be slow, especially in high dimensions or with tight coupling between parameters.
- Optimization (VI): posit a simpler distribution q_\lambda(\theta) parameterized by \lambda, and find the \lambda that makes q closest to p(\theta \mid D) under a chosen divergence.
VI does not produce exact posterior samples. It produces a surrogate you can evaluate, differentiate through, and refresh cheaply. That makes it the default engine inside large-scale topic models, variational autoencoders, and many probabilistic programming pipelines where MCMC latency would block production.
ELBO: the objective you actually maximize
VI typically minimizes Kullback-Leibler (KL) divergence from
q to the posterior:
\mathrm{KL}(q \,\|\, p(\cdot \mid D)). Direct minimization
still requires p(D), but algebra rearranges the problem into
maximizing the ELBO:
\mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda}[\log p(D \mid \theta)] - \mathrm{KL}(q_\lambda(\theta) \,\|\, p(\theta))
The ELBO decomposes into two interpretable terms:
- Expected log-likelihood \mathbb{E}_q[\log p(D \mid \theta)]: how well parameters drawn from q explain the data.
- Prior penalty \mathrm{KL}(q \,\|\, p(\theta)): how far the approximate posterior drifts from the prior; acts like regularization.
Because \log p(D) = \mathcal{L}(\lambda) + \mathrm{KL}(q \,\|\, p(\cdot \mid D))
and KL is non-negative, the ELBO is a lower bound on the log marginal
likelihood \log p(D). Since \log p(D) does not depend on \lambda,
maximizing the ELBO is equivalent to minimizing the KL divergence, so it
tightens the bound while pulling q toward the true posterior. You never
compute p(D) explicitly, only unnormalized terms
\log p(D \mid \theta) + \log p(\theta) inside expectations over q.
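The identity can be checked numerically in a model where every term has a closed form. The following is a minimal sketch, assuming a hypothetical Normal-Normal model (prior \theta \sim N(0, 1), likelihood x_i \sim N(\theta, 1)) and synthetic data; the function names and numbers are illustrative and are not part of Harbor's pipeline. It confirms that the gap between \log p(D) and the ELBO equals the KL divergence from q to the exact posterior.

```python
import numpy as np

def gaussian_kl(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) in closed form."""
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2 * s_p**2) - 0.5

def elbo(x, m, s, mu0=0.0, tau0=1.0, sigma=1.0):
    """ELBO for x_i ~ N(theta, sigma^2), theta ~ N(mu0, tau0^2), q = N(m, s^2)."""
    # E_q[(x_i - theta)^2] = (x_i - m)^2 + s^2
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - ((x - m) ** 2 + s**2) / (2 * sigma**2))
    return exp_loglik - gaussian_kl(m, s, mu0, tau0)

def exact_posterior(x, mu0=0.0, tau0=1.0, sigma=1.0):
    """Conjugate posterior mean and standard deviation for theta."""
    prec = 1 / tau0**2 + len(x) / sigma**2
    mean = (mu0 / tau0**2 + x.sum() / sigma**2) / prec
    return mean, np.sqrt(1 / prec)

def log_evidence(x, mu0=0.0, tau0=1.0, sigma=1.0):
    """log p(D): marginally x ~ N(mu0 * 1, sigma^2 I + tau0^2 * 1 1^T)."""
    n = len(x)
    cov = sigma**2 * np.eye(n) + tau0**2 * np.ones((n, n))
    resid = x - mu0
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=20)          # synthetic data
m_post, s_post = exact_posterior(x)
log_pD = log_evidence(x)

m, s = 0.3, 0.8                            # an arbitrary, deliberately poor q
gap = log_pD - elbo(x, m, s)
kl_to_post = gaussian_kl(m, s, m_post, s_post)
print(f"log p(D) - ELBO = {gap:.6f},  KL(q || posterior) = {kl_to_post:.6f}")
print(f"ELBO at exact posterior = {elbo(x, m_post, s_post):.6f},  log p(D) = {log_pD:.6f}")
```

When q is set to the exact posterior the KL term vanishes and the ELBO equals \log p(D), which is why a rising ELBO is read as q moving toward the posterior.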
Mean-field variational inference
The simplest variational family assumes factorization (mean-field):
q(\theta) = \prod_{j=1}^{J} q_j(\theta_j)
Each factor q_j is chosen from a convenient exponential family
(Gaussian, Gamma, Dirichlet) so expectations and entropies are analytic.
Mean-field VI ignores posterior correlations between parameters. That is a
severe approximation when parameters interact strongly (e.g. hierarchical
shrinkage), but it is often an acceptable price for scalability.
Coordinate ascent VI (CAVI) optimizes one factor at a time holding others fixed. For conjugate-exponential models, each update has a closed form — the algorithm alternates updates until the ELBO plateaus. Classic examples: Bayesian mixture models, LDA (where CAVI is the original scalable training algorithm), and naive Bayes with unknown class priors.
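To make the alternating closed-form updates concrete, here is a minimal sketch of CAVI on a textbook conjugate model rather than LDA: Gaussian data with unknown mean \mu and precision \tau, and a factorized q(\mu) q(\tau). The model, the prior defaults, and the function name cavi_normal are assumptions chosen for illustration. For brevity the loop stops when E_q[\tau] stabilizes instead of tracking the ELBO itself.

```python
import numpy as np

def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, max_iter=100, tol=1e-10):
    """CAVI for x_i ~ N(mu, 1/tau) with q(mu, tau) = q(mu) q(tau).

    Prior (hypothetical defaults): mu | tau ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
    Factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
    """
    n = len(x)
    mu_n = (lam0 * mu0 + x.sum()) / (lam0 + n)  # does not depend on q(tau)
    a_n = a0 + 0.5 * (n + 1)                    # does not depend on q(mu)
    e_tau = a0 / b0                             # initial guess for E_q[tau]
    for _ in range(max_iter):
        # Update q(mu) given the current E_q[tau].
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given q(mu): expected quadratic terms under q(mu).
        e_sq_data = np.sum((x - mu_n) ** 2) + n / lam_n
        e_sq_prior = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (e_sq_data + e_sq_prior)
        e_tau_new = a_n / b_n
        converged = abs(e_tau_new - e_tau) < tol
        e_tau = e_tau_new
        if converged:
            break
    return {"mu_n": mu_n, "lam_n": (lam0 + n) * e_tau, "a_n": a_n, "b_n": b_n}

rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=500)   # synthetic data: mean 2, precision 4
fit = cavi_normal(x)
print("E_q[mu]  =", fit["mu_n"])
print("E_q[tau] =", fit["a_n"] / fit["b_n"])
```

Each pass updates one factor using expectations under the other, which is the same pattern CAVI follows for the Dirichlet factors in LDA, only with different sufficient statistics.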
When mean-field breaks
Strong coupling (such as funnel-shaped hierarchical priors) and
multimodal posteriors both defeat mean-field assumptions. Because
\mathrm{KL}(q \,\|\, p) is mode-seeking, a unimodal Gaussian q
usually locks onto a single posterior mode and ignores the others, which
underestimates uncertainty; a q forced to sit between two modes
would instead average incompatible explanations. In those cases, richer
families (full-rank Gaussian, normalizing flows) or MCMC validation become
necessary.
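The cost of ignoring coupling can be quantified in the one case where the answer is known exactly. For a Gaussian target, the factorized Gaussian that minimizes \mathrm{KL}(q \,\|\, p) has variances equal to the reciprocals of the diagonal of the target's precision matrix. The sketch below assumes a hypothetical two-parameter Gaussian posterior with unit marginal variances and correlation rho, and shows how the mean-field variance shrinks as the correlation grows.

```python
import numpy as np

def mean_field_variances(cov):
    """Variances of the KL(q||p)-optimal factorized Gaussian for a N(mu, cov) target."""
    precision = np.linalg.inv(cov)
    return 1.0 / np.diag(precision)

for rho in (0.0, 0.5, 0.9, 0.99):
    cov = np.array([[1.0, rho], [rho, 1.0]])   # true marginal variances are 1
    mf_var = mean_field_variances(cov)
    print(f"rho={rho:4.2f}  true marginal var=1.0000  mean-field var={mf_var[0]:.4f}")
```

In this two-dimensional case the mean-field variance is 1 - rho^2, so at a correlation of 0.9 the approximation reports roughly a fifth of the true marginal variance.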
Black-box VI and the reparameterization trick
When closed-form CAVI updates do not exist — neural network weights,
deep generative models, non-conjugate likelihoods —
black-box VI estimates gradients of the ELBO with Monte
Carlo samples from q_\lambda.
The reparameterization trick writes
\theta = g(\lambda, \epsilon) where
\epsilon \sim p(\epsilon) is noise independent of
\lambda. Then:
\nabla_\lambda \mathbb{E}_{q_\lambda}[f(\theta)] = \mathbb{E}_\epsilon[\nabla_\lambda f(g(\lambda, \epsilon))]
Low-variance gradient estimates enable stochastic gradient ascent on
\lambda with minibatches — the same machinery behind
variational autoencoders (VAEs)
and modern ADVI implementations in Stan, PyMC, and TensorFlow Probability.
When reparameterization is impossible (discrete latents), score-function
(REINFORCE) gradient estimators work but carry higher variance.
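The variance difference between the two estimators is easy to see on a toy objective. This is a minimal sketch, assuming q_\lambda = N(\mu, \sigma^2) and the hypothetical test function f(\theta) = \theta^2, for which the exact gradient of \mathbb{E}_q[f] is (2\mu, 2\sigma). Gradients are written out by hand in NumPy; a real pipeline would normally rely on automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                  # variational parameters lambda = (mu, sigma)
eps = rng.standard_normal(100_000)    # noise that does not depend on lambda
theta = mu + sigma * eps              # theta = g(lambda, eps)

# Reparameterization: differentiate f(g(lambda, eps)) = (mu + sigma*eps)^2 through g.
reparam_mu = 2 * theta                # d/d mu
reparam_sigma = 2 * theta * eps       # d/d sigma

# Score function (REINFORCE): f(theta) * grad_lambda log q_lambda(theta).
score_mu = theta**2 * (theta - mu) / sigma**2
score_sigma = theta**2 * ((theta - mu) ** 2 / sigma**3 - 1.0 / sigma)

print("exact gradient      :", (2 * mu, 2 * sigma))
print("reparam mean / var  :", (reparam_mu.mean(), reparam_sigma.mean()),
      (reparam_mu.var(), reparam_sigma.var()))
print("score-fn mean / var :", (score_mu.mean(), score_sigma.mean()),
      (score_mu.var(), score_sigma.var()))
```

Both estimators are unbiased, but the per-sample variance of the score-function estimate is much larger here, so it needs many more samples (or a control variate) to reach the same gradient accuracy.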
Harbor Analytics LDA refactor
Harbor’s support corpus: 2.4M tickets, vocabulary size 28,000 after
pruning, 40 topics, symmetric Dirichlet priors
\alpha = 0.1 (document-topic) and \eta = 0.01
(topic-word). The generative story: each document draws a topic mixture
\theta_d \sim \mathrm{Dir}(\alpha); each word draws a topic
assignment z_{dn} then a word from
\phi_k \sim \mathrm{Dir}(\eta).
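The generative story translates almost line for line into a sampler. The sketch below uses Harbor's stated prior values (\alpha = 0.1, \eta = 0.01) but a deliberately tiny, hypothetical corpus (5 documents, 4 topics, 50 vocabulary ids, fixed document length); the function name and sizes are illustrative assumptions, not Harbor's code.

```python
import numpy as np

def sample_lda_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, eta, seed=0):
    """Draw a synthetic corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Topic-word distributions: phi_k ~ Dir(eta), one row per topic.
    phi = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    # Document-topic mixtures: theta_d ~ Dir(alpha), one row per document.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs = []
    for d in range(n_docs):
        # Topic assignment z_dn for every word position in document d.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Each word is drawn from the distribution of its assigned topic.
        words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(words)
    return theta, phi, docs

theta, phi, docs = sample_lda_corpus(
    n_docs=5, doc_len=30, n_topics=4, vocab_size=50, alpha=0.1, eta=0.01
)
print("topic mixture of document 0:", np.round(theta[0], 3))
print("first ten word ids of document 0:", docs[0][:10])
```

With concentrations this far below 1, most documents put nearly all their mass on one or two topics and each topic concentrates on a handful of words, which is the sparsity the priors are meant to encode.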
MCMC baseline: collapsed Gibbs sampling on a 500K-ticket subset ran 18 hours for 2,000 post-burn-in iterations. Full-corpus runs were infeasible for hourly refresh.
VI refactor: mean-field CAVI with Dirichlet factors for
\theta_d and \phi_k, plus stochastic subsampling
of documents per epoch. Implementation in a probabilistic programming stack
with automatic ELBO tracking. Results on held-out 200K tickets:
- Runtime: 11 minutes on the same hardware (about 98x faster than the 18-hour MCMC run on the 500K-ticket subset).
- Perplexity: 412 (VI) vs 398 (MCMC) — 3.5% gap, acceptable for monitoring dashboards.
- Top-10 topic words: Kendall’s τ = 0.91 vs MCMC reference.
- Alert precision on emerging “billing dispute” spike: 0.87 (VI) vs 0.89 (MCMC) — operational decisions unchanged.
Harbor runs VI hourly for drift detection and retrains MCMC weekly on a stratified sample for audit calibration — a common production pattern when speed and fidelity trade off.
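The headline ratios follow directly from the reported figures. A quick arithmetic check, using only the numbers quoted above (variable names are illustrative):

```python
mcmc_minutes = 18 * 60      # MCMC baseline: 18 hours
vi_minutes = 11             # VI refactor: 11 minutes
speedup = mcmc_minutes / vi_minutes

perplexity_vi, perplexity_mcmc = 412, 398
relative_gap = (perplexity_vi - perplexity_mcmc) / perplexity_mcmc

print(f"speedup: {speedup:.1f}x")
print(f"perplexity gap: {relative_gap:.1%}")
```

Note that the speedup compares the VI runtime against an MCMC run on a 500K-ticket subset, so it is a rough indication rather than a like-for-like benchmark.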
Method decision table
| Method | When to prefer it | Cost / fidelity |
|---|---|---|
| Conjugate closed form | Beta-Binomial, Normal-Normal, small discrete state | Exact, instant; model scope limited |
| Mean-field / CAVI VI | Large data, exponential-family models, production refresh | Fast; underestimates variance, misses multimodality |
| Black-box VI (ADVI, VAE) | Neural generative models, non-conjugate likelihoods | Scales with SGD; local optima, mode-seeking KL |
| MCMC (HMC/NUTS) | Low-dimensional, high-stakes inference, audit trail | Gold-standard asymptotics; slow, tuning burden |
| Particle filters | Sequential state estimation, multimodal tracking | Online; particle degeneracy in high dimensions |
| Gaussian processes | Small-n regression with calibrated uncertainty | Exact posterior over functions under a Gaussian likelihood; cubic in n |
A practical workflow: start with VI for speed, spot-check critical parameters with short MCMC runs or simulation-based calibration, and escalate to full MCMC when decisions are irreversible or regulatory.
Common pitfalls
- Mean-field correlation blindness: when posterior uncertainty on one parameter depends on another, a factorized q shrinks credible intervals. Use full-rank Gaussian or structured VI when coupling is known.
- KL direction matters: minimizing \mathrm{KL}(q \,\|\, p) is mode-seeking (covers one peak); \mathrm{KL}(p \,\|\, q) is mass-covering but harder to optimize. Know which your library uses.
- Local ELBO optima: random restarts and learning-rate schedules help; compare ELBO across runs and inspect posterior predictive checks.
- Underestimated tails: VI with Gaussian q on heavy-tailed posteriors produces overconfident predictions. Student-t factors or MCMC spot checks catch this.
- Ignoring prior sensitivity: the KL term pulls toward the prior; weak priors plus aggressive mean-field can yield improper-like behavior. Always run prior predictive simulations.
- Equating ELBO with model quality: higher ELBO means better variational fit, not necessarily better out-of-sample prediction. Hold out data for perplexity, log-score, or decision metrics.
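The KL-direction pitfall is easiest to see on a one-dimensional example. The sketch below assumes a hypothetical bimodal target (an equal mixture of N(-3, 1) and N(3, 1)) and fits a single Gaussian q in both directions. The reverse direction \mathrm{KL}(q \,\|\, p) is minimized by brute-force grid search with numerical integration; the forward direction \mathrm{KL}(p \,\|\, q) uses the known result that, for a Gaussian q, it is minimized by matching the mean and variance of p.

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 8001)   # integration grid
dx = x[1] - x[0]

def log_normal(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((x - m) / s) ** 2

# Hypothetical bimodal "posterior": 0.5 * N(-3, 1) + 0.5 * N(3, 1).
log_p = np.logaddexp(log_normal(x, -3.0, 1.0), log_normal(x, 3.0, 1.0)) + np.log(0.5)
p = np.exp(log_p)

def reverse_kl(m, s):
    """KL(q || p) for q = N(m, s^2), by a Riemann sum on the grid."""
    log_q = log_normal(x, m, s)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

# Reverse KL (what standard VI minimizes): grid search over mean and scale.
means = np.linspace(-5.0, 5.0, 41)
scales = np.linspace(0.3, 4.0, 38)
grid = np.array([[reverse_kl(m, s) for s in scales] for m in means])
i, j = np.unravel_index(np.argmin(grid), grid.shape)
m_rev, s_rev = means[i], scales[j]

# Forward KL: moment matching.
m_fwd = np.sum(x * p) * dx
s_fwd = np.sqrt(np.sum((x - m_fwd) ** 2 * p) * dx)

print(f"reverse KL fit: mean={m_rev:.2f}, sd={s_rev:.2f}")
print(f"forward KL fit: mean={m_fwd:.2f}, sd={s_fwd:.2f}")
```

The reverse-KL fit settles on one of the two peaks with a narrow spread, while the forward-KL fit sits between them with a spread wide enough to cover both. Neither is a faithful summary of a bimodal posterior, which is why multimodality calls for a richer family or an MCMC check.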
Practitioner checklist
- Write the generative model and identify which parameters need approximate posteriors.
- Check for conjugate structure before reaching for VI or MCMC.
- Choose variational family: mean-field, full-rank Gaussian, or flow-based.
- Track ELBO each iteration; plateau signals convergence or learning-rate issues.
- Compare VI posteriors to short MCMC on a subset for calibration.
- Run posterior predictive checks on held-out data, not just in-sample ELBO.
- Report uncertainty honestly: VI intervals are often too narrow.
- For discrete latents, plan score-function gradients or continuous relaxations.
- Document refresh cadence and when full MCMC audit runs occur.
- Version model priors and VI seeds so dashboards are reproducible.
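For the ELBO-tracking item, one simple heuristic is to compare the average ELBO over the most recent window of iterations with the window before it. This is a sketch under stated assumptions: the function name elbo_plateaued, the window size, and the tolerance are hypothetical choices, and the ELBO trace is synthetic.

```python
import numpy as np

def elbo_plateaued(history, window=10, rel_tol=1e-4):
    """Return True when the mean ELBO of the last `window` iterations differs
    from the mean of the preceding `window` by less than `rel_tol` (relative)."""
    if len(history) < 2 * window:
        return False  # not enough iterations to compare two windows
    recent = np.mean(history[-window:])
    previous = np.mean(history[-2 * window:-window])
    return abs(recent - previous) <= rel_tol * max(abs(previous), 1e-12)

# Synthetic ELBO trace that rises quickly and then flattens near -500.
history = -500.0 - 1000.0 * np.exp(-0.1 * np.arange(300))
print("after 20 iterations :", elbo_plateaued(history[:20]))
print("after 300 iterations:", elbo_plateaued(history))
```

Averaging over windows makes the check less sensitive to minibatch noise than comparing consecutive iterations. A plateau by this criterion still says nothing about whether the optimum is a good one, so it complements rather than replaces the MCMC and predictive checks in the list.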
Key takeaways
- VI turns Bayesian inference into optimization over a tractable surrogate distribution.
- The ELBO balances data fit (expected log-likelihood) against prior regularization (KL term).
- Mean-field CAVI scales to millions of observations when posterior coupling is mild.
- Black-box VI with reparameterization gradients powers VAEs and modern probabilistic deep learning.
- Validate VI with MCMC spot checks when decisions depend on calibrated uncertainty.