
Bayesian inference explained

Your landing page converts 8 out of 100 visitors. A frequentist report says “8% with a 95% confidence interval of roughly 2.7% to 13.3%” (normal approximation). That is useful, but it treats the true conversion rate as a fixed unknown. Bayesian inference instead asks: given what I believed before and what I just observed, what is the full distribution of plausible conversion rates now? The answer is a posterior distribution you can update incrementally as traffic arrives, use for decision-making under uncertainty, and feed into bandit algorithms like Thompson sampling. Bayesian thinking powers spam filters (Naive Bayes), A/B test analysis with early stopping, Bayesian neural networks, and modern probabilistic programming. This guide covers Bayes' theorem, priors and posteriors, conjugate shortcuts, when you need MCMC or variational inference, how Bayesian credible intervals differ from frequentist confidence intervals, connections to A/B testing and logistic regression, a worked conversion-rate example, a decision table, common pitfalls, and a practitioner checklist.

What Bayesian inference does

Classical (frequentist) statistics estimates parameters from data and constructs intervals with long-run coverage guarantees. Bayesian statistics treats parameters as random variables with distributions that encode uncertainty. You start with a prior — what you believe before seeing data — multiply by the likelihood of the observed data given each parameter value, and normalize to get the posterior.

The core formula is Bayes' theorem:

P(θ | data) ∝ P(data | θ) × P(θ)

Posterior is proportional to likelihood times prior. The posterior becomes the prior for the next batch of data — natural sequential updating that matches how product teams actually operate (ship, measure, revise).
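
The proportionality above can be checked numerically. The following is a minimal sketch, assuming a discretized grid of candidate conversion rates and a flat prior; the function name grid_posterior and the batch split (3 of 50 visitors, then 5 of 50) are illustrative choices, not part of the original example. It multiplies likelihood by prior, normalizes, and then shows that updating in two batches lands on the same posterior as updating once with all the data.

```python
def grid_posterior(prior, successes, failures, grid):
    """Return normalized posterior weights on `grid` for Bernoulli data."""
    # Unnormalized posterior = likelihood * prior at each candidate theta.
    unnorm = [
        (theta ** successes) * ((1 - theta) ** failures) * p
        for theta, p in zip(grid, prior)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Candidate conversion rates (midpoints, so 0 and 1 are never hit exactly).
grid = [(i + 0.5) / 1000 for i in range(1000)]
flat_prior = [1 / len(grid)] * len(grid)

# Route 1: a single update with all 100 visitors (8 conversions, 92 misses).
all_at_once = grid_posterior(flat_prior, 8, 92, grid)

# Route 2: two batches; the first posterior becomes the prior for the second.
after_batch_1 = grid_posterior(flat_prior, 3, 47, grid)
after_batch_2 = grid_posterior(after_batch_1, 5, 45, grid)

posterior_mean = sum(t * w for t, w in zip(grid, after_batch_2))
max_gap = max(abs(a - b) for a, b in zip(all_at_once, after_batch_2))
print(f"posterior mean ~ {posterior_mean:.4f}")
print(f"largest difference between the two routes: {max_gap:.2e}")
```

A grid like this only works for one or two parameters, but it makes the mechanics visible before any conjugate shortcut or sampler is introduced.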

Bayesian inference does not mean “subjective guessing.” Priors can be weak (barely influence results with enough data) or informative (encode domain knowledge when data is scarce). The framework forces you to state assumptions explicitly instead of hiding them in methodological defaults.

Priors: weak, informative, and conjugate

Uninformative / weak priors spread probability mass broadly so data dominates. A Beta(1, 1) prior on a conversion rate is uniform over [0, 1]. With 8/100 conversions, the posterior is Beta(9, 93) — easy to compute without sampling.

Informative priors encode real beliefs: “similar pages convert around 5%” might be Beta(5, 95). With little data, the prior pulls estimates toward 5%; with thousands of observations, data overwhelms the prior. Document priors in experiment specs so stakeholders understand what was assumed.

Conjugate priors are mathematically paired with likelihoods so the posterior has the same functional form as the prior — closed-form updates without MCMC:

  • Bernoulli/Binomial likelihood + Beta prior → Beta posterior
  • Poisson likelihood + Gamma prior → Gamma posterior
  • Normal likelihood with known variance + Normal prior → Normal posterior
  • Multinomial likelihood + Dirichlet prior → Dirichlet posterior

Conjugate pairs are the fast path for dashboards, bandits, and any online system that must update beliefs in milliseconds.
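
Each of the four pairs above reduces to a few additions. The sketch below assumes the rate parameterization for the Gamma distribution and a known observation variance for the Normal case; the function names and the example numbers (other than the 8-in-100 conversion data) are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + Binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures


def gamma_poisson_update(shape, rate, total_count, n_intervals):
    """Gamma(shape, rate) prior + Poisson counts -> Gamma posterior parameters."""
    return shape + total_count, rate + n_intervals


def normal_known_variance_update(prior_mean, prior_var, data_var, observations):
    """Normal prior on the mean + Normal data with known variance."""
    n = len(observations)
    # Precisions (inverse variances) add; the mean is precision-weighted.
    post_precision = 1 / prior_var + n / data_var
    post_mean = (prior_mean / prior_var + sum(observations) / data_var) / post_precision
    return post_mean, 1 / post_precision


def dirichlet_multinomial_update(alphas, counts):
    """Dirichlet(alphas) prior + Multinomial counts -> Dirichlet posterior."""
    return [a + c for a, c in zip(alphas, counts)]


# Flat prior on a conversion rate, then 8 conversions and 92 non-conversions.
print(beta_binomial_update(1, 1, 8, 92))
# Hypothetical: 12 events observed over 4 time intervals.
print(gamma_poisson_update(2, 1, 12, 4))
# Hypothetical: one observation of 2.0 with unit data variance.
print(normal_known_variance_update(0.0, 1.0, 1.0, [2.0]))
# Hypothetical: three categories with counts 3, 0, 2.
print(dirichlet_multinomial_update([1, 1, 1], [3, 0, 2]))
```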

Computing posteriors: analytical, MCMC, and variational

Closed form

When conjugacy applies, compute the posterior directly. Beta-Binomial updates are how many production Thompson-sampling bandits maintain arm beliefs without a sampling library.
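
As an illustration, here is a minimal Thompson-sampling loop that keeps one Beta belief per arm and needs only the standard library. It is a sketch under stated assumptions: the true conversion rates (4% and 8%) are made up for the simulation, rewards are simulated rather than observed, and the function name thompson_sampling is hypothetical.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate a Bernoulli bandit with Beta(1, 1) priors on every arm."""
    rng = random.Random(seed)
    beliefs = [[1, 1] for _ in true_rates]  # [alpha, beta] per arm
    pulls = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Draw one plausible rate per arm and play the arm with the largest draw.
        draws = [rng.betavariate(a, b) for a, b in beliefs]
        arm = draws.index(max(draws))
        # Simulated reward; a live system would use the observed conversion.
        converted = rng.random() < true_rates[arm]
        # Conjugate update: a success bumps alpha, a failure bumps beta.
        beliefs[arm][0 if converted else 1] += 1
        pulls[arm] += 1
    return beliefs, pulls


beliefs, pulls = thompson_sampling(true_rates=[0.04, 0.08], n_rounds=5000)
print("pulls per arm:", pulls)
print("posterior parameters per arm:", beliefs)
```

The per-request cost is one Beta draw per arm plus an integer increment, which is why this pattern suits latency-sensitive systems.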

Markov Chain Monte Carlo (MCMC)

For complex models — hierarchical regression, Bayesian neural nets, spatial models — the posterior has no closed form. MCMC (Metropolis-Hastings, Hamiltonian Monte Carlo in Stan/PyMC) draws samples from the posterior by constructing a Markov chain whose stationary distribution is the target. Diagnostics matter: check R-hat, effective sample size, and trace plots for convergence. MCMC is accurate but can be slow on large datasets.
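
To make the idea concrete, the sketch below runs a random-walk Metropolis sampler (the simplest special case of Metropolis-Hastings) on the conversion-rate posterior with a flat prior. This target has a closed form, Beta(9, 93), so the sampler is unnecessary here; it is used only because the known answer makes the output easy to verify. The step size, burn-in length, and starting point are arbitrary assumptions, and real models would rely on Stan or PyMC rather than hand-written code.

```python
import math
import random


def log_posterior(theta, successes=8, failures=92, alpha=1.0, beta=1.0):
    """Unnormalized log posterior: Binomial likelihood times Beta prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return ((successes + alpha - 1) * math.log(theta)
            + (failures + beta - 1) * math.log(1 - theta))


def metropolis(n_samples, step=0.03, start=0.5, burn_in=1000, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta, current = start, log_posterior(start)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:  # discard the warm-up portion of the chain
            samples.append(theta)
    return samples


samples = metropolis(20_000)
sample_mean = sum(samples) / len(samples)
print(f"MCMC posterior mean ~ {sample_mean:.4f} (exact: {9 / 102:.4f})")
```

A single chain like this cannot compute R-hat, which compares several chains; production workflows run multiple chains from dispersed starting points for that reason.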

Variational inference (VI)

VI approximates the posterior with a simpler distribution (often a factorized Gaussian) by maximizing a lower bound on the log evidence, the ELBO. It is faster than MCMC and scales to big data and deep models, but it introduces approximation bias because the true posterior may lie outside the chosen family. Variational autoencoders use this idea for latent representations; see VAEs explained for the generative angle.
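
The bound that VI optimizes follows from a standard identity, written here with q(θ) as the approximating distribution (the symbol q is notation introduced for this sketch):

```latex
\log P(\text{data})
  = \underbrace{\mathbb{E}_{q(\theta)}\bigl[\log P(\text{data}, \theta) - \log q(\theta)\bigr]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\bigl(q(\theta) \,\|\, P(\theta \mid \text{data})\bigr)
```

Because the KL term is never negative, the ELBO is a lower bound on the log evidence, and maximizing it over q is the same as minimizing the KL divergence from q to the true posterior.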

Credible intervals vs confidence intervals

A 95% credible interval means: given the model, prior, and data, there is a 95% probability the true parameter lies in the interval. Directly interpretable for decision-makers.

A 95% confidence interval (frequentist) means: if you repeated the experiment many times, 95% of constructed intervals would contain the true parameter. The parameter is fixed; the interval is random. That subtle distinction confuses stakeholders — Bayesian credible intervals often communicate uncertainty more clearly in product contexts.

With flat priors and large samples, Bayesian and frequentist intervals often numerically agree. They diverge with small samples, informative priors, or sequential peeking — exactly when Bayesian updating shines.
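
The small-sample divergence and large-sample agreement can be seen side by side. This sketch assumes a normal-approximation (Wald) confidence interval on the frequentist side and an equal-tailed credible interval under a flat Beta(1, 1) prior on the Bayesian side; the 800-in-10,000 case is a hypothetical scale-up of the same 8% rate, and the Beta quantiles come from a simple midpoint-rule integration rather than a statistics library.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def beta_credible_interval(alpha, beta, mass=0.95, n_grid=100_000):
    """Equal-tailed interval for Beta(alpha, beta) via a midpoint-rule CDF."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    width = 1.0 / n_grid
    tail = (1 - mass) / 2
    cdf, lower, upper = 0.0, None, None
    for i in range(n_grid):
        x = (i + 0.5) * width
        log_pdf = log_norm + (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x)
        cdf += math.exp(log_pdf) * width
        if lower is None and cdf >= tail:
            lower = x
        if cdf >= 1 - tail:
            upper = x
            break
    return lower, upper


for successes, n in [(8, 100), (800, 10_000)]:
    ci = wald_interval(successes, n)
    # Flat Beta(1, 1) prior, so the posterior is Beta(1 + s, 1 + n - s).
    cred = beta_credible_interval(1 + successes, 1 + n - successes)
    print(f"n={n}: Wald CI = ({ci[0]:.4f}, {ci[1]:.4f}), "
          f"credible = ({cred[0]:.4f}, {cred[1]:.4f})")
```

At n = 100 the credible interval sits visibly to the right of the Wald interval because the Beta posterior is skewed; at n = 10,000 the two nearly coincide.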

Bayesian methods in machine learning

  • Naive Bayes classifiers — apply Bayes' theorem with a conditional independence assumption; see the dedicated Naive Bayes guide.
  • Bayesian logistic regression — posterior over weights instead of point estimates; useful for uncertainty-aware classification and small-data regimes.
  • Thompson sampling — sample from each arm’s posterior to balance exploration and exploitation in bandits and personalization.
  • Bayesian optimization — model an expensive black-box function (hyperparameter loss) with a Gaussian process prior; pick the next evaluation by an acquisition function. Pairs naturally with hyperparameter tuning.
  • Bayesian model averaging — weight predictions by posterior model probability instead of picking a single winner.

Worked example: landing-page conversion rate

Prior: Beta(2, 38) — belief centered near 5% conversion (mean 2/40 = 0.05) but not dogmatic. Observed: 8 conversions in 100 trials (Binomial).

Posterior: Beta(2 + 8, 38 + 92) = Beta(10, 130). Posterior mean = 10/140 ≈ 7.1%. A 95% equal-tailed credible interval (the 2.5% and 97.5% quantiles of Beta(10, 130)) is roughly 3.5% to 11.9%.

Compare to a flat Beta(1, 1) prior: posterior Beta(9, 93), mean 8.8%, with a wider interval of roughly 4% to 15%. The informative prior pulled the estimate toward 5% because 100 observations is still a modest sample. As traffic grows to 10,000 visitors, both priors lead to nearly the same posterior: with enough data, the likelihood swamps a prior this weak, as it should.

Decision rule: if P(conversion > 6% | data) > 0.95, declare the page beats the 6% hurdle for paid acquisition. Bayesian inference gives that probability directly from the posterior CDF — no extra machinery.
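
The whole worked example fits in a few lines. This sketch approximates the posterior quantities by Monte Carlo draws from Beta(10, 130) using only the standard library; the seed and the number of draws are arbitrary assumptions, and a statistics library with a Beta CDF would give the same answers without sampling noise.

```python
import random

rng = random.Random(42)

# Prior and data from the worked example.
prior_alpha, prior_beta = 2, 38
conversions, visitors = 8, 100

# Conjugate update: add successes to alpha and failures to beta.
post_alpha = prior_alpha + conversions
post_beta = prior_beta + (visitors - conversions)

# Sorted posterior draws stand in for the exact Beta quantiles and CDF.
draws = sorted(rng.betavariate(post_alpha, post_beta) for _ in range(100_000))

posterior_mean = post_alpha / (post_alpha + post_beta)
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

# Decision rule: ship only if P(rate > 6% | data) exceeds 0.95.
prob_above_hurdle = sum(d > 0.06 for d in draws) / len(draws)
ship = prob_above_hurdle > 0.95

print(f"posterior: Beta({post_alpha}, {post_beta}), mean = {posterior_mean:.4f}")
print(f"95% credible interval ~ ({lower:.4f}, {upper:.4f})")
print(f"P(rate > 6% | data) ~ {prob_above_hurdle:.3f}, ship = {ship}")
```

With these inputs the probability should land well short of 0.95, so the rule would not yet declare that the page clears the hurdle; more traffic is needed before the posterior is that decisive.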

Bayesian vs frequentist: decision table

Scenario | Prefer Bayesian | Prefer frequentist
Small sample, need uncertainty quantification | Yes: posterior captures wide uncertainty naturally | Intervals can be unstable or uninformative
Sequential monitoring / early stopping | Yes: the posterior can be updated and read at every look, without p-value corrections | Requires careful sequential testing adjustments
Large RCT with fixed analysis plan | Either works; results often similar with flat priors | Standard tooling and regulatory familiarity
Online bandits / personalization | Yes: Thompson sampling is natively Bayesian | UCB and epsilon-greedy are frequentist alternatives
Complex hierarchical models | Yes: partial pooling via priors is natural | Mixed models exist but priors express structure cleanly
Stakeholders want “probability treatment wins” | Yes: posterior probabilities answer it directly | p-values do not answer that question directly

Common pitfalls

  • Hidden informative priors — using defaults in PyMC or Stan without reading what they assume; always justify and sensitivity-test priors.
  • Confusing prior with data — reporting posterior means without showing how much the prior vs likelihood contributed.
  • MCMC without diagnostics — trusting samples from a chain that never converged.
  • Double-counting data — using the same observations to set the prior and compute the likelihood.
  • Ignoring computational cost — running full MCMC on every API request when a conjugate Beta update suffices.
  • Overstating objectivity — both paradigms make assumptions; Bayes makes priors explicit, frequentism embeds them in procedures.

Practitioner checklist

  • State the parameter of interest and choose a likelihood that matches the data-generating process (Binomial for conversions, Poisson for counts).
  • Pick a prior deliberately — document it in the experiment spec.
  • Run prior predictive checks: do simulated data from the prior look plausible?
  • Use conjugate updates when available before reaching for MCMC.
  • If sampling, verify convergence (R-hat, ESS, trace plots).
  • Report posterior mean, median, and credible intervals — not just a point estimate.
  • Sensitivity-test: rerun with a flat prior and an alternative informative prior; note how conclusions change.
  • For product decisions, translate posterior into actionable probabilities (P(rate > threshold | data)).
  • Pair with posterior predictive checks on held-out data.
  • Log prior, likelihood, and posterior versions for reproducibility.
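
Two of the checklist items, the prior predictive check and the sensitivity test, are cheap to automate. The sketch below assumes the Beta(2, 38) prior and 100-visitor sample from the worked example; the helper names, the number of simulations, and the alternative Beta(5, 95) prior used for comparison are illustrative choices.

```python
import random


def prior_predictive_conversions(alpha, beta, visitors, n_sims=2000, seed=0):
    """Simulate conversion counts implied by a Beta(alpha, beta) prior alone."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        rate = rng.betavariate(alpha, beta)  # draw a rate from the prior
        # Simulate one experiment of `visitors` Bernoulli trials at that rate.
        counts.append(sum(rng.random() < rate for _ in range(visitors)))
    return sorted(counts)


def posterior_means(priors, conversions, visitors):
    """Posterior mean under each candidate Beta prior (sensitivity test)."""
    return {
        (a, b): (a + conversions) / (a + b + visitors)
        for a, b in priors
    }


# Prior predictive check: are the simulated counts plausible for this page?
sims = prior_predictive_conversions(2, 38, visitors=100)
sim_mean = sum(sims) / len(sims)
low, high = sims[int(0.025 * len(sims))], sims[int(0.975 * len(sims))]
print(f"prior predictive: mean ~ {sim_mean:.1f} conversions per 100 visitors, "
      f"central 95% range {low} to {high}")

# Sensitivity test: flat prior vs. two informative priors centered near 5%.
means = posterior_means([(1, 1), (2, 38), (5, 95)], conversions=8, visitors=100)
for prior, mean in means.items():
    print(f"prior Beta{prior}: posterior mean = {mean:.4f}")
```

If the simulated range includes counts nobody on the team would consider possible, the prior needs revisiting before any data is analyzed; if the posterior means differ enough to change the decision, that dependence belongs in the report.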

Key takeaways

  • Bayesian inference updates a distribution over parameters — not just a point estimate — using prior beliefs and observed data.
  • Conjugate priors enable fast closed-form updates for common metrics like conversion rates and click-through rates.
  • MCMC and VI extend Bayesian methods to complex models when analytical posteriors do not exist.
  • Credible intervals answer “what range do we believe the parameter is in?” — often clearer for stakeholders than frequentist confidence intervals.
  • Bayesian thinking underpins bandits, A/B analysis with sequential data, and uncertainty-aware ML — know when the framework earns its complexity.
