Guide
Gaussian mixture models and the EM algorithm explained
Harbor Analytics segments 180,000 merchants by monthly transaction volume,
chargeback rate, and average ticket size to route risk reviews. A
K-means
model with K = 5 produced hard labels, but centroids drifted
when seasonal volume shifted and elliptical merchant cohorts were forced into
spherical partitions. Risk analysts wanted probabilities —
“72% High-Velocity Retail, 28% Seasonal Pop-up” — not brittle
winner-take-all assignments. A Gaussian mixture model (GMM)
treats each data point as drawn from one of K multivariate
Gaussian components with unknown mixing weights. The
expectation-maximization (EM) algorithm alternates between
estimating soft component memberships (E-step) and updating means, covariances,
and weights (M-step) until log-likelihood plateaus. Harbor’s 5-component
full-covariance GMM improved held-out log-likelihood by 22% over K-means and
stabilized segment dashboards across quarterly refreshes. This guide explains
the GMM generative story, derives EM intuition step by step, connects GMM to
K-means and
Baum-Welch EM in HMMs,
walks the Harbor merchant refactor, contrasts GMM with K-means, DBSCAN, and
variational inference,
lists pitfalls, and ends with a practitioner checklist.
What a GMM assumes about your data
A GMM posits that observations x_1, \ldots, x_n \in \mathbb{R}^d
are i.i.d. samples from a mixture density:
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
where \pi_k \ge 0, \sum_k \pi_k = 1, and each
component is a multivariate Gaussian with mean \mu_k and
covariance \Sigma_k. Latent indicator variables
z_i \in \{1,\ldots,K\} say which component generated point
x_i, but z_i is unobserved. Marginalizing over
z yields the mixture above.
GMMs serve three overlapping jobs:
- Soft clustering — posterior
p(z_i = k \mid x_i)gives membership probabilities instead of hard labels. - Density estimation — the fitted mixture is a
smooth probability model you can score, sample from, and use for anomaly
detection (low
p(x)flags outliers). - Feature of larger models — GMM emissions appear inside HMMs, speaker diarization, and some Bayesian pipelines as flexible observation models.
Unlike K-means, GMMs model covariance: elongated clusters at an angle are one component, not a distortion of spherical partitions.
EM: maximize likelihood with missing labels
Directly maximizing log-likelihood
\sum_i \log \sum_k \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
has no closed-form solution because the sum is inside the log. EM treats
latent z_i as missing data and iterates:
E-step (expectation)
Given current parameters
\theta = \{\pi_k, \mu_k, \Sigma_k\}, compute
responsibilities — the posterior probability that
point i belongs to component k:
\gamma_{ik} = p(z_i = k \mid x_i, \theta) =
\frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
Each row \gamma_{i\cdot} sums to 1. Hard clustering would take
\arg\max_k \gamma_{ik}; EM keeps the soft weights.
M-step (maximization)
Update parameters as weighted maximum-likelihood estimates:
N_k = \sum_i \gamma_{ik}(effective count in componentk)\pi_k \leftarrow N_k / n\mu_k \leftarrow \frac{1}{N_k} \sum_i \gamma_{ik} x_i\Sigma_k \leftarrow \frac{1}{N_k} \sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top
Each EM iteration is guaranteed to increase (or leave unchanged) the observed-data log-likelihood. Convergence is declared when improvement falls below a tolerance or iteration cap is hit. EM finds local optima, not global ones — multiple random initializations are standard practice.
Connection to K-means
K-means is EM on a GMM with equal spherical covariances
\Sigma_k = \sigma^2 I and hard assignments: the E-step
becomes nearest-centroid, the M-step updates centroids as cluster means.
GMM generalizes this with ellipses, mixing weights, and probabilistic
memberships.
Covariance parameterizations and regularization
Full \Sigma_k has d(d+1)/2 parameters per
component. In high dimensions with modest n, covariances
become singular or ill-conditioned. Common restrictions:
- Spherical —
\Sigma_k = \sigma_k^2 I; one variance per component (K-means-like). - Diagonal — axis-aligned ellipses; scales to
moderate
dwith fewer parameters. - Tied — all components share one
\Sigma; useful when clusters differ in location but not shape. - Full — arbitrary ellipses; needs enough data per component.
Add covariance regularization (e.g. \Sigma_k +
\epsilon I) to prevent collapse when a component receives few points.
Libraries like scikit-learn expose reg_covar for this reason.
Choosing K: BIC, AIC, and stability
More components always fit training data better, so you need a penalty for complexity:
- BIC (Bayesian Information Criterion) —
\mathrm{BIC} = -2 \log \hat{L} + p \log n, where\hat{L}is maximized likelihood andpcounts free parameters. Lower is better; favors parsimony. - AIC — replaces
\log nwith 2; less penalty, often picks more components.
Run EM from several random seeds for each candidate K, keep the
best log-likelihood, compare BIC on a held-out set. Also inspect
stability: if two seeds yield different segment semantics,
the model is under-identified. Silhouette scores on hard assignments
(\arg\max_k \gamma_{ik}) supplement likelihood but ignore
covariance structure.
Harbor Analytics merchant segmentation refactor
Problem. Risk dashboards grouped merchants with K-means on standardized features: log monthly volume, chargeback rate, ticket-size coefficient of variation, and days-since-onboarding. Quarterly re-fits permuted cluster IDs (label switching) and centroids jumped when holiday volume spiked, confusing downstream rules.
Approach. Replace K-means with a 5-component diagonal-covariance
GMM trained via EM (n = 180{,}000, d = 4).
Initialization: K-means++ centroids for first M-step. Ten random restarts;
keep highest training log-likelihood. Map components to business names by
inspecting \mu_k and top merchants by \gamma_{ik}.
Results. Held-out log-likelihood improved 22% versus
K-means at the same K. Soft assignments let rules fire at
\gamma_{ik} > 0.6 instead of brittle hard borders.
Component ordering anchored by \mu_k volume dimension reduced
label-switching pain across refreshes. Anomaly queue added merchants with
p(x_i) < 10^{-4} under the mixture — catching
synthetic shell patterns K-means missed.
Ops. Nightly batch refit on rolling 90-day window; BIC
sweep every quarter to revisit K. Full-covariance tested but
overfit with n_k < 500 on the smallest component; diagonal
stuck.
Method decision table
| Goal | Prefer | Why not GMM? |
|---|---|---|
| Fast hard partitions, spherical clusters | K-means | GMM overhead unjustified; no density scores |
Arbitrary shapes, noise points, no K |
DBSCAN / HDBSCAN | GMM needs K; struggles with varying density |
| Elliptical clusters + probabilities + density | GMM + EM | — |
| Sequential latent states over time | HMM with GMM emissions | Plain GMM ignores temporal structure |
Bayesian uncertainty on K and parameters |
Dirichlet-process mixture / VI | Frequentist GMM fixes K; point estimates only |
| High-dimensional embeddings (100+ dims) | PCA + GMM, or deep clustering | Full covariance explodes; EM unstable |
Common pitfalls
- Local optima: EM converges to different solutions per seed. Always run multiple restarts and compare log-likelihood.
- Singular covariances: a component collapsing onto a few
points yields non-invertible
\Sigma_k. Regularize or use diagonal/tied covariances. - Label switching: component indices are arbitrary across
runs. Anchor labels by sorted
\mu_kor overlap with a reference fit. - Unscaled features: EM is not scale-invariant when covariances are estimated. Standardize continuous inputs first (see feature scaling).
- Too many components in low
n: BIC helps, but empty components still waste iterations. Setmin_covar/ minimum weight thresholds. - Confusing soft labels with calibrated probabilities: GMM posteriors are model-based, not ground-truth class probabilities. Validate segments against business outcomes.
- Ignoring identifiability: overlapping Gaussians can be swapped without changing likelihood. Merge or regularize when components duplicate.
Practitioner checklist
- Standardize continuous features; inspect for heavy tails and outliers.
- Start with diagonal covariance unless
nper component is large. - Initialize with K-means++ or a small grid of seeds; keep best log-likelihood.
- Sweep
Kwith BIC on held-out data, not training likelihood alone. - Check effective counts
N_k— components below ~50 points need scrutiny. - Apply covariance regularization to avoid singular matrices.
- Name components by inspecting centroids and high-responsibility exemplars.
- Define a label-anchoring scheme before production refreshes.
- Expose soft weights to downstream systems that benefit from uncertainty.
- Log anomaly scores
p(x)for points below a tuned threshold.
Key takeaways
- GMMs model data as a weighted sum of Gaussian components with latent cluster assignments.
- EM alternates soft assignment (E-step) and weighted parameter updates (M-step), monotonically improving likelihood.
- K-means is a hard, spherical special case of EM on a constrained GMM.
- Covariance structure and regularization matter more than increasing K alone.
- Multiple random restarts and BIC on hold-out data keep GMM fits stable and interpretable.
Related reading
- Unsupervised learning and clustering explained — K-means, DBSCAN, and when hard partitions suffice
- Hidden Markov models explained — Baum-Welch EM for sequential latent states with GMM emissions
- Variational inference explained — Bayesian mixtures when you need posterior uncertainty on parameters
- Anomaly detection explained — scoring rare events with density models and isolation forests