Guide

Gaussian mixture models and the EM algorithm explained

Harbor Analytics segments 180,000 merchants by monthly transaction volume, chargeback rate, and average ticket size to route risk reviews. A K-means model with K = 5 produced hard labels, but centroids drifted when seasonal volume shifted and elliptical merchant cohorts were forced into spherical partitions. Risk analysts wanted probabilities — “72% High-Velocity Retail, 28% Seasonal Pop-up” — not brittle winner-take-all assignments. A Gaussian mixture model (GMM) treats each data point as drawn from one of K multivariate Gaussian components with unknown mixing weights. The expectation-maximization (EM) algorithm alternates between estimating soft component memberships (E-step) and updating means, covariances, and weights (M-step) until log-likelihood plateaus. Harbor’s 5-component full-covariance GMM improved held-out log-likelihood by 22% over K-means and stabilized segment dashboards across quarterly refreshes. This guide explains the GMM generative story, derives EM intuition step by step, connects GMM to K-means and Baum-Welch EM in HMMs, walks the Harbor merchant refactor, contrasts GMM with K-means, DBSCAN, and variational inference, lists pitfalls, and ends with a practitioner checklist.

What a GMM assumes about your data

A GMM posits that observations x_1, \ldots, x_n \in \mathbb{R}^d are i.i.d. samples from a mixture density:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

where \pi_k \ge 0, \sum_k \pi_k = 1, and each component is a multivariate Gaussian with mean \mu_k and covariance \Sigma_k. Latent indicator variables z_i \in \{1,\ldots,K\} say which component generated point x_i, but z_i is unobserved. Marginalizing over z yields the mixture above.

GMMs serve three overlapping jobs:

Soft clustering — posterior p(z_i = k \mid x_i) gives membership probabilities instead of hard labels.
Density estimation — the fitted mixture is a smooth probability model you can score, sample from, and use for anomaly detection (low p(x) flags outliers).
Feature of larger models — GMM emissions appear inside HMMs, speaker diarization, and some Bayesian pipelines as flexible observation models.

Unlike K-means, GMMs model covariance: elongated clusters at an angle are one component, not a distortion of spherical partitions.

EM: maximize likelihood with missing labels

Directly maximizing log-likelihood \sum_i \log \sum_k \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) has no closed-form solution because the sum is inside the log. EM treats latent z_i as missing data and iterates:

E-step (expectation)

Given current parameters \theta = \{\pi_k, \mu_k, \Sigma_k\}, compute responsibilities — the posterior probability that point i belongs to component k:

\gamma_{ik} = p(z_i = k \mid x_i, \theta) = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)} {\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

Each row \gamma_{i\cdot} sums to 1. Hard clustering would take \arg\max_k \gamma_{ik}; EM keeps the soft weights.

M-step (maximization)

Update parameters as weighted maximum-likelihood estimates:

N_k = \sum_i \gamma_{ik} (effective count in component k)
\pi_k \leftarrow N_k / n
\mu_k \leftarrow \frac{1}{N_k} \sum_i \gamma_{ik} x_i
\Sigma_k \leftarrow \frac{1}{N_k} \sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top

Each EM iteration is guaranteed to increase (or leave unchanged) the observed-data log-likelihood. Convergence is declared when improvement falls below a tolerance or iteration cap is hit. EM finds local optima, not global ones — multiple random initializations are standard practice.

Connection to K-means

K-means is EM on a GMM with equal spherical covariances \Sigma_k = \sigma^2 I and hard assignments: the E-step becomes nearest-centroid, the M-step updates centroids as cluster means. GMM generalizes this with ellipses, mixing weights, and probabilistic memberships.

Covariance parameterizations and regularization

Full \Sigma_k has d(d+1)/2 parameters per component. In high dimensions with modest n, covariances become singular or ill-conditioned. Common restrictions:

Spherical — \Sigma_k = \sigma_k^2 I; one variance per component (K-means-like).
Diagonal — axis-aligned ellipses; scales to moderate d with fewer parameters.
Tied — all components share one \Sigma; useful when clusters differ in location but not shape.
Full — arbitrary ellipses; needs enough data per component.

Add covariance regularization (e.g. \Sigma_k + \epsilon I) to prevent collapse when a component receives few points. Libraries like scikit-learn expose reg_covar for this reason.

Choosing K: BIC, AIC, and stability

More components always fit training data better, so you need a penalty for complexity:

BIC (Bayesian Information Criterion) — \mathrm{BIC} = -2 \log \hat{L} + p \log n, where \hat{L} is maximized likelihood and p counts free parameters. Lower is better; favors parsimony.
AIC — replaces \log n with 2; less penalty, often picks more components.

Run EM from several random seeds for each candidate K, keep the best log-likelihood, compare BIC on a held-out set. Also inspect stability: if two seeds yield different segment semantics, the model is under-identified. Silhouette scores on hard assignments (\arg\max_k \gamma_{ik}) supplement likelihood but ignore covariance structure.

Harbor Analytics merchant segmentation refactor

Problem. Risk dashboards grouped merchants with K-means on standardized features: log monthly volume, chargeback rate, ticket-size coefficient of variation, and days-since-onboarding. Quarterly re-fits permuted cluster IDs (label switching) and centroids jumped when holiday volume spiked, confusing downstream rules.

Approach. Replace K-means with a 5-component diagonal-covariance GMM trained via EM (n = 180{,}000, d = 4). Initialization: K-means++ centroids for first M-step. Ten random restarts; keep highest training log-likelihood. Map components to business names by inspecting \mu_k and top merchants by \gamma_{ik}.

Results. Held-out log-likelihood improved 22% versus K-means at the same K. Soft assignments let rules fire at \gamma_{ik} > 0.6 instead of brittle hard borders. Component ordering anchored by \mu_k volume dimension reduced label-switching pain across refreshes. Anomaly queue added merchants with p(x_i) < 10^{-4} under the mixture — catching synthetic shell patterns K-means missed.

Ops. Nightly batch refit on rolling 90-day window; BIC sweep every quarter to revisit K. Full-covariance tested but overfit with n_k < 500 on the smallest component; diagonal stuck.

Method decision table

Goal	Prefer	Why not GMM?
Fast hard partitions, spherical clusters	K-means	GMM overhead unjustified; no density scores
Arbitrary shapes, noise points, no `K`	DBSCAN / HDBSCAN	GMM needs `K`; struggles with varying density
Elliptical clusters + probabilities + density	GMM + EM	—
Sequential latent states over time	HMM with GMM emissions	Plain GMM ignores temporal structure
Bayesian uncertainty on `K` and parameters	Dirichlet-process mixture / VI	Frequentist GMM fixes `K`; point estimates only
High-dimensional embeddings (100+ dims)	PCA + GMM, or deep clustering	Full covariance explodes; EM unstable

Common pitfalls

Local optima: EM converges to different solutions per seed. Always run multiple restarts and compare log-likelihood.
Singular covariances: a component collapsing onto a few points yields non-invertible \Sigma_k. Regularize or use diagonal/tied covariances.
Label switching: component indices are arbitrary across runs. Anchor labels by sorted \mu_k or overlap with a reference fit.
Unscaled features: EM is not scale-invariant when covariances are estimated. Standardize continuous inputs first (see feature scaling).
Too many components in low n: BIC helps, but empty components still waste iterations. Set min_covar / minimum weight thresholds.
Confusing soft labels with calibrated probabilities: GMM posteriors are model-based, not ground-truth class probabilities. Validate segments against business outcomes.
Ignoring identifiability: overlapping Gaussians can be swapped without changing likelihood. Merge or regularize when components duplicate.

Practitioner checklist

Standardize continuous features; inspect for heavy tails and outliers.
Start with diagonal covariance unless n per component is large.
Initialize with K-means++ or a small grid of seeds; keep best log-likelihood.
Sweep K with BIC on held-out data, not training likelihood alone.
Check effective counts N_k — components below ~50 points need scrutiny.
Apply covariance regularization to avoid singular matrices.
Name components by inspecting centroids and high-responsibility exemplars.
Define a label-anchoring scheme before production refreshes.
Expose soft weights to downstream systems that benefit from uncertainty.
Log anomaly scores p(x) for points below a tuned threshold.

Key takeaways

GMMs model data as a weighted sum of Gaussian components with latent cluster assignments.
EM alternates soft assignment (E-step) and weighted parameter updates (M-step), monotonically improving likelihood.
K-means is a hard, spherical special case of EM on a constrained GMM.
Covariance structure and regularization matter more than increasing K alone.
Multiple random restarts and BIC on hold-out data keep GMM fits stable and interpretable.