Guide
Hidden Markov models explained
Harbor Analytics monitored centrifugal pumps with fixed vibration thresholds. Normal load spikes tripped false alarms, while gradual bearing wear stayed below the threshold until catastrophic failure, and maintenance crews learned to ignore alerts. The reliability team reframed the problem as a hidden Markov model (HMM): each hour the pump occupies a hidden state (Healthy, Degrading, or Failing) that cannot be measured directly but emits observable vibration spectra drawn from a state-dependent distribution. States evolve according to a Markov transition matrix in which wear progresses and rarely reverses. After Baum-Welch training on 18 months of hourly data, anchored by 14 labeled outages, the model flagged Degrading pumps a median of 11 days before failure versus 3 days with thresholds, and false positives fell 41%.
An HMM is the standard probabilistic model for sequences whose generating mechanism is latent: you specify hidden states, transition probabilities, and emission distributions, then solve three problems (likelihood, decoding, and learning). This guide builds the HMM tuple, walks through the forward-backward, Viterbi, and Baum-Welch algorithms, covers discrete and Gaussian emissions, works the Harbor pump case, provides a method decision table, lists pitfalls, and ends with a practitioner checklist. For visible-state Markov processes without emissions, see Markov chains; for sequential decisions with actions and rewards, see Markov decision processes.
The HMM tuple: λ = (A, B, π)
A discrete-time HMM with N hidden states and observation alphabet V (or a continuous emission family) is defined by:
- State set S = {1, …, N}: latent regimes you cannot observe directly. Harbor uses N = 3: Healthy, Degrading, Failing.
- Initial distribution π: $\pi_i = P(q_1 = i)$ for each state $i$. New pumps start almost entirely in Healthy.
- Transition matrix A: $a_{ij} = P(q_{t+1} = j \mid q_t = i)$. Rows sum to 1. Wear models enforce a high $a_{\text{Healthy},\text{Healthy}}$ and $a_{\text{Failing},\text{Healthy}} \approx 0$.
- Emission matrix or density B: $b_j(o) = P(o_t = o \mid q_t = j)$ for discrete observations, or a parametric density (Gaussian, Poisson) for continuous or count data. Harbor models log-power in three frequency bands as a multivariate Gaussian per state.
A joint sequence consists of hidden states $Q = q_1, \dots, q_T$ and observations $O = o_1, \dots, o_T$. The generative story: sample $q_1 \sim \pi$ and emit $o_1 \sim b_{q_1}$; then for $t = 2, \dots, T$ sample $q_t \sim a_{q_{t-1},\cdot}$ and emit $o_t \sim b_{q_t}$. The Markov property applies to the hidden states: the next state depends only on the current state, not the full history.
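The generative story above can be sketched in a few lines of Python. This is a minimal illustration for a discrete-emission HMM, not Harbor's model: the function name `sample_hmm`, the three vibration bins, and every probability below are made-up values chosen only to make the example run.

```python
import random

def sample_hmm(pi, A, B, T, seed=0):
    """Sample (states, observations) of length T from a discrete HMM."""
    rng = random.Random(seed)
    n_states, n_symbols = len(pi), len(B[0])
    states, obs = [], []
    q = rng.choices(range(n_states), weights=pi)[0]          # q_1 ~ pi
    for _ in range(T):
        o = rng.choices(range(n_symbols), weights=B[q])[0]   # o_t ~ b_{q_t}
        states.append(q)
        obs.append(o)
        q = rng.choices(range(n_states), weights=A[q])[0]    # q_{t+1} ~ a_{q_t, .}
    return states, obs

# Hypothetical wear model: 0 = Healthy, 1 = Degrading, 2 = Failing.
pi = [0.95, 0.05, 0.0]
A = [[0.98, 0.02, 0.00],
     [0.00, 0.95, 0.05],
     [0.00, 0.00, 1.00]]
# Hypothetical emission table over low / medium / high vibration bins.
B = [[0.80, 0.15, 0.05],
     [0.30, 0.50, 0.20],
     [0.05, 0.25, 0.70]]

states, obs = sample_hmm(pi, A, B, T=50)
```

Because the zeros in `A` forbid backward moves, every sampled state path is non-decreasing, which is a quick sanity check on the transition structure.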
Why “hidden”?
Observations are noisy projections of latent structure. Two consecutive hours of elevated vibration might mean a passing load surge (Healthy) or irreversible wear onset (Degrading). The HMM marginalizes over state uncertainty instead of committing to a single hard label per timestep.
Three inference problems
Rabiner's formulation organizes HMM work into three problems:
- Evaluation (likelihood): given model λ = (A, B, π) and observation sequence O, compute $P(O \mid \lambda)$. Used for model comparison and anomaly scoring.
- Decoding: find the most likely hidden path $Q^* = \arg\max_Q P(Q, O \mid \lambda)$. Solved by the Viterbi algorithm (dynamic programming in $O(N^2 T)$).
- Learning: estimate λ from training sequences $\{O^{(m)}\}$ with or without labeled states. Baum-Welch is EM specialized to HMMs.
Forward algorithm (evaluation)
Define the forward variable $\alpha_t(i) = P(o_1, \dots, o_t, q_t = i \mid \lambda)$. Recurse: $\alpha_1(i) = \pi_i\, b_i(o_1)$; $\alpha_t(j) = \left[\sum_i \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t)$. Then $P(O \mid \lambda) = \sum_i \alpha_T(i)$. Work in log-space or apply scaling factors to avoid underflow on long sequences.
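A log-space version of the forward recursion might look like the sketch below. It assumes discrete emissions stored as a nested list `B[state][symbol]`; the helper names and the two-state toy model are illustrative, not part of the Harbor system.

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def safe_log(p):
    return math.log(p) if p > 0 else -math.inf

def forward_log(pi, A, B, obs):
    """Return log P(O | lambda) for a discrete HMM."""
    n = len(pi)
    # Initialization: log alpha_1(i) = log pi_i + log b_i(o_1)
    log_alpha = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        # Recursion: log alpha_t(j) = logsumexp_i(log alpha_{t-1}(i) + log a_ij) + log b_j(o_t)
        log_alpha = [
            logsumexp([log_alpha[i] + safe_log(A[i][j]) for i in range(n)])
            + safe_log(B[j][o])
            for j in range(n)
        ]
    # Termination: log P(O) = logsumexp_i log alpha_T(i)
    return logsumexp(log_alpha)

# Toy two-state model, small enough to check by hand.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

short_ll = forward_log(pi, A, B, [0, 1])        # hand calculation gives P(O) = 0.2156
long_ll = forward_log(pi, A, B, [0, 1] * 2500)  # 5000 steps, still finite
```

For the two-step sequence the recursion can be followed by hand: $\alpha_1 = (0.3, 0.04)$, $\alpha_2 = (0.113, 0.1026)$, so $P(O \mid \lambda) = 0.2156$. On the 5000-step sequence a direct product of probabilities would underflow to zero in double precision, while the log-space value stays finite.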
Forward-backward (posterior state occupation)
The backward pass computes $\beta_t(i) = P(o_{t+1}, \dots, o_T \mid q_t = i, \lambda)$, the probability of the future observations given state $i$ at time $t$. Combining forward and backward yields $\gamma_t(i) = P(q_t = i \mid O, \lambda)$ and $\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$, which Baum-Welch uses in the M-step.
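For reference, the standard recursions behind these quantities can be written in terms of the forward variable α from the previous section. The unscaled form is shown for readability; a practical implementation would apply scaling or work in log-space.

$$
\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)
$$

$$
\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}, \qquad
\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}
$$

A useful consistency check is that $\sum_i \gamma_t(i) = 1$ at every $t$ and $\sum_j \xi_t(i, j) = \gamma_t(i)$ for $t < T$.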
Viterbi (decoding)
Replace the sums in the forward recursion with maxes. Maintain $v_t(j) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, q_t = j, o_1, \dots, o_t \mid \lambda)$ and backpointers. The recovered path is the single best state explanation. It is not the same as taking the marginal argmax per timestep, which can yield a sequence containing transitions the model forbids.
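A compact Viterbi sketch in log-space, assuming the same nested-list layout for discrete emissions (`B[state][symbol]`). The function name and the toy model are illustrative.

```python
import math

def _log(p):
    return math.log(p) if p > 0 else -math.inf

def viterbi(pi, A, B, obs):
    """Return (most likely state path, log joint probability of that path)."""
    n = len(pi)
    v = [_log(pi[i]) + _log(B[i][obs[0]]) for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        ptr, nxt = [], []
        for j in range(n):
            # Best predecessor of state j at this step.
            best_i = max(range(n), key=lambda i: v[i] + _log(A[i][j]))
            ptr.append(best_i)
            nxt.append(v[best_i] + _log(A[best_i][j]) + _log(B[j][o]))
        backpointers.append(ptr)
        v = nxt
    # Backtrack from the best final state.
    last = max(range(n), key=lambda i: v[i])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]

# Toy two-state model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
path, log_p = viterbi(pi, A, B, [0, 1])
```

The two nested loops over states inside the loop over time are where the $O(N^2 T)$ cost comes from.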
Baum-Welch: EM for HMMs
When hidden states are unlabeled, Baum-Welch alternates:
- E-step: run forward-backward on each training sequence to estimate expected state occupations γ and transition counts ξ.
- M-step: update π, A, and B to maximize expected complete-data log-likelihood. Transition updates are weighted counts; emission updates depend on the emission family (multinomial MLE for discrete, Gaussian mean/covariance for continuous).
EM climbs to a local optimum — initialization matters. Common strategies: random restarts, k-means on observation features to seed states, or domain-informed priors (e.g. left-to-right topology for speech). Convergence is monitored via log-likelihood plateaus; set max iterations and early-stop tolerance.
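One Baum-Welch iteration for a discrete HMM on a single sequence can be sketched as follows. This is a teaching sketch under simplifying assumptions: one training sequence, no smoothing, scaled (not log-space) forward-backward, and every observed symbol has nonzero probability under the current model. All names and numbers are illustrative.

```python
import math

def forward_backward(pi, A, B, obs):
    """Scaled forward-backward. Returns (gamma, xi, log-likelihood)."""
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)                      # c_t = P(o_t | o_1..o_{t-1})
        scale.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    return gamma, xi, sum(math.log(c) for c in scale)

def baum_welch_step(pi, A, B, obs):
    """One EM iteration. Returns updated (pi, A, B) and the log-likelihood before the update."""
    n, m = len(pi), len(B[0])
    gamma, xi, loglik = forward_backward(pi, A, B, obs)   # E-step
    new_pi = list(gamma[0])                               # M-step from here on
    new_A, new_B = [], []
    for i in range(n):
        denom = sum(g[i] for g in gamma[:-1])
        new_A.append([sum(x[i][j] for x in xi) / denom if denom > 0 else A[i][j]
                      for j in range(n)])
    for j in range(n):
        denom = sum(g[j] for g in gamma)
        new_B.append([sum(g[j] for g, o in zip(gamma, obs) if o == k) / denom
                      if denom > 0 else B[j][k] for k in range(m)])
    return new_pi, new_A, new_B, loglik

# Illustrative run: 10 EM iterations from a hand-picked starting point.
obs = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
pi_hat, A_hat, B_hat = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.5, 0.5], [0.1, 0.9]]
logliks = []
for _ in range(10):
    pi_hat, A_hat, B_hat, ll = baum_welch_step(pi_hat, A_hat, B_hat, obs)
    logliks.append(ll)
```

The recorded log-likelihoods should never decrease from one iteration to the next; a drop usually signals a bug in the E-step rather than a property of the data. With several training sequences, the numerators and denominators are summed across sequences before dividing.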
Supervised learning shortcut
When historical state labels exist (post-mortem failure analysis, hand-annotated POS tags), estimate A and B by counting transitions and emissions directly — no EM required. Harbor used 14 labeled outages plus Baum-Welch on 200 unlabeled pumps to stabilize rare Failing emissions.
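The counting estimator is short enough to sketch directly. The additive pseudocount is an assumption added here so that rare states such as Failing do not produce empty rows; note that any positive pseudocount also puts mass on transitions a topology constraint would forbid, so those entries would need to be zeroed and the rows renormalized afterwards. The function name and data are illustrative.

```python
def supervised_mle(state_seqs, obs_seqs, n_states, n_symbols, pseudo=1.0):
    """Estimate A and B from labeled sequences by (smoothed) relative frequency."""
    trans = [[pseudo] * n_states for _ in range(n_states)]
    emit = [[pseudo] * n_symbols for _ in range(n_states)]
    for states, observations in zip(state_seqs, obs_seqs):
        for q, o in zip(states, observations):
            emit[q][o] += 1                       # emission count for (state, symbol)
        for q, q_next in zip(states, states[1:]):
            trans[q][q_next] += 1                 # transition count for (q_t, q_{t+1})
    A = [[c / sum(row) for c in row] for row in trans]
    B = [[c / sum(row) for c in row] for row in emit]
    return A, B

# One hypothetical labeled trajectory: states and binned observations.
A_hat, B_hat = supervised_mle(
    state_seqs=[[0, 0, 1, 1, 2]],
    obs_seqs=[[0, 0, 1, 1, 2]],
    n_states=3,
    n_symbols=3,
)
```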
Topology constraints
Common topologies are ergodic (fully connected), left-to-right (states only advance, as with speech phonemes or wear progression), and Bakis (left-to-right with skip connections). The constrained topologies reduce the number of free parameters and the risk of overfitting. Forbidden transitions are zeroed in A before training.
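A constrained initial transition matrix can be built as in the sketch below. The helper name, the `stay` probability, and the choice to make the last state absorbing are assumptions for illustration.

```python
def left_to_right_init(n_states, max_skip=0, stay=0.9):
    """Initial A for a left-to-right (max_skip=0) or Bakis (max_skip>0) topology."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        # Allowed forward targets: i+1 up to i+1+max_skip, clipped to the last state.
        targets = list(range(i + 1, min(i + 2 + max_skip, n_states)))
        if not targets:
            A[i][i] = 1.0              # last state is absorbing
            continue
        A[i][i] = stay
        for j in targets:
            A[i][j] = (1.0 - stay) / len(targets)
    return A

strict = left_to_right_init(3)              # Healthy -> Degrading -> Failing only
bakis = left_to_right_init(3, max_skip=1)   # also allows Healthy -> Failing
```

Zeroing before training is sufficient because the expected transition count $\xi_t(i, j)$ is proportional to $a_{ij}$: an entry that starts at zero receives zero expected count and stays zero through every Baum-Welch update.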
Discrete vs Gaussian emissions
Discrete HMMs suit categorical observations: part-of-speech tags from words, DNA nucleotides, regime labels discretized into bins. Emission matrix B has shape N × |V|.
Gaussian HMMs model continuous vectors (sensor readings, MFCC frames, log-returns). Each state carries mean vector μj and covariance Σj. Mixture-of-Gaussians HMMs (GMM-HMM) add mixture components per state — the classical acoustic model in speech recognition before deep learning.
High-dimensional continuous observations often need dimensionality reduction (PCA on spectra) or diagonal covariances to keep parameters identifiable with limited data.
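The parameter savings from diagonal covariances are easy to quantify. The sketch below computes a diagonal-Gaussian log-density and counts emission parameters for full versus diagonal covariance; the function names and the 40-dimensional example are illustrative assumptions.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance (var = per-dimension variances)."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def n_emission_params(n_states, dim, diagonal):
    """Free emission parameters: a mean vector plus covariance terms per state."""
    cov_terms = dim if diagonal else dim * (dim + 1) // 2
    return n_states * (dim + cov_terms)

# Three states, three log-power bands (the dimensions used in the Harbor case).
full_3d = n_emission_params(3, 3, diagonal=False)    # 3 * (3 + 6)
diag_3d = n_emission_params(3, 3, diagonal=True)     # 3 * (3 + 3)
# A hypothetical 40-dimensional spectrum shows why reduction matters.
full_40d = n_emission_params(3, 40, diagonal=False)  # 3 * (40 + 820)
diag_40d = n_emission_params(3, 40, diagonal=True)   # 3 * (40 + 40)
```

In three dimensions the difference is small (27 versus 18 parameters), but at 40 dimensions full covariances need 2580 parameters against 240 for diagonal ones, which is where limited data starts to bite.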
Harbor Analytics: pump health monitoring
Problem: 200 centrifugal pumps across three plants; hourly vibration snapshots in three log-power bands. Goal: early Degrading alert without crying wolf on load noise.
Model: N = 3 left-to-right states; 3D Gaussian emissions per state; A constrained so Failing cannot return to Healthy. Training: Baum-Welch on 18 months of unlabeled hourly data, initialized from k-means (k = 3) on band features; M-step refreshed with 14 labeled failure trajectories.
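Deployment: an hourly forward pass computes the filtered posterior $P(q_t = \text{Degrading} \mid o_1, \dots, o_t, \lambda)$ (the smoothed $\gamma_t$ would need future observations, which are not available live); an alert fires when this posterior exceeds 0.6 for two consecutive hours. Viterbi paths are used in post-incident review, not live paging. Anomaly score: $-\log P(O \mid \lambda)$ for pumps with no recent maintenance.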
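The alert rule can be sketched as below. A forward-only pass yields the filtered posterior $P(q_t = i \mid o_1, \dots, o_t)$, which is what the sketch computes. The function names, the per-hour emission likelihoods, and the demo posteriors are illustrative assumptions; only the 0.6 threshold and the two-hour rule come from the text.

```python
def filtered_posteriors(pi, A, emission_probs):
    """emission_probs[t][j] = b_j(o_t). Returns P(q_t = j | o_1..o_t) for each t."""
    n = len(pi)
    out, prev = [], None
    for probs in emission_probs:
        # Predict: push the previous posterior through the transition matrix.
        prior = pi if prev is None else [
            sum(prev[i] * A[i][j] for i in range(n)) for j in range(n)
        ]
        # Update: weight by the emission likelihood and renormalize.
        unnorm = [prior[j] * probs[j] for j in range(n)]
        z = sum(unnorm)
        prev = [u / z for u in unnorm]
        out.append(prev)
    return out

def alert_hours(posteriors, state, threshold=0.6, consecutive=2):
    """Indices t at which the posterior for `state` has exceeded threshold for `consecutive` steps."""
    run, alerts = 0, []
    for t, p in enumerate(posteriors):
        run = run + 1 if p[state] > threshold else 0
        if run >= consecutive:
            alerts.append(t)
    return alerts

# Hand-written posteriors (Healthy, Degrading, Failing) to exercise the rule.
demo = [[0.9, 0.1, 0.0], [0.3, 0.7, 0.0], [0.5, 0.5, 0.0],
        [0.2, 0.8, 0.0], [0.1, 0.9, 0.0], [0.1, 0.85, 0.05]]
alerts = alert_hours(demo, state=1)

# Filtering a short hypothetical stream of emission likelihoods.
pi = [0.95, 0.05, 0.0]
A = [[0.98, 0.02, 0.0], [0.0, 0.95, 0.05], [0.0, 0.0, 1.0]]
post = filtered_posteriors(pi, A, [[0.8, 0.3, 0.05], [0.15, 0.5, 0.25]])
```

In the demo, the single elevated hour at index 1 does not fire because the posterior falls back below the threshold at index 2; the alert begins at index 4, the second consecutive hour above 0.6.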
Results: median lead time 11 days (vs 3 with static thresholds); false positive rate 4.2% (vs 7.1%); one missed failure traced to a sensor dropout — fixed by adding a missing-data mask in the forward pass.
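One common way to mask a missing observation in the forward pass is to set its emission factor to 1 for every state, so that step only propagates the state distribution through A. The sketch below shows this for discrete emissions, using `None` to mark a gap; it is an illustration of the idea, not Harbor's implementation, and the names and numbers are assumptions.

```python
import math

def forward_loglik_masked(pi, A, B, obs):
    """Scaled forward pass returning log P(observed values); None marks a missing observation."""
    n = len(pi)
    loglik, alpha = 0.0, None
    for o in obs:
        prior = pi if alpha is None else [
            sum(alpha[i] * A[i][j] for i in range(n)) for j in range(n)
        ]
        # Missing observation: emission factor 1 for every state (prediction step only).
        emit = [1.0] * n if o is None else [B[j][o] for j in range(n)]
        row = [prior[j] * emit[j] for j in range(n)]
        c = sum(row)
        loglik += math.log(c)
        alpha = [x / c for x in row]
    return loglik

# Toy two-state, two-symbol model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]

masked = math.exp(forward_loglik_masked(pi, A, B, [0, None, 1]))
summed = (math.exp(forward_loglik_masked(pi, A, B, [0, 0, 1]))
          + math.exp(forward_loglik_masked(pi, A, B, [0, 1, 1])))
```

Masking is equivalent to marginalizing over the unseen value, so `masked` matches `summed`, the total likelihood over both possible fill-ins for the gap.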
Method decision table
| Your situation | HMM | Markov chain (observed states) | RNN / Transformer | Change-point detection |
|---|---|---|---|---|
| Latent regimes, modest N, limited labels | Strong fit | Wrong tool (no hidden layer) | Overkill; needs more data | Partial (no state persistence model) |
| Long sequences, huge vocabulary (text, audio) | Struggles (sparse emissions) | N/A | Strong fit | N/A |
| Known state sequence, count transitions | Supervised MLE on A, B | Strong fit | Overkill | N/A |
| Real-time decoding on embedded hardware | Fast (O(N²T)) | Trivial | Heavy inference | Varies |
| Irregular timestamps, variable gaps | Needs extension (continuous-time HMM, or input-output HMM with the gap as an input) | Discrete-time assumption | More flexible | Moderate |
| Interpretable regime labels for auditors | States are explicit | Moderate | Black box | Breakpoints only |
Common pitfalls
- Too many hidden states: EM fits noise; validation log-likelihood peaks then drops. Use BIC or held-out sequences to pick N.
- Numeric underflow: multiplying thousands of probabilities yields zero. Always scale forward variables or accumulate log-probabilities with log-sum-exp.
- Label switching: during unsupervised training, state indices permute between restarts. Align states post-hoc via domain semantics or canonical ordering.
- Non-stationary emissions: seasonal load shifts change vibration baselines; retrain quarterly or add exogenous inputs (temperature, throughput) to emissions.
- Confusing Viterbi with marginals: argmax per-timestep marginals can imply impossible transitions; use Viterbi for path explanations.
- Ignoring missing observations: sensor gaps break the recurrence; mask timesteps or impute with care.
- Equating HMMs with Markov chains: observed-state chains cannot model emission noise; HMMs cannot model action-dependent transitions (that is an MDP/POMDP).
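For the first pitfall, a BIC comparison across candidate values of N can be sketched as follows. The parameter count assumes a discrete-emission, fully connected HMM with no topology constraints, and taking the sample size to be the total number of timesteps is a convention adopted here for illustration; forbidden transitions would reduce the count.

```python
import math

def discrete_hmm_free_params(n_states, n_symbols):
    """Free parameters: initial distribution + transition rows + emission rows."""
    return (
        (n_states - 1)                     # pi sums to 1
        + n_states * (n_states - 1)        # each row of A sums to 1
        + n_states * (n_symbols - 1)       # each row of B sums to 1
    )

def bic(log_likelihood, n_params, n_observations):
    """Bayesian information criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_observations)

k3 = discrete_hmm_free_params(n_states=3, n_symbols=4)
k4 = discrete_hmm_free_params(n_states=4, n_symbols=4)
# Hypothetical training log-likelihoods for N = 3 and N = 4 on 1000 timesteps.
bic3 = bic(-1200.0, k3, 1000)
bic4 = bic(-1195.0, k4, 1000)
```

With these made-up numbers, the fourth state buys only five units of log-likelihood at the cost of ten extra parameters, so BIC prefers N = 3.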
Practitioner checklist
- Define hidden states with domain meaning; constrain topology if progression is one-way.
- Choose discrete vs Gaussian (or GMM) emissions based on observation type and dimensionality.
- Split data by unit (pump, customer, session) — never shuffle timesteps across entities.
- Initialize EM from k-means or supervised counts; run multiple restarts and keep best log-likelihood.
- Validate on held-out sequences: log-likelihood, state-aligned precision/recall if labels exist.
- Implement forward/backward in log-space; unit-test against a tiny hand-checked example.
- Deploy posterior thresholds or Viterbi paths depending on whether you need soft scores or single explanations.
- Monitor emission drift; schedule retraining when likelihood on recent data degrades.
- Document N, topology, and state semantics for operations and compliance teams.
- Benchmark against a simple baseline (thresholds, HMM with N = 2) before claiming lift.
Key takeaways
- An HMM couples a Markov chain over hidden states with state-conditional emission distributions.
- Three problems — likelihood, decoding, learning — map to forward, Viterbi, and Baum-Welch algorithms.
- Baum-Welch is EM: E-step computes expected state occupation; M-step updates π, A, and B.
- Topology constraints and careful state count selection prevent overfitting on short sequences.
- HMMs remain strong baselines for interpretable regime detection; deep sequence models win on scale but sacrifice transparency.
Related reading
- Markov chains explained — transition matrices, stationary distributions and the visible-state special case
- Speech recognition explained — acoustic modeling where GMM-HMMs dominated before end-to-end neural nets
- Time series forecasting explained — ARIMA, seasonality and when generative state models beat direct forecasting
- Unsupervised learning and clustering explained — k-means initialization and EM family connections