Hidden Markov models explained

Harbor Analytics monitored centrifugal pumps with fixed vibration thresholds. Normal load spikes tripped false alarms, while gradual bearing wear stayed below the threshold until catastrophic failure, and maintenance crews learned to ignore alerts. The reliability team reframed the problem as a hidden Markov model (HMM): each hour the pump occupies a hidden state in {Healthy, Degrading, Failing} that cannot be measured directly but emits observable vibration spectra drawn from a state-dependent distribution. States evolve according to a Markov transition matrix (wear progresses and rarely reverses). After Baum-Welch training on 18 months of hourly data, supplemented by 14 labeled outages, posterior-based alerts flagged Degrading pumps a median of 11 days before failure versus 3 days with thresholds, and the false positive rate fell 41% in relative terms (from 7.1% to 4.2%).

An HMM is the standard probabilistic model for sequences whose generating mechanism is latent: you specify hidden states, transition probabilities, and emission distributions, then solve three problems (likelihood, decoding, and learning).

This guide builds the HMM tuple, walks through the forward-backward, Viterbi, and Baum-Welch algorithms, covers discrete and Gaussian emissions, works the Harbor pump case, provides a method decision table, lists pitfalls, and ends with a practitioner checklist. For visible-state Markov processes without emissions, see Markov chains; for sequential decisions with actions and rewards, see Markov decision processes.

The HMM tuple: λ = (A, B, π)

A discrete-time HMM with N hidden states and observation alphabet V (or continuous emission family) is defined by:

  • State set S = {1, …, N}: latent regimes you cannot observe directly. Harbor uses N = 3: Healthy, Degrading, Failing.
  • Initial distribution π: π_i = P(q_1 = i) for each state i. New pumps start almost entirely in Healthy.
  • Transition matrix A: a_{ij} = P(q_{t+1} = j | q_t = i). Rows sum to 1. Wear models keep a_{Healthy,Healthy} high and a_{Failing,Healthy} ≈ 0.
  • Emission matrix or density B: b_j(o) = P(o_t = o | q_t = j) for discrete observations, or a parametric density (Gaussian, Poisson) for continuous or count data. Harbor models log-power in three frequency bands as a multivariate Gaussian per state.

A joint sequence pairs hidden states Q = q_1, …, q_T with observations O = o_1, …, o_T. The generative story: sample q_1 ~ π and emit o_1 ~ b_{q_1}; then for t = 2, …, T sample q_t ~ a_{q_{t-1},·} (row q_{t-1} of A) and emit o_t ~ b_{q_t}. The Markov property applies to the hidden states: the next state depends only on the current state, not the full history. Each observation depends only on the state that emitted it.
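The generative story can be sketched in a few lines of Python. This is a minimal illustration under assumed values: the function name `sample_hmm`, the three discretized vibration symbols, and every probability below are hypothetical and are not Harbor's fitted parameters.

```python
import random

def sample_hmm(pi, A, B, T, seed=0):
    """Draw (states, observations) of length T from a discrete HMM."""
    rng = random.Random(seed)
    n_states, n_symbols = len(pi), len(B[0])
    q = rng.choices(range(n_states), weights=pi)[0]         # q_1 ~ pi
    states, obs = [], []
    for _ in range(T):
        o = rng.choices(range(n_symbols), weights=B[q])[0]  # o_t ~ b_{q_t}
        states.append(q)
        obs.append(o)
        q = rng.choices(range(n_states), weights=A[q])[0]   # q_{t+1} ~ row q_t of A
    return states, obs

# Hypothetical toy model: states 0=Healthy, 1=Degrading, 2=Failing;
# symbols 0=low, 1=medium, 2=high vibration.
pi = [1.0, 0.0, 0.0]
A = [[0.95, 0.05, 0.00],
     [0.00, 0.90, 0.10],
     [0.00, 0.00, 1.00]]
B = [[0.80, 0.15, 0.05],
     [0.20, 0.60, 0.20],
     [0.05, 0.25, 0.70]]

states, obs = sample_hmm(pi, A, B, T=50)
```

Because this assumed A is upper triangular, sampled state sequences never move backward, which is the one-way wear progression described above.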

Why “hidden”?

Observations are noisy projections of latent structure. Two consecutive hours of elevated vibration might mean a passing load surge (Healthy) or irreversible wear onset (Degrading). The HMM marginalizes over state uncertainty instead of committing to a single hard label per timestep.

Three inference problems

Rabiner's formulation organizes HMM work into three problems:

  1. Evaluation (likelihood): given model λ = (A, B, π) and observation sequence O, compute P(O | λ). Used for model comparison and anomaly scoring.
  2. Decoding: find the most likely hidden path Q* = argmax_Q P(Q, O | λ). Solved by the Viterbi algorithm (dynamic programming in O(N²T) time).
  3. Learning: estimate λ from training sequences {O^(m)} with or without labeled states. Baum-Welch is EM specialized to HMMs.

Forward algorithm (evaluation)

Define the forward variable α_t(i) = P(o_1, …, o_t, q_t = i | λ). Initialize α_1(i) = π_i b_i(o_1), then recurse: α_t(j) = [∑_i α_{t-1}(i) a_{ij}] b_j(o_t) for t = 2, …, T. Then P(O | λ) = ∑_i α_T(i). Work in log-space or apply scaling factors to avoid underflow on long sequences.
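A scaled forward pass might look like the sketch below. It normalizes α at every step and accumulates the logs of the normalizers, whose product is P(O | λ). The function name and the toy parameters are assumptions made for illustration.

```python
import math

def forward_loglik(pi, A, B, obs):
    """Scaled forward pass: returns log P(O | lambda) and the normalized alphas."""
    N = len(pi)
    loglik, alphas, prev = 0.0, [], None
    for t, o in enumerate(obs):
        if t == 0:
            alpha = [pi[j] * B[j][o] for j in range(N)]
        else:
            alpha = [sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                     for j in range(N)]
        c = sum(alpha)                 # c_t = P(o_t | o_1..o_{t-1})
        loglik += math.log(c)          # log P(O) = sum of log c_t
        prev = [a / c for a in alpha]  # rescale so each alpha sums to 1
        alphas.append(prev)
    return loglik, alphas

# Hypothetical toy model (0=Healthy, 1=Degrading, 2=Failing).
pi = [1.0, 0.0, 0.0]
A = [[0.95, 0.05, 0.00], [0.00, 0.90, 0.10], [0.00, 0.00, 1.00]]
B = [[0.80, 0.15, 0.05], [0.20, 0.60, 0.20], [0.05, 0.25, 0.70]]
obs = [0, 1, 2]

loglik, alphas = forward_loglik(pi, A, B, obs)
```

The sketch assumes the sequence has nonzero probability under the model; an impossible sequence would make a normalizer zero and `math.log` would raise.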

Forward-backward (posterior state occupation)

The backward pass computes β_t(i) = P(o_{t+1}, …, o_T | q_t = i, λ), the probability of the future observations given state i at time t. Combining forward and backward yields γ_t(i) = α_t(i) β_t(i) / P(O | λ) = P(q_t = i | O, λ) and ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ), which Baum-Welch uses in the M-step.
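The sketch below computes γ and ξ with a scaled forward-backward pass. Dividing β by the same per-step normalizers used for α makes the product α·β equal γ directly. Names and toy parameters are illustrative assumptions.

```python
def forward_backward(pi, A, B, obs):
    """Return gamma[t][i] and xi[t][i][j] for one discrete observation sequence."""
    N, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [pi[j] * B[j][obs[0]] for j in range(N)]
        else:
            a = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
        scale.append(sum(a))
        alpha.append([x / scale[t] for x in a])
    # Backward pass, rescaled with the forward normalizers.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   / scale[t + 1] for i in range(N)]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, xi

# Hypothetical toy model (0=Healthy, 1=Degrading, 2=Failing).
pi = [1.0, 0.0, 0.0]
A = [[0.95, 0.05, 0.00], [0.00, 0.90, 0.10], [0.00, 0.00, 1.00]]
B = [[0.80, 0.15, 0.05], [0.20, 0.60, 0.20], [0.05, 0.25, 0.70]]
obs = [0, 1, 2, 2]

gamma, xi = forward_backward(pi, A, B, obs)
```

Two useful sanity checks: each γ_t sums to 1 over states, and summing ξ_t(i, j) over j recovers γ_t(i).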

Viterbi (decoding)

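Replace the sums in the forward recursion with maxima. Maintain v_t(j) = max_{q_1, …, q_{t-1}} P(q_1, …, q_{t-1}, q_t = j, o_1, …, o_t | λ) together with backpointers, then trace back from argmax_j v_T(j). The recovered path is the single best joint state explanation. It is not the same as taking the argmax of the marginal γ_t(i) at each timestep, which can string together transitions the model forbids.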
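A log-space Viterbi decoder can be sketched as follows. The helper `safe_log` (an assumption of this example) maps zero probabilities to negative infinity so forbidden transitions are never selected; the toy parameters are hypothetical.

```python
import math

def safe_log(x):
    return math.log(x) if x > 0 else float("-inf")

def viterbi(pi, A, B, obs):
    """Return the most likely state path and its joint log-probability."""
    N = len(pi)
    v = [safe_log(pi[j]) + safe_log(B[j][obs[0]]) for j in range(N)]
    backpointers = []
    for o in obs[1:]:
        ptr, nxt = [], []
        for j in range(N):
            # Best predecessor of state j at this step.
            best_i = max(range(N), key=lambda i: v[i] + safe_log(A[i][j]))
            ptr.append(best_i)
            nxt.append(v[best_i] + safe_log(A[best_i][j]) + safe_log(B[j][o]))
        backpointers.append(ptr)
        v = nxt
    last = max(range(N), key=lambda j: v[j])
    path = [last]
    for ptr in reversed(backpointers):   # trace back from the best final state
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]

# Hypothetical toy model (0=Healthy, 1=Degrading, 2=Failing).
pi = [1.0, 0.0, 0.0]
A = [[0.95, 0.05, 0.00], [0.00, 0.90, 0.10], [0.00, 0.00, 1.00]]
B = [[0.80, 0.15, 0.05], [0.20, 0.60, 0.20], [0.05, 0.25, 0.70]]
obs = [0, 0, 2, 2]

path, logp = viterbi(pi, A, B, obs)
```

On a sequence this short the result can be verified by enumerating all N^T paths, which is a convenient unit test before trusting the decoder on real data.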

Baum-Welch: EM for HMMs

When hidden states are unlabeled, Baum-Welch alternates:

  • E-step: run forward-backward on each training sequence to estimate expected state occupations γ and transition counts ξ.
  • M-step: update π, A, and B to maximize expected complete-data log-likelihood. Transition updates are weighted counts; emission updates depend on the emission family (multinomial MLE for discrete, Gaussian mean/covariance for continuous).

EM climbs to a local optimum — initialization matters. Common strategies: random restarts, k-means on observation features to seed states, or domain-informed priors (e.g. left-to-right topology for speech). Convergence is monitored via log-likelihood plateaus; set max iterations and early-stop tolerance.
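The E-step and M-step above can be sketched for a discrete-emission HMM as follows. This is a bare-bones illustration, assuming strictly positive initial parameters (so every state keeps nonzero expected occupancy) and hypothetical training sequences; it omits restarts and smoothing.

```python
import math

def e_step(pi, A, B, obs):
    """Scaled forward-backward; returns gamma, xi and log P(obs | lambda)."""
    N, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [pi[j] * B[j][obs[0]] for j in range(N)]
        else:
            a = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
        scale.append(sum(a))
        alpha.append([x / scale[t] for x in a])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   / scale[t + 1] for i in range(N)]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, xi, sum(math.log(c) for c in scale)

def baum_welch(pi, A, B, sequences, n_iter=20, tol=1e-6):
    """Run EM; returns updated (pi, A, B) and the log-likelihood history."""
    N, M = len(pi), len(B[0])
    history = []
    for _ in range(n_iter):
        pi_num = [0.0] * N
        A_num = [[0.0] * N for _ in range(N)]
        A_den = [0.0] * N
        B_num = [[0.0] * M for _ in range(N)]
        B_den = [0.0] * N
        total = 0.0
        for obs in sequences:                       # E-step: expected counts
            gamma, xi, ll = e_step(pi, A, B, obs)
            total += ll
            for i in range(N):
                pi_num[i] += gamma[0][i]
                for t, o in enumerate(obs):
                    B_num[i][o] += gamma[t][i]
                    B_den[i] += gamma[t][i]
                for t in range(len(obs) - 1):
                    A_den[i] += gamma[t][i]
                    for j in range(N):
                        A_num[i][j] += xi[t][i][j]
        history.append(total)
        # M-step: normalized expected counts.
        pi = [x / len(sequences) for x in pi_num]
        A = [[A_num[i][j] / A_den[i] for j in range(N)] for i in range(N)]
        B = [[B_num[i][k] / B_den[i] for k in range(M)] for i in range(N)]
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break                                   # log-likelihood plateau
    return pi, A, B, history

# Hypothetical two-state initialization and training data (symbols 0, 1, 2).
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
sequences = [[0, 0, 1, 0, 2, 2, 2, 1, 2, 2],
             [0, 1, 0, 0, 0, 1, 2, 2, 2, 2],
             [0, 0, 0, 1, 1, 2, 2, 1, 2, 2]]

pi_hat, A_hat, B_hat, history = baum_welch(pi0, A0, B0, sequences)
```

The recorded log-likelihood should never decrease from one iteration to the next; a drop usually signals a bug in the scaling or the count accumulation.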

Supervised learning shortcut

When historical state labels exist (post-mortem failure analysis, hand-annotated POS tags), estimate A and B by counting transitions and emissions directly — no EM required. Harbor used 14 labeled outages plus Baum-Welch on 200 unlabeled pumps to stabilize rare Failing emissions.
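With labels in hand, estimation reduces to counting. The sketch below adds optional add-one (Laplace) smoothing, which is an assumption of this example rather than something the Harbor team is described as using; note that smoothing also assigns a little mass to transitions a topology constraint would forbid.

```python
def supervised_mle(state_seqs, obs_seqs, n_states, n_symbols, smooth=1.0):
    """Count-based estimates of pi, A, B from labeled sequences."""
    pi_c = [smooth] * n_states
    A_c = [[smooth] * n_states for _ in range(n_states)]
    B_c = [[smooth] * n_symbols for _ in range(n_states)]
    for states, obs in zip(state_seqs, obs_seqs):
        pi_c[states[0]] += 1
        for q, o in zip(states, obs):
            B_c[q][o] += 1                  # emission counts
        for q, q_next in zip(states, states[1:]):
            A_c[q][q_next] += 1             # transition counts

    def normalize(row):
        total = sum(row)
        return [c / total for c in row]

    return normalize(pi_c), [normalize(r) for r in A_c], [normalize(r) for r in B_c]

# Hypothetical labeled sequence: states and symbols are both indexed from 0.
pi_hat, A_hat, B_hat = supervised_mle([[0, 0, 1]], [[0, 0, 1]],
                                      n_states=2, n_symbols=2)
```

Setting `smooth=0` gives the plain maximum-likelihood counts, but then any state with no observed outgoing transition produces a division by zero, so some smoothing or a fallback is needed for rare states.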

Topology constraints

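An ergodic (fully connected) topology is the unconstrained default. Left-to-right (states only advance, as with speech phonemes or wear progression) and Bakis (left-to-right with skip connections) topologies reduce parameters and overfitting. Forbidden transitions are zeroed in A before training; Baum-Welch keeps them at zero because their expected transition counts are zero.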
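One way to build a constrained initial transition matrix is sketched below. The function name and the `max_skip` parameter are hypothetical: `max_skip=1` allows only self-loops and moves to the next state, while `max_skip=2` also allows skipping one state.

```python
def left_to_right_init(n_states, max_skip=1):
    """Uniform initial A over allowed moves: stay, or advance by up to max_skip."""
    A = []
    for i in range(n_states):
        allowed = [j for j in range(n_states) if 0 <= j - i <= max_skip]
        A.append([1.0 / len(allowed) if j in allowed else 0.0
                  for j in range(n_states)])
    return A

strict = left_to_right_init(3)               # self-loop or next state only
with_skips = left_to_right_init(3, max_skip=2)
```

The last state can only loop on itself, so it is absorbing under this initialization, which matches a wear model in which Failing never returns to Healthy.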

Discrete vs Gaussian emissions

Discrete HMMs suit categorical observations: part-of-speech tags from words, DNA nucleotides, regime labels discretized into bins. Emission matrix B has shape N × |V|.

Gaussian HMMs model continuous vectors (sensor readings, MFCC frames, log-returns). Each state carries mean vector μj and covariance Σj. Mixture-of-Gaussians HMMs (GMM-HMM) add mixture components per state — the classical acoustic model in speech recognition before deep learning.

High-dimensional continuous observations often need dimensionality reduction (PCA on spectra) or diagonal covariances to keep parameters identifiable with limited data.
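To make the parameter trade-off concrete: a full covariance in d dimensions has d(d+1)/2 free parameters per state, a diagonal one has d. The sketch below evaluates a diagonal-covariance Gaussian emission log-density; the state means and variances are made-up standardized values, not Harbor's.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (variances in `var`)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def n_cov_params(d, diagonal):
    """Free covariance parameters per state."""
    return d if diagonal else d * (d + 1) // 2

# Hypothetical per-state (mean, variance) for three standardized log-power bands.
params = {
    "Healthy":   ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
    "Degrading": ([1.5, 1.0, 0.5], [1.0, 1.0, 1.0]),
    "Failing":   ([3.0, 2.5, 2.0], [1.5, 1.5, 1.5]),
}
x = [0.0, 0.0, 0.0]
scores = {s: diag_gaussian_logpdf(x, m, v) for s, (m, v) in params.items()}
```

These per-state log-densities take the place of log b_j(o_t) in the forward, backward, and Viterbi recursions.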

Harbor Analytics: pump health monitoring

Problem: 200 centrifugal pumps across three plants; hourly vibration snapshots in three log-power bands. Goal: early Degrading alert without crying wolf on load noise.

Model: N = 3 left-to-right states; 3D Gaussian emissions per state; A constrained so Failing cannot return to Healthy. Training: Baum-Welch on 18 months of unlabeled hourly data, initialized from k-means (k = 3) on band features; M-step refreshed with 14 labeled failure trajectories.

Deployment: an hourly forward pass computes the filtered posterior P(q_t = Degrading | o_1, …, o_t), which equals γ_t(Degrading) at the most recent timestep; an alert fires when it exceeds 0.6 for two consecutive hours. Viterbi paths are used in post-incident review, not live paging. Anomaly score = negative log P(O | λ) for pumps with no recent maintenance.
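The alert rule can be sketched as a streaming filter plus a run counter. This is an illustration only: it uses discrete symbols instead of Harbor's Gaussian emissions, and the function names, parameters, and readings are hypothetical.

```python
def filter_step(belief, A, emission_probs):
    """One forward update: predict with A, weight by emission likelihoods, renormalize."""
    N = len(belief)
    predicted = [sum(belief[i] * A[i][j] for i in range(N)) for j in range(N)]
    unnorm = [predicted[j] * emission_probs[j] for j in range(N)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def degrading_alerts(posteriors, threshold=0.6, consecutive=2):
    """Indices t where the posterior has exceeded threshold for `consecutive` steps."""
    alerts, run = [], 0
    for t, p in enumerate(posteriors):
        run = run + 1 if p > threshold else 0
        if run >= consecutive:
            alerts.append(t)
    return alerts

# Hypothetical toy model (0=Healthy, 1=Degrading, 2=Failing) and readings.
A = [[0.95, 0.05, 0.00], [0.00, 0.90, 0.10], [0.00, 0.00, 1.00]]
B = [[0.80, 0.15, 0.05], [0.20, 0.60, 0.20], [0.05, 0.25, 0.70]]
readings = [0, 0, 1, 1, 1, 1, 2]

belief = [1.0, 0.0, 0.0]        # belief carried over from the previous hour
posteriors = []
for o in readings:
    belief = filter_step(belief, A, [B[j][o] for j in range(3)])
    posteriors.append(belief[1])  # P(Degrading | readings so far)

alerts = degrading_alerts(posteriors)
```

Requiring consecutive exceedances trades a small delay for robustness against single-hour load spikes.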

Results: median lead time 11 days (vs 3 with static thresholds); false positive rate 4.2% (vs 7.1%); one missed failure traced to a sensor dropout — fixed by adding a missing-data mask in the forward pass.
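One common way to implement a missing-data mask, sketched here as an assumption about what such a fix could look like, is to treat the emission probability of a missing reading as 1 for every state. The transition step still runs, so the belief evolves through the gap without being updated by evidence.

```python
import math

def forward_loglik_masked(pi, A, B, obs):
    """Scaled forward pass; obs[t] is None where the sensor reading is missing."""
    N = len(pi)

    def emit(j, o):
        return 1.0 if o is None else B[j][o]   # missing reading: no evidence

    loglik, alpha = 0.0, None
    for o in obs:
        if alpha is None:
            alpha = [pi[j] * emit(j, o) for j in range(N)]
        else:
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * emit(j, o)
                     for j in range(N)]
        c = sum(alpha)
        loglik += math.log(c)
        alpha = [a / c for a in alpha]
    return loglik

# Hypothetical toy model (0=Healthy, 1=Degrading, 2=Failing).
pi = [1.0, 0.0, 0.0]
A = [[0.95, 0.05, 0.00], [0.00, 0.90, 0.10], [0.00, 0.00, 1.00]]
B = [[0.80, 0.15, 0.05], [0.20, 0.60, 0.20], [0.05, 0.25, 0.70]]

ll_gap = forward_loglik_masked(pi, A, B, [0, None, 2])
```

Masking this way is equivalent to summing the likelihood over every value the missing reading could have taken, which gives a direct check on the implementation.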

Method decision table

| Your situation | HMM | Markov chain (observed states) | RNN / Transformer | Change-point detection |
|---|---|---|---|---|
| Latent regimes, modest N, limited labels | Strong fit | Wrong tool (no hidden layer) | Overkill; needs more data | Partial (no state persistence model) |
| Long sequences, huge vocabulary (text, audio) | Struggles (sparse emissions) | N/A | Strong fit | N/A |
| Known state sequence, count transitions | Supervised MLE on A, B | Strong fit | Overkill | N/A |
| Real-time decoding on embedded hardware | Fast (O(N²T)) | Trivial | Heavy inference | Varies |
| Irregular timestamps, variable gaps | Needs extension (input-output HMM) | Discrete-time assumption | More flexible | Moderate |
| Interpretable regime labels for auditors | States are explicit | Moderate | Black box | Breakpoints only |

Common pitfalls

  • Too many hidden states: EM fits noise; validation log-likelihood peaks then drops. Use BIC or held-out sequences to pick N.
  • Numeric underflow: multiplying thousands of probabilities yields zero. Always scale forward variables or accumulate log-probabilities with log-sum-exp.
  • Label switching: during unsupervised training, state indices permute between restarts. Align states post-hoc via domain semantics or canonical ordering.
  • Non-stationary emissions: seasonal load shifts change vibration baselines; retrain quarterly or add exogenous inputs (temperature, throughput) to emissions.
  • Confusing Viterbi with marginals: argmax per-timestep marginals can imply impossible transitions; use Viterbi for path explanations.
  • Ignoring missing observations: sensor gaps break the recurrence; mask timesteps or impute with care.
  • Equating HMMs with Markov chains: observed-state chains cannot model emission noise; HMMs cannot model action-dependent transitions (that is an MDP/POMDP).
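For the state-count pitfall, BIC penalizes the maximized log-likelihood by the number of free parameters. For a discrete HMM with N states and |V| symbols that count is (N − 1) + N(N − 1) + N(|V| − 1). A small sketch, with hypothetical function names and a placeholder log-likelihood that is not a real result:

```python
import math

def n_free_params(n_states, n_symbols):
    """Free parameters of a discrete HMM: pi, rows of A, rows of B."""
    return (n_states - 1) + n_states * (n_states - 1) + n_states * (n_symbols - 1)

def bic(loglik, n_states, n_symbols, n_obs):
    """Bayesian information criterion; lower is better."""
    return -2.0 * loglik + n_free_params(n_states, n_symbols) * math.log(n_obs)

# Placeholder values for illustration only.
example = bic(loglik=-100.0, n_states=2, n_symbols=3, n_obs=50)
```

Topology constraints lower the count further, because each forbidden transition removes one free parameter from A; the formula above assumes a fully connected model.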

Practitioner checklist

  • Define hidden states with domain meaning; constrain topology if progression is one-way.
  • Choose discrete vs Gaussian (or GMM) emissions based on observation type and dimensionality.
  • Split data by unit (pump, customer, session) — never shuffle timesteps across entities.
  • Initialize EM from k-means or supervised counts; run multiple restarts and keep best log-likelihood.
  • Validate on held-out sequences: log-likelihood, state-aligned precision/recall if labels exist.
  • Implement forward/backward in log-space; unit-test against a tiny hand-checked example.
  • Deploy posterior thresholds or Viterbi paths depending on whether you need soft scores or single explanations.
  • Monitor emission drift; schedule retraining when likelihood on recent data degrades.
  • Document N, topology, and state semantics for operations and compliance teams.
  • Benchmark against a simple baseline (thresholds, HMM with N = 2) before claiming lift.
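The split-by-unit item can be sketched as a held-out split over unit identifiers, so that every timestep of a given pump lands on one side only. The function name, the pump identifiers, and the data are hypothetical.

```python
import random

def split_by_unit(sequences_by_unit, holdout_fraction=0.2, seed=0):
    """Hold out whole units; sequences are never split or shuffled internally."""
    units = sorted(sequences_by_unit)
    random.Random(seed).shuffle(units)
    n_holdout = max(1, round(len(units) * holdout_fraction))
    holdout = set(units[:n_holdout])
    train = {u: s for u, s in sequences_by_unit.items() if u not in holdout}
    held_out = {u: s for u, s in sequences_by_unit.items() if u in holdout}
    return train, held_out

# Hypothetical per-pump observation sequences.
data = {
    "pump-01": [0, 0, 1, 2],
    "pump-02": [0, 1, 1, 2],
    "pump-03": [0, 0, 0, 1],
    "pump-04": [0, 1, 2, 2],
    "pump-05": [0, 0, 1, 1],
}
train, held_out = split_by_unit(data)
```

Held-out log-likelihood computed on `held_out` then reflects generalization to unseen pumps, not memorization of a pump's own history.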

Key takeaways

  • An HMM couples a Markov chain over hidden states with state-conditional emission distributions.
  • Three problems — likelihood, decoding, learning — map to forward, Viterbi, and Baum-Welch algorithms.
  • Baum-Welch is EM: E-step computes expected state occupation; M-step updates π, A, and B.
  • Topology constraints and careful state count selection prevent overfitting on short sequences.
  • HMMs remain strong baselines for interpretable regime detection; deep sequence models win on scale but sacrifice transparency.
