Guide

Hidden Markov models explained

Harbor Analytics monitored centrifugal pumps with fixed vibration thresholds. Normal load spikes tripped false alarms, while gradual bearing wear stayed below the line until catastrophic failure, and maintenance crews learned to ignore alerts. The reliability team reframed the problem as a hidden Markov model (HMM): each hour the pump occupies a hidden state (Healthy, Degrading, or Failing) that cannot be measured directly but emits observable vibration spectra drawn from a state-dependent distribution. States evolve according to a Markov transition matrix (wear progresses, rarely reverses). After Baum-Welch training on 18 months of hourly data, supplemented by 14 labeled failure trajectories, posterior-based alerts flagged Degrading pumps a median of 11 days before failure versus 3 days with thresholds, and the false positive rate fell 41% in relative terms (from 7.1% to 4.2%).

An HMM is a standard probabilistic model for sequences where the generating mechanism is latent: you specify hidden states, transition probabilities, and emission distributions, then solve three problems: likelihood, decoding, and learning. This guide builds the HMM tuple, walks through the forward-backward, Viterbi, and Baum-Welch algorithms, covers discrete and Gaussian emissions, works the Harbor pump case, provides a method decision table, lists pitfalls, and ends with a practitioner checklist. For visible-state Markov processes without emissions, see Markov chains; for sequential decisions with actions and rewards, see Markov decision processes.

The HMM tuple: λ = (A, B, π)

A discrete-time HMM with N hidden states and observation alphabet V (or continuous emission family) is defined by:

State set S = {1, …, N}: latent regimes you cannot observe directly. Harbor uses N = 3: Healthy, Degrading, Failing.
Initial distribution π: P(q₁ = i) for each state i. New pumps start almost entirely in Healthy.
Transition matrix A: a_ij = P(q_{t+1} = j | q_t = i). Rows sum to 1. Wear models enforce a_{Healthy,Healthy} high and a_{Failing,Healthy} ≈ 0.
Emission matrix or density B: b_j(o) = P(o_t = o | q_t = j) for discrete observations, or a parametric density (Gaussian, Poisson) for continuous or count data. Harbor models log-power in three frequency bands as a multivariate Gaussian per state.

A joint sequence is hidden states Q = q₁, …, q_T and observations O = o₁, …, o_T. The generative story: sample q₁ ~ π, emit o₁ ~ b_{q₁}, then for t = 2, …, T sample q_t ~ a_{q_{t-1}, ·} and emit o_t ~ b_{q_t}. The Markov property applies to hidden states: the next state depends only on the current state, not the full history.
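The generative story above can be sketched in a few lines of Python. This is a minimal illustration for discrete emissions only; the function name `sample_hmm`, the three vibration bins, and every number in `pi`, `A`, and `B` are hypothetical and are not Harbor's fitted values.

```python
import random


def sample_hmm(pi, A, B, T, rng):
    """Draw (states, observations) of length T from a discrete-emission HMM."""

    def draw(probs):
        # Sample an index according to a probability vector.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    states, obs = [], []
    q = draw(pi)                   # q_1 ~ pi
    for _ in range(T):
        states.append(q)
        obs.append(draw(B[q]))     # o_t ~ b_{q_t}
        q = draw(A[q])             # q_{t+1} ~ a_{q_t, .} (last draw is unused)
    return states, obs


# Hypothetical toy parameters, chosen only for illustration.
# States: 0 = Healthy, 1 = Degrading, 2 = Failing.
# Observations: 0 = low, 1 = medium, 2 = high vibration bin.
pi = [0.95, 0.04, 0.01]
A = [[0.97, 0.03, 0.00],
     [0.00, 0.95, 0.05],
     [0.00, 0.00, 1.00]]
B = [[0.80, 0.18, 0.02],
     [0.30, 0.50, 0.20],
     [0.05, 0.25, 0.70]]

states, obs = sample_hmm(pi, A, B, T=24, rng=random.Random(0))
```

Because the toy A has zeros below the diagonal, every sampled state path is non-decreasing, which mirrors the one-way wear assumption.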

Why “hidden”?

Observations are noisy projections of latent structure. Two consecutive hours of elevated vibration might mean a passing load surge (Healthy) or irreversible wear onset (Degrading). The HMM marginalizes over state uncertainty instead of committing to a single hard label per timestep.

Three inference problems

Rabiner's formulation organizes HMM work into three problems:

Evaluation (likelihood): given model λ = (A, B, π) and observation sequence O, compute P(O | λ). Used for model comparison and anomaly scoring.
Decoding: find the most likely hidden path Q* = argmax_Q P(Q, O | λ). Solved by the Viterbi algorithm (dynamic programming in O(N²T)).
Learning: estimate λ from training sequences {O^(m)} with or without labeled states. Baum-Welch is EM specialized to HMMs.

Forward algorithm (evaluation)

Define the forward variable α_t(i) = P(o₁, …, o_t, q_t = i | λ). Initialize α₁(i) = π_i b_i(o₁), then recurse for t = 2, …, T: α_t(j) = [∑_i α_{t-1}(i) a_ij] b_j(o_t). Then P(O | λ) = ∑_i α_T(i), at a cost of O(N²T). Work in log-space or apply scaling factors to avoid underflow on long sequences.
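As a sketch of the recursion with per-step scaling, the function below normalizes α at every timestep and accumulates the log of the scaling factors, which sum to log P(O | λ). The brute-force enumerator is included only as a cross-check on tiny inputs. The function names and the two-state toy model are illustrative assumptions, not part of the Harbor system.

```python
import math
from itertools import product


def forward_loglik(pi, A, B, obs):
    """Scaled forward pass; returns log P(O | lambda) for discrete emissions."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                     for j in range(N)]
        c = sum(alpha)                 # c_t = P(o_t | o_1, ..., o_{t-1})
        alpha = [a / c for a in alpha]
        loglik += math.log(c)
    return loglik


def brute_force_lik(pi, A, B, obs):
    """Sum P(Q, O) over all N^T state paths. Only feasible for tiny T."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Hypothetical two-state, three-symbol toy model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 0]

ll = forward_loglik(pi, A, B, obs)
```

Comparing `math.exp(ll)` with `brute_force_lik(pi, A, B, obs)` on a short sequence is a cheap unit test for any forward implementation.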

Forward-backward (posterior state occupation)

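The backward pass computes β_t(i) = P(o_{t+1}, …, o_T | q_t = i, λ), the probability of the future observations given state i at time t, with β_T(i) = 1. Combining forward and backward variables yields the smoothed posteriors γ_t(i) = P(q_t = i | O, λ) = α_t(i) β_t(i) / P(O | λ) and ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ), which Baum-Welch uses in the M-step.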
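A compact sketch of forward-backward with scaling follows. It assumes discrete emissions and reuses the forward scaling factors in the backward pass, so that the product of scaled α and β is already the posterior γ. The function name and toy parameters are illustrative assumptions.

```python
def forward_backward(pi, A, B, obs):
    """Return (gamma, xi) for one observation sequence with discrete emissions."""
    N, T = len(pi), len(obs)

    # Scaled forward pass: each alpha[t] sums to 1, scale[t] stores the normalizer.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(N)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                   for j in range(N)]
        c = sum(row)
        alpha.append([x / c for x in row])
        scale.append(c)

    # Scaled backward pass, starting from beta_T(i) = 1.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
            / scale[t + 1]
            for i in range(N)
        ]

    # gamma_t(i) = P(q_t = i | O); with this scaling no extra normalizer is needed.
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]

    # xi_t(i, j) = P(q_t = i, q_{t+1} = j | O)
    xi = [
        [[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
          for j in range(N)] for i in range(N)]
        for t in range(T - 1)
    ]
    return gamma, xi


# Hypothetical two-state, three-symbol toy model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 0]

gamma, xi = forward_backward(pi, A, B, obs)
```

Two consistency checks are worth keeping in a test suite: each `gamma[t]` sums to 1, and summing `xi[t][i][j]` over j recovers `gamma[t][i]`.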

Viterbi (decoding)

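Viterbi uses the forward recursion with sums replaced by maxes. Maintain v_t(j) = max_{q₁…q_{t-1}} P(q₁, …, q_{t-1}, q_t = j, o₁, …, o_t | λ), updated as v_t(j) = max_i [v_{t-1}(i) a_ij] b_j(o_t), and store a backpointer to the maximizing i at each step. Backtracking from argmax_j v_T(j) recovers the single best joint state path. This is not the same as taking the argmax of the marginals γ_t(i) at each timestep, which can produce a path containing transitions that have zero probability under A.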
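The recursion and backtracking can be sketched in log-space as follows. This assumes discrete emissions; the function name and the two-state toy model are hypothetical, and zero probabilities are mapped to negative infinity so forbidden transitions are never selected.

```python
import math


def viterbi(pi, A, B, obs):
    """Return (best_path, log_prob) for a discrete-emission HMM."""
    N, T = len(pi), len(obs)

    def log(p):
        # log 0 is treated as -inf so forbidden moves lose every max.
        return math.log(p) if p > 0 else float("-inf")

    v = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    back = []                                   # back[t-1][j] = best predecessor of j at time t
    for t in range(1, T):
        new_v, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: v[i] + log(A[i][j]))
            new_v.append(v[best_i] + log(A[best_i][j]) + log(B[j][obs[t]]))
            ptr.append(best_i)
        v = new_v
        back.append(ptr)

    # Backtrack from the best final state.
    q = max(range(N), key=lambda i: v[i])
    best_logp = v[q]
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    path.reverse()
    return path, best_logp


# Hypothetical two-state, three-symbol toy model.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

path, logp = viterbi(pi, A, B, [0, 1, 2])   # path == [0, 0, 1], exp(logp) ≈ 0.01512
```

Working the same three-step example by hand (0.6 · 0.5, then · 0.7 · 0.4, then · 0.3 · 0.6 = 0.01512) is a useful check that the backpointers are wired correctly.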

Baum-Welch: EM for HMMs

When hidden states are unlabeled, Baum-Welch alternates:

E-step: run forward-backward on each training sequence to estimate expected state occupations γ and transition counts ξ.
M-step: update π, A, and B to maximize expected complete-data log-likelihood. Transition updates are weighted counts; emission updates depend on the emission family (multinomial MLE for discrete, Gaussian mean/covariance for continuous).

EM climbs to a local optimum — initialization matters. Common strategies: random restarts, k-means on observation features to seed states, or domain-informed priors (e.g. left-to-right topology for speech). Convergence is monitored via log-likelihood plateaus; set max iterations and early-stop tolerance.
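To make the E-step and M-step concrete, here is a minimal sketch of one Baum-Welch iteration for a single sequence with discrete emissions. It is deliberately bare: no restarts, no smoothing, and no guard against a state whose expected occupancy collapses to zero. The function name `em_step`, the starting parameters, and the observation sequence are all hypothetical.

```python
import math


def em_step(pi, A, B, obs):
    """One Baum-Welch iteration. Returns (pi, A, B, loglik), where loglik is
    log P(O | lambda) under the parameters passed in."""
    N, M, T = len(pi), len(B[0]), len(obs)

    # E-step: scaled forward-backward.
    alpha, c = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(N)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                   for j in range(N)]
        c.append(sum(row))
        alpha.append([x / c[t] for x in row])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   / c[t + 1] for i in range(N)]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]

    # M-step: expected counts, normalized per row.
    new_pi = gamma[0][:]
    new_A = []
    for i in range(N):
        denom = sum(gamma[t][i] for t in range(T - 1))
        new_A.append([
            sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / c[t + 1]
                for t in range(T - 1)) / denom
            for j in range(N)
        ])
    new_B = []
    for j in range(N):
        denom = sum(gamma[t][j] for t in range(T))
        new_B.append([sum(gamma[t][j] for t in range(T) if obs[t] == k) / denom
                      for k in range(M)])
    return new_pi, new_A, new_B, sum(math.log(x) for x in c)


# Hypothetical starting point and data.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 0, 1, 0, 2, 2, 1, 2, 2, 0, 0, 1, 0, 0, 2, 2, 2, 1, 0, 0]

history = []
for _ in range(10):
    pi, A, B, loglik = em_step(pi, A, B, obs)
    history.append(loglik)
```

The values in `history` should never decrease from one iteration to the next; a drop usually signals a bug in the E-step rather than a property of the data.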

Supervised learning shortcut

When historical state labels exist (post-mortem failure analysis, hand-annotated POS tags), estimate A and B by counting transitions and emissions directly; no EM is required. Harbor combined both approaches: Baum-Welch on unlabeled data from 200 pumps, plus 14 labeled outages to stabilize the rare Failing emissions.
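Counting can be sketched as below for discrete emissions. The additive pseudo-count is an assumption added here to keep unseen events from getting probability zero; it is not described in the Harbor case, and it also puts a little mass on transitions you may want to forbid. The function name and the labeled toy sequence are hypothetical.

```python
def supervised_mle(state_seqs, obs_seqs, N, M, pseudo=1.0):
    """Estimate A (N x N) and B (N x M) from labeled sequences by smoothed counting."""
    A = [[pseudo] * N for _ in range(N)]
    B = [[pseudo] * M for _ in range(N)]
    for states, obs in zip(state_seqs, obs_seqs):
        for q, o in zip(states, obs):
            B[q][o] += 1                      # emission count
        for q, q_next in zip(states, states[1:]):
            A[q][q_next] += 1                 # transition count

    def normalize(rows):
        # Turn each row of counts into a probability distribution.
        return [[x / sum(r) for x in r] for r in rows]

    return normalize(A), normalize(B)


# Hypothetical labeled trajectory: 0 = Healthy, 1 = Degrading, 2 = Failing.
state_seqs = [[0, 0, 1, 1, 2]]
obs_seqs = [[0, 0, 1, 2, 2]]

A_hat, B_hat = supervised_mle(state_seqs, obs_seqs, N=3, M=3)
# A_hat[0] == [0.4, 0.4, 0.2]: counts [1, 1, 0] plus one pseudo-count each, divided by 5.
```

With `pseudo=0` the estimate is the plain MLE, but any state that never appears in the labels then produces a division by zero, so some guard is needed in that case.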

Topology constraints

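Common topologies are ergodic (fully connected), left-to-right (states only advance, as with speech phonemes or wear progression), and Bakis (left-to-right with skip transitions). The constrained topologies reduce the number of free parameters and the risk of overfitting. Forbidden transitions are zeroed in A before training; Baum-Welch re-estimation keeps a zero entry at zero, so the constraint survives training.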
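One way to encode a topology is a 0/1 mask applied to the initial transition matrix before training. The sketch below is an illustration under that assumption; the helper names and the `max_jump` parameter are hypothetical conventions, not a standard API.

```python
def left_to_right_mask(N, max_jump=1):
    """1 where a transition i -> j is allowed: stay, or advance by at most max_jump."""
    return [[1 if 0 <= j - i <= max_jump else 0 for j in range(N)] for i in range(N)]


def masked_transitions(mask, weights=None):
    """Row-normalize weights where mask is 1; forbidden entries stay exactly 0."""
    N = len(mask)
    A = []
    for i in range(N):
        row = [(weights[i][j] if weights else 1.0) * mask[i][j] for j in range(N)]
        total = sum(row)
        if total == 0:
            raise ValueError(f"state {i} has no allowed outgoing transition")
        A.append([x / total for x in row])
    return A


# Strict left-to-right over three states (stay or advance one step).
A_init = masked_transitions(left_to_right_mask(3))
# A_init == [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]

# Allowing a jump of two states adds the skip transition 0 -> 2.
A_skip = masked_transitions(left_to_right_mask(3, max_jump=2))
```

Uniform weights are only a starting point; in practice the nonzero entries would be seeded from domain knowledge or supervised counts and then refined by training.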

Discrete vs Gaussian emissions

Discrete HMMs suit categorical observations: part-of-speech tags from words, DNA nucleotides, regime labels discretized into bins. Emission matrix B has shape N × |V|.

Gaussian HMMs model continuous vectors (sensor readings, MFCC frames, log-returns). Each state carries mean vector μ_j and covariance Σ_j. Mixture-of-Gaussians HMMs (GMM-HMM) add mixture components per state — the classical acoustic model in speech recognition before deep learning.

High-dimensional continuous observations often need dimensionality reduction (PCA on spectra) or diagonal covariances to keep parameters identifiable with limited data.
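To see why diagonal covariances help, it is useful to count free parameters. The sketch below gives a diagonal Gaussian log-density and a parameter count for a Gaussian HMM with unconstrained transitions; topology constraints would lower the transition term further. Function names are hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def n_free_params(N, d, covariance="full"):
    """Free parameters of a Gaussian HMM with N states and d-dimensional emissions."""
    start = N - 1                    # pi sums to 1
    trans = N * (N - 1)              # each row of A sums to 1
    means = N * d
    if covariance == "full":
        cov = N * d * (d + 1) // 2   # symmetric d x d matrix per state
    else:
        cov = N * d                  # one variance per dimension per state
    return start + trans + means + cov


# Three states, three log-power bands (the dimensions used in the pump example).
full_3 = n_free_params(3, 3, "full")     # 2 + 6 + 9 + 18 = 35
diag_3 = n_free_params(3, 3, "diag")     # 2 + 6 + 9 + 9  = 26

# A hypothetical 50-bin spectrum shows how quickly full covariances grow.
full_50 = n_free_params(3, 50, "full")   # 3983
diag_50 = n_free_params(3, 50, "diag")   # 308
```

The same count feeds directly into BIC when comparing candidate values of N.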

Harbor Analytics: pump health monitoring

Problem: 200 centrifugal pumps across three plants; hourly vibration snapshots in three log-power bands. Goal: early Degrading alert without crying wolf on load noise.

Model: N = 3 left-to-right states; 3D Gaussian emissions per state; A constrained so Failing cannot return to Healthy. Training: Baum-Welch on 18 months of unlabeled hourly data, initialized from k-means (k = 3) on band features; M-step refreshed with 14 labeled failure trajectories.

Deployment: an hourly forward pass computes the filtered posterior P(q_t = Degrading | o₁, …, o_t), which equals γ_t(Degrading) at the most recent timestep; an alert fires when this posterior exceeds 0.6 for two consecutive hours. Viterbi paths are used in post-incident review, not live paging. Anomaly score = negative log P(O | λ) for pumps with no recent maintenance.
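The alerting rule can be sketched as an online filter followed by a run-length check. This is an illustration under assumptions, not Harbor's production code: the emission likelihoods in `liks` stand in for per-state Gaussian densities evaluated on each hourly reading, and all numbers and function names are hypothetical.

```python
def filtered_posteriors(pi, A, liks):
    """Return P(q_t = j | o_1, ..., o_t) for every t.

    liks[t][j] is the emission likelihood b_j(o_t), computed elsewhere.
    """
    N = len(pi)
    out, prev = [], None
    for lik in liks:
        # Predict: use pi at the first step, otherwise push the last posterior through A.
        pred = pi if prev is None else [
            sum(prev[i] * A[i][j] for i in range(N)) for j in range(N)
        ]
        # Update: weight by the emission likelihood and renormalize.
        unnorm = [pred[j] * lik[j] for j in range(N)]
        z = sum(unnorm)
        prev = [u / z for u in unnorm]
        out.append(prev)
    return out


def first_alert(probs, threshold=0.6, consecutive=2):
    """Index of the first timestep that completes `consecutive` readings above threshold."""
    run = 0
    for t, p in enumerate(probs):
        run = run + 1 if p > threshold else 0
        if run >= consecutive:
            return t
    return None


# Hypothetical parameters: 0 = Healthy, 1 = Degrading, 2 = Failing.
pi = [0.95, 0.04, 0.01]
A = [[0.97, 0.03, 0.00],
     [0.00, 0.95, 0.05],
     [0.00, 0.00, 1.00]]
liks = [[0.90, 0.20, 0.01],
        [0.80, 0.30, 0.02],
        [0.10, 0.70, 0.10],
        [0.05, 0.80, 0.20],
        [0.05, 0.80, 0.20]]

post = filtered_posteriors(pi, A, liks)
alert_at = first_alert([p[1] for p in post])   # hour index of the alert, or None
```

Requiring consecutive readings trades a little latency for robustness against single-hour load spikes; the threshold and run length are tuning choices.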

Results: median lead time 11 days (vs 3 with static thresholds); false positive rate 4.2% (vs 7.1%, a 41% relative reduction); one missed failure traced to a sensor dropout, fixed by adding a missing-data mask in the forward pass.

Method decision table

Your situation	HMM	Markov chain (observed states)	RNN / Transformer	Change-point detection
Latent regimes, modest N, limited labels	Strong fit	Wrong tool (no hidden layer)	Overkill; needs more data	Partial (no state persistence model)
Long sequences, huge vocabulary (text, audio)	Struggles (sparse emissions)	N/A	Strong fit	N/A
Known state sequence, count transitions	Supervised MLE on A, B	Strong fit	Overkill	N/A
Real-time decoding on embedded hardware	Fast (O(N²T))	Trivial	Heavy inference	Varies
Irregular timestamps, variable gaps	Needs extension (continuous-time HMM)	Discrete-time assumption	More flexible	Moderate
Interpretable regime labels for auditors	States are explicit	Moderate	Black box	Breakpoints only

Common pitfalls

Too many hidden states: EM fits noise; validation log-likelihood peaks then drops. Use BIC or held-out sequences to pick N.
Numeric underflow: multiplying thousands of probabilities yields zero. Always scale forward variables or accumulate log-probabilities with log-sum-exp.
Label switching: during unsupervised training, state indices permute between restarts. Align states post-hoc via domain semantics or canonical ordering.
Non-stationary emissions: seasonal load shifts change vibration baselines; retrain quarterly or add exogenous inputs (temperature, throughput) to emissions.
Confusing Viterbi with marginals: argmax per-timestep marginals can imply impossible transitions; use Viterbi for path explanations.
Ignoring missing observations: sensor gaps break the recurrence if skipped silently. Mask the missing timesteps (apply the transition step but set the emission term to 1) or impute with care.
Equating HMMs with Markov chains: observed-state chains cannot model emission noise; HMMs cannot model action-dependent transitions (that is an MDP/POMDP).
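Two of the pitfalls above, underflow and missing observations, can be handled in the same forward pass. The sketch below works entirely in log-space with log-sum-exp and treats a `None` observation as missing by contributing log 1 = 0 to the emission term. The function names and toy model are hypothetical; the toy parameters contain no zeros, and a forbidden transition would be represented as `float("-inf")`.

```python
import math


def logsumexp(xs):
    """Stable log(sum(exp(x))) for a list of log-values."""
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_forward(log_pi, log_A, log_B, obs):
    """Return log P(O | lambda); entries of obs may be None for missing readings."""
    N = len(log_pi)

    def emit(j, o):
        # A missing observation carries no evidence: b_j(.) is replaced by 1.
        return 0.0 if o is None else log_B[j][o]

    la = [log_pi[j] + emit(j, obs[0]) for j in range(N)]
    for o in obs[1:]:
        la = [logsumexp([la[i] + log_A[i][j] for i in range(N)]) + emit(j, o)
              for j in range(N)]
    return logsumexp(la)


# Hypothetical two-state, three-symbol toy model, converted to logs.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
log_pi = [math.log(p) for p in pi]
log_A = [[math.log(p) for p in row] for row in A]
log_B = [[math.log(p) for p in row] for row in B]

with_gap = log_forward(log_pi, log_A, log_B, [0, None, 2])
long_run = log_forward(log_pi, log_A, log_B, [0] * 5000)   # finite, no underflow
```

Masking a timestep is equivalent to summing the likelihood over every possible value of the missing symbol, which gives a direct test for the implementation.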

Practitioner checklist

Define hidden states with domain meaning; constrain topology if progression is one-way.
Choose discrete vs Gaussian (or GMM) emissions based on observation type and dimensionality.
Split data by unit (pump, customer, session) — never shuffle timesteps across entities.
Initialize EM from k-means or supervised counts; run multiple restarts and keep best log-likelihood.
Validate on held-out sequences: log-likelihood, state-aligned precision/recall if labels exist.
Implement forward/backward in log-space; unit-test against a tiny hand-checked example.
Deploy posterior thresholds or Viterbi paths depending on whether you need soft scores or single explanations.
Monitor emission drift; schedule retraining when likelihood on recent data degrades.
Document N, topology, and state semantics for operations and compliance teams.
Benchmark against a simple baseline (thresholds, HMM with N = 2) before claiming lift.

Key takeaways

An HMM couples a Markov chain over hidden states with state-conditional emission distributions.
Three problems — likelihood, decoding, learning — map to forward, Viterbi, and Baum-Welch algorithms.
Baum-Welch is EM: the E-step computes expected state occupations and transition counts; the M-step updates π, A, and B.
Topology constraints and careful state count selection limit overfitting on short sequences.
HMMs remain strong baselines for interpretable regime detection; deep sequence models win on scale but sacrifice transparency.