Guide

Markov chains explained

Will tomorrow's weather depend on every storm from the past decade, or mainly on today's sky? A Markov chain assumes the latter: the future depends only on the present state, not the full history. That Markov property (memorylessness) turns messy stochastic processes into tractable math — transition matrices, steady-state distributions, and simulation — powering PageRank, speech recognition, queueing models, and the state dynamics inside reinforcement learning. This guide explains discrete-time chains, transition matrices, stationary distributions, hidden Markov models, a Harbor Courier delivery-tracking worked example, a model decision table, common pitfalls, and a practical checklist. For Bayesian updates on beliefs, see Bayesian inference explained; for forecasting ordered data, see time series forecasting.

What a Markov chain is

A Markov chain is a sequence of random variables X0, X1, X2, ... where each variable takes a value from a state space S. The defining rule:

P(Xt+1 = j | Xt = i, Xt-1, ..., X0) = P(Xt+1 = j | Xt = i)

Knowing the current state i is sufficient; earlier states add no predictive power. States might be weather categories (sunny, rainy), web pages in a crawl graph, inventory levels, or game board positions. Time steps can be hours, user clicks, or delivery scan events — whatever fits the problem.

Discrete vs continuous time

This guide focuses on discrete-time chains: transitions happen at integer steps t = 0, 1, 2, .... Continuous-time Markov chains (used in queueing theory and chemical kinetics) specify transition rates rather than per-step probabilities. The intuition is identical; only the clock changes.

Transition matrices

When the state space is finite with n states, all transition probabilities fit in an n × n transition matrix P. Entry Pij is the probability of moving from state i to state j in one step. Each row sums to 1. Multiply a row vector of state probabilities π by P to advance one step: π(t+1) = π(t) P.

After k steps, π(t+k) = π(t) Pk. Matrix powers reveal multi-step behavior without simulating every hop. If you start in state i with probability 1, the i-th row of Pk gives the distribution after k steps.

Example: two-state weather chain

Suppose sunny (S) and rainy (R) with transitions: from S, stay sunny with 0.8 and turn rainy with 0.2; from R, stay rainy with 0.6 and clear with 0.4.

P = [[0.8, 0.2], [0.6, 0.4]] (rows S, R). If today is sunny, tomorrow's distribution is [0.8, 0.2]. After many days, the distribution stabilizes regardless of today's weather — that limit is the stationary distribution.

Stationary distributions and ergodicity

A distribution π is stationary if π P = π — once you reach it, you stay there in expectation. Equivalently, π is a left eigenvector of P with eigenvalue 1. For irreducible, aperiodic finite chains, a unique stationary distribution exists and powers of P converge to it from any start. That is ergodic behavior: long-run proportions of time spent in each state match π.

Not every chain is nice. Reducible chains have disconnected components — where you end up depends on where you started. Periodic chains oscillate (e.g., alternating between two states every step) and may not converge to a single distribution. Absorbing states (Pii = 1) trap the process forever once entered — useful for modeling completion events, fatal in models that need ongoing dynamics.

Computing stationary distributions

Solve π P = π with ∑ πi = 1, or find the eigenvector for eigenvalue 1. For small matrices, hand algebra works; for large sparse graphs (web links), power iteration — repeatedly multiply an initial vector by P until convergence — is standard. Google's PageRank is a Markov chain on web pages with a teleport term so every page stays reachable.

Hidden Markov models (HMMs)

Often you observe outputs that depend on a hidden state. In speech recognition, the hidden state is a phoneme; the observation is acoustic features. A hidden Markov model adds:

  • Transition matrix P — how hidden states evolve (same as a plain Markov chain).
  • Emission probabilities — P(observation | state) for each state.
  • Initial distribution — where the chain starts.

Three classic inference tasks: (1) filtering/smoothing — estimate the most likely hidden state sequence given observations (Viterbi algorithm); (2) likelihood — score how well the model explains data (forward algorithm); (3) learning — estimate P and emissions from sequences (Baum-Welch, an EM variant). HMMs preceded deep learning in NLP and bioinformatics; modern sequence models often replace them with RNNs and transformers, but HMMs remain interpretable baselines for short sequences and low data.

Where Markov chains appear in machine learning

  • Reinforcement learning — environments are often formalized as Markov decision processes (MDP): Markov state transitions plus rewards and actions. See our RL guide.
  • Text generation — n-gram language models are Markov chains over words (order-n Markov: next word depends on previous n-1). GPT-scale transformers break the finite-order assumption but inherit the sequential state idea.
  • MCMC sampling — Markov Chain Monte Carlo builds chains whose stationary distribution is a target posterior, enabling Bayesian inference on complex models.
  • Recommendation and graphs — random walks on user-item or page-link graphs produce embeddings and ranking scores.
  • Queueing and operations — customer arrivals, server states, and inventory levels are modeled as chains for capacity planning.

Worked example: Harbor Courier package tracking

Harbor Courier tracks parcels through five scan states: Accepted (A), InTransit (T), AtHub (H), OutForDelivery (D), and Delivered (X, absorbing). Historical scan logs estimate one-step transition probabilities per business day:

A → T (0.85), A → H (0.10), A → A (0.05); T → T (0.70), T → H (0.25), T → D (0.05); H → D (0.60), H → H (0.35), H → T (0.05); D → X (0.80), D → D (0.20); X → X (1.0).

Operations wants expected days until delivery for a package currently InTransit. Engineers build matrix P (5×5), compute Pk until row T has > 99% mass on X, or solve a system of expected hitting times. They find mean 2.4 days from T, 1.1 days from D — informing SLA promises and hub staffing. A naive forecast that averages all packages regardless of current state would over-promise; conditioning on state is the Markov insight.

When Harbor adds a WeatherDelay hidden state (observed only through longer-than-usual T→T self-loops), they fit an HMM: emissions are delay hours, hidden states are {Normal, Storm}. Viterbi decoding labels historical routes; the learned P adjusts ETA models during winter. This is simpler and more auditable than a black-box neural net for sparse regional data.

Model decision table

Problem shape States observed? Typical model When to use
Fixed rules, small state space Yes Discrete Markov chain + matrix P Queueing, inventory, click streams with known states
Sequential outputs, hidden causes No (hidden) Hidden Markov model Speech tags, regime detection, partial sensor data
Actions change transitions Yes Markov decision process (MDP) Robotics, game AI, pricing — see RL guide
High-dimensional, long context Partial RNN / Transformer Language, video — when Markov order is insufficient
Sample from posterior N/A MCMC (Metropolis-Hastings, Gibbs) Bayesian posteriors without closed form
Graph ranking / embeddings Yes (nodes) Random-walk Markov chain (PageRank) Web search, recommendation, knowledge graphs

Common pitfalls

  • Assuming Markovian data when memory matters — stock returns, user sessions with long-term preferences, and seasonal demand often need higher-order or exogenous variables.
  • Estimating P from too little data — rare transitions get 0 probability and block paths; use Laplace smoothing or Bayesian priors.
  • Ignoring non-stationarity — transition rates drift (new hub opens, policy change). Retrain P on rolling windows.
  • Confusing stationary distribution with short-run behavior — early transients dominate short horizons; steady-state answers the wrong question for same-day ETAs.
  • Absorbing states without planning — once Delivered is reached, the chain stops; models for ongoing churn need recurrent non-absorbing formulations.
  • Overfitting HMM state count — too many hidden states fit noise; use BIC or held-out likelihood to pick K.
  • Periodic chains — checking Pk for k = 1..10 reveals oscillation that averages hide.

Practical checklist

  • Define states so the Markov property is approximately true — merge history into state if needed (e.g., "2 days in transit" vs "in transit").
  • Verify each row of P sums to 1; clip negative estimates from noisy counts.
  • Plot Pk for increasing k — check convergence vs periodicity.
  • Compute stationary π and compare to empirical long-run frequencies as a sanity check.
  • For hitting times (time to absorption), use linear systems or simulation with confidence intervals.
  • When states are hidden, start with 2–3 HMM states before scaling complexity.
  • Document non-stationarity triggers (season, product launch) for model refresh.
  • Benchmark against naive baselines: last-state persistence, global average.
  • For ML pipelines, export P and π — stakeholders often trust interpretable matrices over opaque scores.
  • Link to domain simulators: if actions matter, upgrade chain to MDP and RL tooling.

Key takeaways

  • Markov chains model sequences where only the current state predicts the next step.
  • The transition matrix P encodes all one-step dynamics; powers of P give multi-step forecasts.
  • Stationary distributions describe long-run behavior when the chain is ergodic.
  • HMMs extend chains to hidden states with observable emissions — classic tool for sequence labeling.
  • Markov structure underpins RL, PageRank, MCMC, and operational forecasting — a foundational bridge between probability and machine learning.

Related reading