Guide
Markov chains explained
Will tomorrow's weather depend on every storm from the past decade, or mainly on today's sky? A Markov chain assumes the latter: the future depends only on the present state, not the full history. That Markov property (memorylessness) turns messy stochastic processes into tractable math — transition matrices, steady-state distributions, and simulation — powering PageRank, speech recognition, queueing models, and the state dynamics inside reinforcement learning. This guide explains discrete-time chains, transition matrices, stationary distributions, hidden Markov models, a Harbor Courier delivery-tracking worked example, a model decision table, common pitfalls, and a practical checklist. For Bayesian updates on beliefs, see Bayesian inference explained; for forecasting ordered data, see time series forecasting.
What a Markov chain is
A Markov chain is a sequence of random variables X0, X1, X2, ... where each variable takes a value from a state space S. The defining rule:
P(Xt+1 = j | Xt = i, Xt-1, ..., X0) = P(Xt+1 = j | Xt = i)
Knowing the current state i is sufficient; earlier states add no predictive power. States might be weather categories (sunny, rainy), web pages in a crawl graph, inventory levels, or game board positions. Time steps can be hours, user clicks, or delivery scan events — whatever fits the problem.
Discrete vs continuous time
This guide focuses on discrete-time chains: transitions happen at integer steps t = 0, 1, 2, .... Continuous-time Markov chains (used in queueing theory and chemical kinetics) specify transition rates rather than per-step probabilities. The intuition is identical; only the clock changes.
Transition matrices
When the state space is finite with n states, all transition probabilities fit in an n × n transition matrix P. Entry Pij is the probability of moving from state i to state j in one step. Each row sums to 1. Multiply a row vector of state probabilities π by P to advance one step: π(t+1) = π(t) P.
After k steps, π(t+k) = π(t) Pk. Matrix powers reveal multi-step behavior without simulating every hop. If you start in state i with probability 1, the i-th row of Pk gives the distribution after k steps.
Example: two-state weather chain
Suppose sunny (S) and rainy (R) with transitions: from S, stay sunny with 0.8 and turn rainy with 0.2; from R, stay rainy with 0.6 and clear with 0.4.
P = [[0.8, 0.2], [0.6, 0.4]] (rows S, R). If today is sunny, tomorrow's distribution is [0.8, 0.2]. After many days, the distribution stabilizes regardless of today's weather — that limit is the stationary distribution.
Stationary distributions and ergodicity
A distribution π is stationary if π P = π — once you reach it, you stay there in expectation. Equivalently, π is a left eigenvector of P with eigenvalue 1. For irreducible, aperiodic finite chains, a unique stationary distribution exists and powers of P converge to it from any start. That is ergodic behavior: long-run proportions of time spent in each state match π.
Not every chain is nice. Reducible chains have disconnected components — where you end up depends on where you started. Periodic chains oscillate (e.g., alternating between two states every step) and may not converge to a single distribution. Absorbing states (Pii = 1) trap the process forever once entered — useful for modeling completion events, fatal in models that need ongoing dynamics.
Computing stationary distributions
Solve π P = π with ∑ πi = 1, or find the eigenvector for eigenvalue 1. For small matrices, hand algebra works; for large sparse graphs (web links), power iteration — repeatedly multiply an initial vector by P until convergence — is standard. Google's PageRank is a Markov chain on web pages with a teleport term so every page stays reachable.
Hidden Markov models (HMMs)
Often you observe outputs that depend on a hidden state. In speech recognition, the hidden state is a phoneme; the observation is acoustic features. A hidden Markov model adds:
- Transition matrix P — how hidden states evolve (same as a plain Markov chain).
- Emission probabilities — P(observation | state) for each state.
- Initial distribution — where the chain starts.
Three classic inference tasks: (1) filtering/smoothing — estimate the most likely hidden state sequence given observations (Viterbi algorithm); (2) likelihood — score how well the model explains data (forward algorithm); (3) learning — estimate P and emissions from sequences (Baum-Welch, an EM variant). HMMs preceded deep learning in NLP and bioinformatics; modern sequence models often replace them with RNNs and transformers, but HMMs remain interpretable baselines for short sequences and low data.
Where Markov chains appear in machine learning
- Reinforcement learning — environments are often formalized as Markov decision processes (MDP): Markov state transitions plus rewards and actions. See our RL guide.
- Text generation — n-gram language models are Markov chains over words (order-n Markov: next word depends on previous n-1). GPT-scale transformers break the finite-order assumption but inherit the sequential state idea.
- MCMC sampling — Markov Chain Monte Carlo builds chains whose stationary distribution is a target posterior, enabling Bayesian inference on complex models.
- Recommendation and graphs — random walks on user-item or page-link graphs produce embeddings and ranking scores.
- Queueing and operations — customer arrivals, server states, and inventory levels are modeled as chains for capacity planning.
Worked example: Harbor Courier package tracking
Harbor Courier tracks parcels through five scan states: Accepted (A), InTransit (T), AtHub (H), OutForDelivery (D), and Delivered (X, absorbing). Historical scan logs estimate one-step transition probabilities per business day:
A → T (0.85), A → H (0.10), A → A (0.05); T → T (0.70), T → H (0.25), T → D (0.05); H → D (0.60), H → H (0.35), H → T (0.05); D → X (0.80), D → D (0.20); X → X (1.0).
Operations wants expected days until delivery for a package currently InTransit. Engineers build matrix P (5×5), compute Pk until row T has > 99% mass on X, or solve a system of expected hitting times. They find mean 2.4 days from T, 1.1 days from D — informing SLA promises and hub staffing. A naive forecast that averages all packages regardless of current state would over-promise; conditioning on state is the Markov insight.
When Harbor adds a WeatherDelay hidden state (observed only through longer-than-usual T→T self-loops), they fit an HMM: emissions are delay hours, hidden states are {Normal, Storm}. Viterbi decoding labels historical routes; the learned P adjusts ETA models during winter. This is simpler and more auditable than a black-box neural net for sparse regional data.
Model decision table
| Problem shape | States observed? | Typical model | When to use |
|---|---|---|---|
| Fixed rules, small state space | Yes | Discrete Markov chain + matrix P | Queueing, inventory, click streams with known states |
| Sequential outputs, hidden causes | No (hidden) | Hidden Markov model | Speech tags, regime detection, partial sensor data |
| Actions change transitions | Yes | Markov decision process (MDP) | Robotics, game AI, pricing — see RL guide |
| High-dimensional, long context | Partial | RNN / Transformer | Language, video — when Markov order is insufficient |
| Sample from posterior | N/A | MCMC (Metropolis-Hastings, Gibbs) | Bayesian posteriors without closed form |
| Graph ranking / embeddings | Yes (nodes) | Random-walk Markov chain (PageRank) | Web search, recommendation, knowledge graphs |
Common pitfalls
- Assuming Markovian data when memory matters — stock returns, user sessions with long-term preferences, and seasonal demand often need higher-order or exogenous variables.
- Estimating P from too little data — rare transitions get 0 probability and block paths; use Laplace smoothing or Bayesian priors.
- Ignoring non-stationarity — transition rates drift (new hub opens, policy change). Retrain P on rolling windows.
- Confusing stationary distribution with short-run behavior — early transients dominate short horizons; steady-state answers the wrong question for same-day ETAs.
- Absorbing states without planning — once Delivered is reached, the chain stops; models for ongoing churn need recurrent non-absorbing formulations.
- Overfitting HMM state count — too many hidden states fit noise; use BIC or held-out likelihood to pick K.
- Periodic chains — checking Pk for k = 1..10 reveals oscillation that averages hide.
Practical checklist
- Define states so the Markov property is approximately true — merge history into state if needed (e.g., "2 days in transit" vs "in transit").
- Verify each row of P sums to 1; clip negative estimates from noisy counts.
- Plot Pk for increasing k — check convergence vs periodicity.
- Compute stationary π and compare to empirical long-run frequencies as a sanity check.
- For hitting times (time to absorption), use linear systems or simulation with confidence intervals.
- When states are hidden, start with 2–3 HMM states before scaling complexity.
- Document non-stationarity triggers (season, product launch) for model refresh.
- Benchmark against naive baselines: last-state persistence, global average.
- For ML pipelines, export P and π — stakeholders often trust interpretable matrices over opaque scores.
- Link to domain simulators: if actions matter, upgrade chain to MDP and RL tooling.
Key takeaways
- Markov chains model sequences where only the current state predicts the next step.
- The transition matrix P encodes all one-step dynamics; powers of P give multi-step forecasts.
- Stationary distributions describe long-run behavior when the chain is ergodic.
- HMMs extend chains to hidden states with observable emissions — classic tool for sequence labeling.
- Markov structure underpins RL, PageRank, MCMC, and operational forecasting — a foundational bridge between probability and machine learning.
Related reading
- Reinforcement learning explained — Markov decision processes, rewards, and policies
- Bayesian inference explained — priors, posteriors, and MCMC sampling
- Time series forecasting explained — ARIMA, seasonality, and ordered prediction
- Multi-armed bandits explained — exploration vs exploitation without full MDP planning