Guide

Markov decision processes explained

Harbor Courier routed same-day parcels with a static shortest-path engine. It ignored traffic surges, driver fatigue windows, and the asymmetric cost of a missed dinner-slot delivery versus an extra two minutes on a low-priority stop. On-time performance plateaued at 91% despite adding vehicles. The ops team reframed routing as a Markov decision process (MDP): at each intersection the dispatcher chooses an action (which stop next, which arterial to take) based on a compact state (vehicle location, remaining capacity, time-to-deadline per parcel, traffic regime), earns a reward that penalizes lateness and fuel, and transitions stochastically to the next state. Value iteration on a discretized model lifted on-time rate to 97% with the same fleet. An MDP is the standard mathematical scaffold for sequential decisions under uncertainty: you specify states, actions, transition probabilities, rewards, and a discount factor, then ask which policy maximizes expected cumulative reward. This guide builds the MDP tuple from the Markov property through Bellman equations, value and policy iteration, partial observability, works the Harbor routing case, provides a method decision table, lists pitfalls, and ends with a practitioner checklist. For the broader RL algorithm landscape built on MDPs, see reinforcement learning; for memoryless state-only models without actions, see Markov chains.

The MDP tuple: (S, A, P, R, γ)

A finite MDP is defined by five objects:

State space S: everything the decision-maker needs to act optimally. Harbor uses grid cells plus a discretized “urgency bucket” per remaining stop (on-time / at-risk / late-if-delayed).
Action space A(s): feasible moves from state s — which adjacent cell to enter, which parcel to service next, whether to wait for a traffic signal cycle.
Transition kernel P(s' | s, a): probability of landing in s' after taking action a in s. Stochasticity captures traffic lights, variable travel times, and customer not-at-home events.
Reward function R(s, a, s'): immediate payoff (or cost) of the transition. Harbor uses negative rewards: fuel cost, lateness penalties scaled by parcel priority, small bonuses for early completion.
Discount factor γ ∈ [0, 1): weights future rewards relative to immediate ones. Near 1 for long-horizon logistics; lower when you care mostly about the next few decisions.

A trajectory is a sequence s₀, a₀, r₁, s₁, a₁, …. The return is the discounted sum of rewards: G_t = ∑_k=0^∞ γ^k R_t+k+1. The agent's objective is to maximize expected return from every reachable state.

The Markov property

P(s_t+1 | s_t, a_t, s_t-1, …) = P(s_t+1 | s_t, a_t) — the future depends on the past only through the current state. If your state vector omits information (e.g. hidden traffic upstream), the property is violated and you need a POMDP (partially observable MDP) or a richer state. Violating Markovness silently makes “optimal” policies suboptimal in the real world.

Policies and value functions

A policy π(a | s) maps states to action distributions. Deterministic policies pick one action per state; stochastic policies randomize (useful for exploration in learning settings).

State-value function V^π(s)

Expected return starting in s and following π thereafter: V^π(s) = E_π[ G_t | S_t = s ]. Optimal value V*(s) = max_π V^π(s).

Action-value function Q^π(s, a)

Expected return after taking a in s, then following π: Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]. The link: V^π(s) = ∑_a π(a|s) Q^π(s,a). An optimal deterministic policy satisfies π*(s) = argmax_a Q*(s, a).

Advantage A^π(s,a) = Q^π(s,a) - V^π(s) measures how much better action a is than the policy average — central to actor-critic and PPO methods in deep RL.

Bellman equations and optimality

Value functions satisfy recursive Bellman equations. For a fixed policy:

V^π(s) = ∑_a π(a|s) ∑_s' P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]

Optimality breaks the recursion with a max over actions (Bellman optimality equation):

V*(s) = max_a ∑_s' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]

The Q-version is analogous and often more convenient when action spaces differ per state. These equations are the bridge between model-based dynamic programming (you know P and R) and model-free learning (you estimate Q from samples).

Discount factor interpretation

γ → 1: patient agent; late penalties far in the future still matter. Use for long routes and inventory problems.
Lower γ: myopic agent; good when the model horizon is unreliable or episodes are short.
Effective horizon: roughly 1 / (1 - γ) steps; γ = 0.99 implies ~100-step planning horizon.

Solving MDPs: value iteration and policy iteration

When S and A are small enough to tabulate, classic algorithms find π* exactly (up to numerical tolerance).

Value iteration

Repeatedly apply the Bellman optimality operator as an update: V_k+1(s) ← max_a ∑_s' P(s'|s,a)[R + γ V_k(s')] until convergence. Extract π(s) = argmax_a ∑ P(s'|s,a)[R + γ V(s')] at the end. Each sweep is O(|S||A||S|); converges exponentially under contraction.

Policy iteration

Alternate policy evaluation (solve linear system for V^π) with policy improvement (greedy update). Often fewer iterations than value iteration, but each evaluation step can be expensive for large |S|.

When tabular methods break

Continuous states (robot joint angles), huge combinatorial spaces (Go board), or unknown P push you to function approximation, Monte Carlo tree search, or deep RL. See Monte Carlo tree search for planning without full models and multi-armed bandits for the stateless single-step special case.

Finite vs infinite horizon and average reward

Finite-horizon MDPs add a terminal time T; values become time-dependent V_t(s). Infinite-horizon discounted formulations (above) are the RL default. Average-reward criteria maximize long-run reward per step when discounting is unnatural (factory control, steady-state queueing). Pick the criterion to match how stakeholders score success — mixing them in one project causes policy arguments, not better routes.

Partial observability (POMDPs)

When the agent sees only a noisy observation o instead of true state, the problem becomes a POMDP. Optimal policies map belief states (distributions over S) to actions — exponentially harder. Harbor sidesteps full POMDP solvers by enriching the MDP state with recent traffic API readings and a short velocity history, keeping the model tractable.

Harbor Courier: same-day routing as an MDP

Harbor's metro same-day fleet serves 180 stops per evening shift across eight vans. The legacy engine minimized sum of edge weights from a static graph; it could not express “this pharmacy stop tolerates 8 minutes of slack but the meal-kit stop does not.”

The team built a discretized MDP per van per shift:

States: current cell on a 200 m grid, remaining capacity bucket (0 / 1–3 / 4+ parcels), and per-stop urgency class for the next three candidate stops.
Actions: choose which eligible stop to visit next, or which of two arterial corridors to enter.
Transitions: estimated from six months of GPS traces — empirical P(s'|s,a) for travel time bins and success/fail service events.
Rewards: +2 on-time completion, −15 per minute late beyond grace window, −0.3 per km fuel proxy, −5 for capacity violation attempts.
γ = 0.98 reflecting a 4-hour shift horizon.

Value iteration on ~12,000 states per van (precomputed offline) yields a lookup policy deployed on tablets. Drivers still see turn-by-turn maps, but the stop sequencing changed: the policy sometimes accepts a longer leg early to protect a tight window later — behavior impossible for static shortest path. A/B over three weeks: on-time rate 91.2% → 97.1%, fuel per stop −4%, driver overtime −11%. Offline model refresh runs weekly; if traffic non-stationarity spikes (holiday surge), they temporarily raise exploration noise in a shadow RL fine-tuning lane before promoting an updated P table.

Method decision table

Problem shape	Tabular MDP (VI / PI)	Markov chain (no actions)	Multi-armed bandit	Deep RL / MCTS
Small discrete state, known transitions	Strong fit	Wrong tool (no decisions)	Wrong tool (no state)	Overkill
Sequential decisions, stochastic dynamics	Strong if \|S\| modest	Insufficient	Insufficient	When \|S\| huge or P unknown
Single decision per episode (ad slot, A/B arm)	Overkill	Moderate (stateless)	Strong fit	Overkill
Only aggregate state frequencies matter	Optional	Strong fit	N/A	N/A
Perfect information games, huge branching	Infeasible	N/A	N/A	MCTS + policy net
Unknown model, expensive simulator	Need estimated P	N/A	N/A	Model-free RL

Common pitfalls

State too small: omitting deadline pressure or hidden traffic breaks the Markov property; policies look optimal on paper and fail in production.
State too large: curse of dimensionality makes value iteration impossible; discretization must be purposeful, not naive per-pixel.
Misspecified rewards: optimizing the wrong R yields perverse policies (skipping hard stops, hoarding easy ones). Align rewards with business KPIs and audit edge cases.
Stale transition model: empirical P from last quarter's traffic mis-predicts holiday behavior; schedule model refresh and drift monitors.
Discount mismatch: using γ = 0.9 on a problem stakeholders evaluate monthly undervalues late penalties.
Ignoring partial observability: treating POMDPs as MDPs without belief state causes oscillation (repeatedly revisiting wrong corridors).
Skipping simulator validation: deploy policies through shadow mode or historical replay before touching live drivers or trading capital.

Practitioner checklist

Define the decision epochs (per stop, per minute, per trade) and episode boundaries.
List state variables that render the future independent of the past; add hidden factors or move to POMDP if needed.
Specify action constraints per state (capacity, legal moves, risk limits).
Estimate or simulate P(s'|s,a); document data window and non-stationarity risks.
Design R with stakeholders; test reward hacking scenarios on paper.
Choose discount or average-reward criterion matching the business horizon.
Solve with value iteration, policy iteration, or linear programming for small models.
Validate policy via simulation, historical replay, and small live A/B before full rollout.
Monitor KPI drift; trigger model refresh when transition errors or value residuals spike.
Document state abstraction, γ, and policy extraction for audit and handoff to RL teams if scaling up.

Key takeaways

An MDP formalizes sequential decisions with the tuple (S, A, P, R, γ) and the Markov property on states.
Policies induce value functions V^π and Q^π; optimal policies maximize expected discounted return via Bellman optimality.
Value and policy iteration solve small tabular MDPs exactly when the transition model is known.
State design and reward specification dominate engineering effort; wrong abstractions cannot be fixed by a better solver.
Large or unknown models hand off to MCTS, function approximation, and deep RL — but the MDP framing remains the blueprint.