Guide
Markov decision processes explained
Harbor Courier routed same-day parcels with a static shortest-path engine. It ignored traffic surges, driver fatigue windows, and the asymmetric cost of a missed dinner-slot delivery versus an extra two minutes on a low-priority stop. On-time performance plateaued at 91% despite adding vehicles. The ops team reframed routing as a Markov decision process (MDP): at each intersection the dispatcher chooses an action (which stop next, which arterial to take) based on a compact state (vehicle location, remaining capacity, time-to-deadline per parcel, traffic regime), earns a reward that penalizes lateness and fuel, and transitions stochastically to the next state. Value iteration on a discretized model lifted on-time rate to 97% with the same fleet. An MDP is the standard mathematical scaffold for sequential decisions under uncertainty: you specify states, actions, transition probabilities, rewards, and a discount factor, then ask which policy maximizes expected cumulative reward. This guide builds the MDP tuple from the Markov property through Bellman equations, value and policy iteration, partial observability, works the Harbor routing case, provides a method decision table, lists pitfalls, and ends with a practitioner checklist. For the broader RL algorithm landscape built on MDPs, see reinforcement learning; for memoryless state-only models without actions, see Markov chains.
The MDP tuple: (S, A, P, R, γ)
A finite MDP is defined by five objects:
- State space S: everything the decision-maker needs to act optimally. Harbor uses grid cells plus a discretized “urgency bucket” per remaining stop (on-time / at-risk / late-if-delayed).
- Action space A(s): feasible moves from state
s— which adjacent cell to enter, which parcel to service next, whether to wait for a traffic signal cycle. - Transition kernel P(s' | s, a): probability of landing in
s'after taking actionains. Stochasticity captures traffic lights, variable travel times, and customer not-at-home events. - Reward function R(s, a, s'): immediate payoff (or cost) of the transition. Harbor uses negative rewards: fuel cost, lateness penalties scaled by parcel priority, small bonuses for early completion.
- Discount factor γ ∈ [0, 1): weights future rewards relative to immediate ones. Near 1 for long-horizon logistics; lower when you care mostly about the next few decisions.
A trajectory is a sequence
s0, a0, r1, s1, a1, ….
The return is the discounted sum of rewards:
Gt = ∑k=0∞ γk Rt+k+1.
The agent's objective is to maximize expected return from every reachable state.
The Markov property
P(st+1 | st, at, st-1, …) = P(st+1 | st, at)
— the future depends on the past only through the current state. If your
state vector omits information (e.g. hidden traffic upstream), the property is
violated and you need a POMDP (partially observable MDP) or a
richer state. Violating Markovness silently makes “optimal” policies
suboptimal in the real world.
Policies and value functions
A policy π(a | s) maps states to action
distributions. Deterministic policies pick one action per state; stochastic
policies randomize (useful for exploration in learning settings).
State-value function Vπ(s)
Expected return starting in s and following π thereafter:
Vπ(s) = Eπ[ Gt | St = s ].
Optimal value V*(s) = maxπ Vπ(s).
Action-value function Qπ(s, a)
Expected return after taking a in s, then following
π:
Qπ(s, a) = Eπ[ Gt | St = s, At = a ].
The link: Vπ(s) = ∑a π(a|s) Qπ(s,a).
An optimal deterministic policy satisfies
π*(s) = argmaxa Q*(s, a).
Advantage Aπ(s,a) = Qπ(s,a) - Vπ(s)
measures how much better action a is than the policy average —
central to actor-critic and PPO methods in
deep RL.
Bellman equations and optimality
Value functions satisfy recursive Bellman equations. For a fixed policy:
Vπ(s) = ∑a π(a|s) ∑s' P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
Optimality breaks the recursion with a max over actions (Bellman
optimality equation):
V*(s) = maxa ∑s' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
The Q-version is analogous and often more convenient when action spaces differ
per state. These equations are the bridge between model-based dynamic
programming (you know P and R) and model-free
learning (you estimate Q from samples).
Discount factor interpretation
- γ → 1: patient agent; late penalties far in the future still matter. Use for long routes and inventory problems.
- Lower γ: myopic agent; good when the model horizon is unreliable or episodes are short.
- Effective horizon: roughly
1 / (1 - γ)steps;γ = 0.99implies ~100-step planning horizon.
Solving MDPs: value iteration and policy iteration
When S and A are small enough to tabulate, classic
algorithms find π* exactly (up to numerical tolerance).
Value iteration
Repeatedly apply the Bellman optimality operator as an update:
Vk+1(s) ← maxa ∑s' P(s'|s,a)[R + γ Vk(s')]
until convergence. Extract
π(s) = argmaxa ∑ P(s'|s,a)[R + γ V(s')]
at the end. Each sweep is O(|S||A||S|); converges exponentially under contraction.
Policy iteration
Alternate policy evaluation (solve linear system for
Vπ) with policy improvement
(greedy update). Often fewer iterations than value iteration, but each
evaluation step can be expensive for large |S|.
When tabular methods break
Continuous states (robot joint angles), huge combinatorial spaces (Go board),
or unknown P push you to function approximation, Monte Carlo tree
search, or deep RL. See
Monte Carlo tree search
for planning without full models and
multi-armed bandits
for the stateless single-step special case.
Finite vs infinite horizon and average reward
Finite-horizon MDPs add a terminal time T; values
become time-dependent Vt(s). Infinite-horizon
discounted formulations (above) are the RL default. Average-reward
criteria maximize long-run reward per step when discounting is unnatural (factory
control, steady-state queueing). Pick the criterion to match how stakeholders
score success — mixing them in one project causes policy arguments, not
better routes.
Partial observability (POMDPs)
When the agent sees only a noisy observation o
instead of true state, the problem becomes a POMDP. Optimal policies map
belief states (distributions over S) to actions —
exponentially harder. Harbor sidesteps full POMDP solvers by enriching the MDP
state with recent traffic API readings and a short velocity history, keeping
the model tractable.
Harbor Courier: same-day routing as an MDP
Harbor's metro same-day fleet serves 180 stops per evening shift across eight vans. The legacy engine minimized sum of edge weights from a static graph; it could not express “this pharmacy stop tolerates 8 minutes of slack but the meal-kit stop does not.”
The team built a discretized MDP per van per shift:
- States: current cell on a 200 m grid, remaining capacity bucket (0 / 1–3 / 4+ parcels), and per-stop urgency class for the next three candidate stops.
- Actions: choose which eligible stop to visit next, or which of two arterial corridors to enter.
- Transitions: estimated from six months of GPS traces — empirical
P(s'|s,a)for travel time bins and success/fail service events. - Rewards: +2 on-time completion, −15 per minute late beyond grace window, −0.3 per km fuel proxy, −5 for capacity violation attempts.
- γ = 0.98 reflecting a 4-hour shift horizon.
Value iteration on ~12,000 states per van (precomputed offline)
yields a lookup policy deployed on tablets. Drivers still see turn-by-turn maps,
but the stop sequencing changed: the policy sometimes accepts a longer
leg early to protect a tight window later — behavior impossible for static
shortest path. A/B over three weeks: on-time rate 91.2% → 97.1%, fuel per
stop −4%, driver overtime −11%. Offline model refresh runs weekly;
if traffic non-stationarity spikes (holiday surge), they temporarily raise
exploration noise in a shadow RL fine-tuning lane before promoting an updated
P table.
Method decision table
| Problem shape | Tabular MDP (VI / PI) | Markov chain (no actions) | Multi-armed bandit | Deep RL / MCTS |
|---|---|---|---|---|
| Small discrete state, known transitions | Strong fit | Wrong tool (no decisions) | Wrong tool (no state) | Overkill |
| Sequential decisions, stochastic dynamics | Strong if |S| modest | Insufficient | Insufficient | When |S| huge or P unknown |
| Single decision per episode (ad slot, A/B arm) | Overkill | Moderate (stateless) | Strong fit | Overkill |
| Only aggregate state frequencies matter | Optional | Strong fit | N/A | N/A |
| Perfect information games, huge branching | Infeasible | N/A | N/A | MCTS + policy net |
| Unknown model, expensive simulator | Need estimated P | N/A | N/A | Model-free RL |
Common pitfalls
- State too small: omitting deadline pressure or hidden traffic breaks the Markov property; policies look optimal on paper and fail in production.
- State too large: curse of dimensionality makes value iteration impossible; discretization must be purposeful, not naive per-pixel.
- Misspecified rewards: optimizing the wrong
Ryields perverse policies (skipping hard stops, hoarding easy ones). Align rewards with business KPIs and audit edge cases. - Stale transition model: empirical
Pfrom last quarter's traffic mis-predicts holiday behavior; schedule model refresh and drift monitors. - Discount mismatch: using
γ = 0.9on a problem stakeholders evaluate monthly undervalues late penalties. - Ignoring partial observability: treating POMDPs as MDPs without belief state causes oscillation (repeatedly revisiting wrong corridors).
- Skipping simulator validation: deploy policies through shadow mode or historical replay before touching live drivers or trading capital.
Practitioner checklist
- Define the decision epochs (per stop, per minute, per trade) and episode boundaries.
- List state variables that render the future independent of the past; add hidden factors or move to POMDP if needed.
- Specify action constraints per state (capacity, legal moves, risk limits).
- Estimate or simulate
P(s'|s,a); document data window and non-stationarity risks. - Design
Rwith stakeholders; test reward hacking scenarios on paper. - Choose discount or average-reward criterion matching the business horizon.
- Solve with value iteration, policy iteration, or linear programming for small models.
- Validate policy via simulation, historical replay, and small live A/B before full rollout.
- Monitor KPI drift; trigger model refresh when transition errors or value residuals spike.
- Document state abstraction,
γ, and policy extraction for audit and handoff to RL teams if scaling up.
Key takeaways
- An MDP formalizes sequential decisions with the tuple (S, A, P, R, γ) and the Markov property on states.
- Policies induce value functions Vπ and Qπ; optimal policies maximize expected discounted return via Bellman optimality.
- Value and policy iteration solve small tabular MDPs exactly when the transition model is known.
- State design and reward specification dominate engineering effort; wrong abstractions cannot be fixed by a better solver.
- Large or unknown models hand off to MCTS, function approximation, and deep RL — but the MDP framing remains the blueprint.
Related reading
- Markov chains explained — transition matrices, stationary distributions and HMMs without actions
- Reinforcement learning explained — Q-learning, policy gradients, PPO and RLHF built on MDPs
- Monte Carlo tree search explained — planning with rollouts when the tree is too large to tabulate
- Multi-armed bandits explained — the single-state MDP special case for exploration