Policy gradient methods explained
Harbor Logistics' warehouse AMRs followed a hand-tuned Q-table for aisle selection: discretize pose into a 40×40 grid, learn action values offline, deploy greedily. It worked until SKU layouts changed weekly. Re-tuning took days, and the tabular policy could not express smooth trade-offs between tight turns and long straightaways. The robotics team switched to policy gradient training with PPO: a neural network maps lidar-rich observations directly to a distribution over steering and speed commands. After three simulator weeks and two days of safe on-floor fine-tuning, mean pick time dropped 19% with fewer near-miss events.
Policy gradient methods optimize a parameterized policy π_θ(a|s) by ascending the gradient of expected return with respect to parameters θ. Unlike value-based methods that learn Q(s,a) and derive actions indirectly, policy gradients learn the behavior rule itself, which is a natural fit for continuous control, stochastic policies, and high-dimensional action spaces.
This guide covers the policy gradient theorem, REINFORCE, variance-reduction tricks, actor-critic architectures, PPO, the Harbor refactor, a method decision table vs Q-learning and planning, pitfalls, and a production checklist. It sits alongside our reinforcement learning overview, MDP foundations, and RLHF alignment guide.
Why optimize the policy directly?
Value-based methods like Q-learning estimate how good each action is, then act greedily or ε-greedily. That works when actions are discrete and few. Breakdowns appear when:
- Actions are continuous (joint torques, throttle, steering angle) — argmax over an infinite set needs a separate optimization each step.
- Stochastic policies help exploration — robotics and dialogue benefit from sampling, not deterministic argmax.
- The policy class is constrained — you want a smooth Gaussian over velocities, not a brittle lookup table.
- Action space is large or structured — language generation treats each token as an action; policy gradients underpin modern RLHF pipelines.
Policy gradients trade higher gradient variance for direct control over the policy shape. With baselines, critics, and trust-region updates, they scale to humanoid locomotion, StarCraft agents, and LLM fine-tuning.
The policy gradient theorem
Goal: maximize J(θ) = E_{τ∼π_θ} [ G_0 ]
where τ is a trajectory and G_0 is its discounted return.
The policy gradient theorem (Sutton et al.) gives an unbiased gradient estimator without differentiating through the environment transition model:
∇_θ J(θ) = E_τ [ ∑_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t ]
Intuition: increase log-probability of actions that preceded high returns; decrease those that preceded poor returns. The log-derivative trick moves the gradient inside the expectation over trajectories sampled from the current policy — the core of REINFORCE and its descendants.
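The log-derivative step can be written out as a short derivation sketch. The trajectory density p_θ(τ), the initial-state distribution ρ, and the transition kernel P are notation introduced here for illustration; they do not appear elsewhere in this guide.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, G_0(\tau)\, d\tau
   = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G_0(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, G_0(\tau) \right], \\[4pt]
\log p_\theta(\tau)
  &= \log \rho(s_0)
   + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)
   + \sum_{t=0}^{T} \log P(s_{t+1} \mid s_t, a_t), \\[4pt]
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{aligned}
```

The initial-state and transition terms do not depend on θ, so they drop out of the gradient, which is why no environment model is needed. Rewards earned before step t cannot depend on a_t, so their contribution vanishes in expectation and G_0 can be replaced by the return-to-go G_t. Strictly, the discounted objective leaves an extra γ^t weight on each term; many implementations omit it.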
Episodic vs continuing tasks
In episodic tasks, G_t is the return-to-go from timestep t. In continuing tasks, use discounted returns or average-reward formulations. Harbor's pick episodes terminate when the tote is full or the order list is empty (typically 90–240 seconds).
REINFORCE: Monte Carlo policy gradient
REINFORCE (Williams, 1992) is the vanilla algorithm:
- Sample full episodes using π_θ.
- For each timestep t, compute the return G_t.
- Update θ ← θ + α ∇_θ log π_θ(a_t|s_t) G_t.
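The three steps above can be sketched with a tabular softmax policy in plain Python. The corridor environment, the function names, and the learning rate are all assumptions made for illustration; this is a minimal sketch, not a tuned implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_episode(logits, rng, max_steps=20):
    """Step 1: roll out one episode in a hypothetical corridor.

    Cells are 0..len(logits)-1; action 0 moves left, action 1 moves right,
    and reaching the last cell pays +1 and ends the episode.
    """
    goal = len(logits) - 1
    state, episode = 0, []
    for _ in range(max_steps):
        probs = softmax(logits[state])
        action = 0 if rng.random() < probs[0] else 1
        next_state = max(0, state - 1) if action == 0 else state + 1
        done = next_state == goal
        episode.append((state, action, 1.0 if done else 0.0))
        if done:
            break
        state = next_state
    return episode


def reinforce_update(logits, episode, alpha=0.1, gamma=0.99):
    """Steps 2 and 3: compute G_t, then ascend sum_t grad log pi(a_t|s_t) * G_t."""
    returns, g = [], 0.0
    for _, _, reward in reversed(episode):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()

    # Accumulate the gradient at the current parameters before applying it.
    grad = [[0.0] * len(row) for row in logits]
    for (state, action, _), g_t in zip(episode, returns):
        probs = softmax(logits[state])
        for k, p in enumerate(probs):
            # d/d logit_k of log softmax(logits)[action] = 1[k == action] - p_k
            grad[state][k] += ((1.0 if k == action else 0.0) - p) * g_t

    for state, row in enumerate(grad):
        for k, value in enumerate(row):
            logits[state][k] += alpha * value
    return logits


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [[0.0, 0.0] for _ in range(4)]
    for _ in range(500):
        reinforce_update(logits, sample_episode(logits, rng))
    # Probability of moving right in each non-terminal cell after training.
    print([round(softmax(row)[1], 3) for row in logits[:3]])
```

A neural-network policy would replace the logit table with a differentiable model and let an autodiff library compute the same score-function gradient.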
Simple and correct, but high variance: a lucky final reward upweights every early action equally. A single collision near the end can poison gradients for an otherwise good aisle approach. Production systems rarely stop at vanilla REINFORCE; they add baselines and critics.
Reward-to-go variant
Use return from t onward only, not the full episode return
at every step. This removes variance from rewards already collected before
t and is standard practice.
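To make the difference concrete, here is a small sketch that computes both weightings for one reward sequence. The function names and the example rewards are illustrative assumptions.

```python
def full_episode_weights(rewards, gamma):
    """Weight every timestep by the same total discounted episode return."""
    total = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return [total] * len(rewards)


def reward_to_go(rewards, gamma):
    """Weight timestep t by the discounted return collected from t onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    print(full_episode_weights(rewards, gamma=1.0))
    print(reward_to_go(rewards, gamma=1.0))
```

With these rewards and no discounting, the reward collected at the first step no longer contributes to the weights on the second and third actions, which could not have caused it.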
Variance reduction: baselines and advantages
Subtract a baseline b(s_t) from the return. As long as b does not depend on the sampled action, the gradient remains unbiased:
∇_θ J ≈ E [ ∑_t ∇_θ log π_θ(a_t|s_t) (G_t − b(s_t)) ]
A common baseline is the state-value estimate V^π(s). Define the advantage:
A^π(s,a) = Q^π(s,a) − V^π(s)
which measures how much better action a is than average in state s. Policy updates use ∇ log π(a|s) · A(s,a), which typically lowers variance substantially. See MDPs and value functions for the Bellman foundations behind V and Q.
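A quick way to see both properties (same expected gradient, lower variance) is to enumerate a tiny single-state problem exactly. The two-arm setup, the reward values, and the function names below are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_stats(logits, rewards, baseline):
    """Exact mean and total variance of grad log pi(a) * (R(a) - b).

    The expectation is taken over arms a ~ softmax(logits), enumerated
    exactly rather than sampled.
    """
    probs = softmax(logits)
    n = len(probs)
    samples = []
    for a in range(n):
        # Score function for a softmax policy: 1[k == a] - p_k.
        score = [(1.0 if k == a else 0.0) - probs[k] for k in range(n)]
        samples.append([s * (rewards[a] - baseline) for s in score])
    mean = [sum(probs[a] * samples[a][k] for a in range(n)) for k in range(n)]
    second_moment = sum(probs[a] * sum(x * x for x in samples[a]) for a in range(n))
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


if __name__ == "__main__":
    logits, rewards = [0.0, 0.0], [10.0, 12.0]
    value = sum(p * r for p, r in zip(softmax(logits), rewards))  # V = 11.0
    print(gradient_stats(logits, rewards, baseline=0.0))
    print(gradient_stats(logits, rewards, baseline=value))
```

Both calls return the same mean gradient, but subtracting the state value removes the variance that came from the rewards being far from zero.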
Generalized Advantage Estimation (GAE)
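GAE blends multi-step returns with a trace parameter λ that trades off bias against variance in the advantage estimates: λ = 0 recovers the one-step TD error (low variance, more bias) and λ = 1 recovers the Monte Carlo return minus the baseline (high variance, less bias). PPO and A2C implementations almost always use GAE with λ ∈ [0.9, 0.98] rather than single-step TD errors alone.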
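A minimal sketch of the GAE recursion for a single rollout segment follows. The function name `gae` and its argument layout are assumptions; it handles one uninterrupted episode segment and does not track per-step termination flags.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout segment.

    rewards[t] is the reward after acting in s_t, values[t] is the critic's
    V(s_t), and last_value is V of the state after the final step
    (use 0.0 if the segment ended in a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    rewards, values = [1.0, 1.0], [0.5, 0.5]
    print(gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0))
    print(gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0))
```

Setting `lam=0.0` returns the raw TD errors, while `lam=1.0` returns the discounted return minus the value estimate, matching the two ends of the bias-variance range.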
Actor-critic architectures
An actor-critic splits roles:
- Actor: parameterized policy π_θ(a|s).
- Critic: value function V_φ(s) or Q_φ(s,a), estimating returns for baseline and advantage.
The critic is trained with TD or Monte Carlo regression; the actor uses the critic's advantage signal. Updates can alternate or run in parallel.
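As a sketch of the TD variant of critic training, the snippet below performs one tabular TD(0) update and returns the TD error, which an actor could use as a one-step advantage signal. The tabular critic and the function name are assumptions for illustration; a real critic would be a function approximator trained by regression.

```python
def td0_critic_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One TD(0) step on a tabular critic; returns the TD error.

    values is a mutable list indexed by state. The TD error
    r + gamma * V(s') - V(s) doubles as a one-step advantage estimate.
    """
    bootstrap = 0.0 if done else values[next_state]
    td_error = reward + gamma * bootstrap - values[state]
    values[state] += alpha * td_error
    return td_error


if __name__ == "__main__":
    values = [0.0, 1.0]
    advantage = td0_critic_update(values, state=0, reward=1.0, next_state=1, done=False,
                                  alpha=0.5, gamma=0.9)
    print(advantage, values)
```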
A2C and A3C
Advantage Actor-Critic (A2C) synchronously rolls out multiple workers and averages gradients. A3C (asynchronous) uses Hogwild-style parallel actors with stale parameters — historically popular, largely superseded by PPO with vectorized envs on GPUs.
Continuous control: Gaussian policies
For continuous actions, the actor outputs mean and log-standard-deviation of a Gaussian (or squashed Gaussian via tanh for bounded actions). Log-probabilities are tractable; reparameterization is not required for REINFORCE-style updates but appears in off-policy actor-critic methods like SAC.
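The log-probability of a tanh-squashed Gaussian action can be sketched for a single action dimension as follows. The function names and the affine rescaling to `[low, high]` are assumptions; the change-of-variables correction is the standard one for a squashed density.

```python
import math
import random


def gaussian_log_prob(u, mean, log_std):
    """Log density of a 1-D Gaussian N(mean, exp(log_std)^2) at u."""
    z = (u - mean) / math.exp(log_std)
    return -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)


def sample_squashed_action(mean, log_std, low, high, rng):
    """Sample a bounded action and its log-probability.

    The pre-squash sample u is Gaussian; the action is tanh(u) rescaled
    to [low, high]. The log-probability subtracts log|da/du|, where
    da/du = 0.5 * (high - low) * (1 - tanh(u)^2).
    """
    u = rng.gauss(mean, math.exp(log_std))
    y = math.tanh(u)
    action = low + 0.5 * (y + 1.0) * (high - low)
    jacobian = 0.5 * (high - low) * (1.0 - y * y)
    # Small constant guards against log(0) when tanh saturates.
    log_prob = gaussian_log_prob(u, mean, log_std) - math.log(jacobian + 1e-12)
    return action, log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical speed bounds in m/s, for illustration only.
    print(sample_squashed_action(mean=0.0, log_std=-0.5, low=0.0, high=1.5, rng=rng))
```

For a multi-dimensional action with a diagonal Gaussian, the per-dimension log-probabilities would simply be summed.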
PPO: stable on-policy updates at scale
Proximal Policy Optimization (PPO) (Schulman et al., 2017) is the default on-policy workhorse. It maximizes a clipped surrogate objective that prevents destructively large policy updates:
L^CLIP(θ) = E [ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the probability ratio and Â_t is the advantage estimate.
Clipping with ε ≈ 0.1–0.2 keeps new and old policies
close while still improving reward. PPO is simpler than TRPO (trust region with
KL constraints) and empirically robust across robotics, games, and
LLM reasoning fine-tuning variants.
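The clipped surrogate is easy to evaluate directly from stored log-probabilities. The sketch below computes its batch mean in plain Python; the function name and argument layout are assumptions, and a training loop would compute the same quantity on tensors so it can be differentiated.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the probability ratio pi_new(a|s) / pi_old(a|s), computed from
    log-probabilities for numerical stability.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # A good action whose probability already rose by 50%: the gain is capped at 1 + eps.
    print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
    # A bad action whose probability already halved: no further credit below 1 - eps.
    print(ppo_clip_objective([math.log(0.5)], [0.0], [-1.0]))
```

Taking the minimum makes the objective a pessimistic bound: once the ratio moves past the clip range in the direction the advantage favors, the gradient for that sample is zero.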
Practical PPO hyperparameters
- Rollout length: 128–2048 steps per env before update.
- Epochs per batch: 3–10 passes over the same data (on-policy reuse).
- Entropy bonus: encourages exploration; decay as policy matures.
- Value loss coefficient: typically 0.5; some implementations also clip the value-function update, analogous to the policy ratio clip, for stability.
- Gradient clipping: global norm 0.5–1.0 prevents critic spikes from destabilizing the actor.
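Three of these knobs can be sketched as small helpers: a decaying entropy coefficient, the combined scalar loss, and global-norm gradient clipping. All function names, the linear schedule, and the starting coefficient of 0.01 are illustrative assumptions rather than recommended settings.

```python
import math


def entropy_coef(step, total_steps, start=0.01, end=0.0):
    """Linearly decay the entropy bonus weight from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def ppo_loss(clip_objective, value_loss, entropy, step, total_steps, value_coef=0.5):
    """Scalar loss to minimize.

    The clipped objective and the entropy bonus are maximized, so both
    enter with a minus sign; the value loss is added with its coefficient.
    """
    bonus = entropy_coef(step, total_steps) * entropy
    return -clip_objective + value_coef * value_loss - bonus


def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    print(entropy_coef(step=50, total_steps=100))
    print(ppo_loss(clip_objective=1.0, value_loss=2.0, entropy=1.4, step=0, total_steps=100))
    print(clip_by_global_norm([3.0, 4.0], max_norm=0.5))
```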
On-policy vs off-policy policy gradients
On-policy methods (REINFORCE, A2C, PPO) learn from data generated by the current policy. Sample efficiency is lower but stability is higher. Off-policy actor-critics (DDPG, TD3, SAC) reuse replay buffers and learn from behavior older than the current policy — better sample efficiency for expensive simulators, but more hyperparameter sensitivity. Harbor chose on-policy PPO because the simulator was cheap and safety reviewers wanted a clear bound on how far each deploy could drift from the last validated checkpoint.
Connection to bandits and planning
The single-state bandit is a degenerate MDP; policy gradient reduces to weighted log-likelihood updates on arms. For lookahead planning without gradient steps, see Monte Carlo tree search and the stateless multi-armed bandits guide.
Harbor Logistics: AMR pick-path policy refactor
Harbor's mid-size fulfillment center runs eight autonomous mobile robots on a 1.2 km² grid with dynamic pick lists. The legacy pipeline:
- Discretize pose (x, y, heading) into 40×40×8 bins.
- Offline Q-learning on a simplified simulator (no other robots).
- Deploy greedy Q with a static obstacle map.
Failures clustered around layout changes, bidirectional aisle sharing, and non-Markovian congestion (other robots not in state). Retraining the Q-table after each layout tweak took 2–4 engineer-days.
The replacement PPO stack:
- Observation: 72-dim vector — local occupancy grid (48), goal-relative pose (8), remaining tote slots (4), time-to-deadline scalar (1), neighbor robot relative positions (11).
- Action: continuous forward speed and yaw rate, squashed to warehouse limits.
- Reward: +1 per successful pick, −0.01 per second, −5 near-collision (lidar threshold), −20 actual collision, +3 early completion bonus.
- Training: 512 parallel Isaac-style sim envs, GAE λ=0.95, clip ε=0.15, 4M steps to convergence.
- Deploy: ONNX actor on robot; critic dropped; two-day on-floor fine-tune with safety wrapper that overrides if lidar clearance < 30 cm.
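The reward terms listed above could be combined per simulator step roughly as follows. The coefficients come from the list; the function signature, the event flags, and the choice to let an actual collision supersede the near-collision penalty in the same step are assumptions, not details of Harbor's implementation.

```python
def step_reward(picks, dt_seconds, near_collision=False, collision=False,
                early_completion=False):
    """Per-step reward using the coefficients from the list above.

    Assumption: if both collision flags are set in the same step, only the
    larger collision penalty is applied.
    """
    reward = 1.0 * picks - 0.01 * dt_seconds
    if collision:
        reward -= 20.0
    elif near_collision:
        reward -= 5.0
    if early_completion:
        reward += 3.0
    return reward


if __name__ == "__main__":
    # One pick completed during a one-second step with no safety events.
    print(step_reward(picks=1, dt_seconds=1.0))
    # A near-miss during a 0.1-second step with no picks.
    print(step_reward(picks=0, dt_seconds=0.1, near_collision=True))
```

Keeping the reward in one pure function makes it easy to unit-test and to version alongside the simulator assets.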
Results over four weeks live: mean pick cycle −19%, near-miss rate −34%, layout-change revalidation dropped from days to re-running a fixed 50k-step fine-tune overnight. The team still maintains a model-based particle filter for localization; policy gradients handle where to go next, not pose estimation.
Method decision table
| Approach | Best for | Tradeoff |
|---|---|---|
| REINFORCE / reward-to-go | Teaching, small discrete envs, proof-of-concept | High variance; slow convergence |
| Actor-critic (A2C) | Moderate-scale discrete/continuous control | Sensitive to critic bias; superseded by PPO in most cases |
| PPO | Robotics sim2real, games, on-policy LLM RL | Sample-inefficient vs off-policy; needs many parallel envs |
| SAC / TD3 | Expensive real-world steps, continuous control | Off-policy instability; harder safety certification |
| DQN / Q-learning | Discrete actions, Atari-style benchmarks | Awkward for continuous or stochastic policies |
| MCTS + policy prior | Perfect-information games, planning at inference | Compute-heavy online; not end-to-end learned control |
Common pitfalls
- Ignoring reward scale: unnormalized returns explode critic loss; standardize advantages per batch.
- Sparse rewards without shaping: policy gradient needs signal; add dense proxies (distance-to-goal) then anneal.
- Non-stationary data in on-policy loops: PPO reuses rollouts for multiple epochs — too many epochs overfits stale data.
- Critic lag: if V(s) is wrong, advantages mis-rank actions; tune critic learning rate ≥ actor rate.
- Entropy collapse: policy becomes deterministic too early; monitor entropy and use bonus decay schedules.
- Sim-to-real gap: randomize friction, sensor noise, and delays in sim; Harbor's safety wrapper is non-negotiable on floor.
- Violating Markov state: omitting other agents or hidden inventory states makes “optimal” policies brittle — enrich observations or use centralized training.
- Deploying without action bounds: Gaussian tails can command illegal speeds; squash or clip at inference.
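Two of the training-side remedies above, per-batch advantage standardization and entropy monitoring, can be sketched in a few lines. The function names are assumptions, and the entropy formula is the closed form for a diagonal Gaussian before any tanh squashing.

```python
import math


def standardize_advantages(advantages, eps=1e-8):
    """Shift and scale a batch of advantages to zero mean and unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def gaussian_entropy(log_stds):
    """Entropy of a diagonal Gaussian policy: sum_i (log_std_i + 0.5 * (1 + log(2 * pi)))."""
    return sum(ls + 0.5 * (1.0 + math.log(2.0 * math.pi)) for ls in log_stds)


if __name__ == "__main__":
    print(standardize_advantages([1.0, 2.0, 3.0]))
    # Log this every update; a steady slide toward very negative values
    # suggests the policy is collapsing to near-deterministic actions.
    print(gaussian_entropy([-0.5, -0.5]))
```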
Practitioner checklist
- Define MDP tuple explicitly: state, action, reward, episode termination (see MDP guide).
- Start with reward-to-go REINFORCE on a toy env to verify gradient sign before scaling.
- Add value baseline or critic before training on real hardware.
- Use GAE for advantage estimation in PPO/A2C implementations.
- Log policy entropy, KL to previous policy, value loss, and episode return percentiles.
- Vectorize environments; target 10k–100k steps per PPO update on GPU.
- Checkpoint actor and critic separately; export actor-only for deployment.
- Wrap live policies with hard safety overrides independent of learned actions.
- Version sim assets with policy checkpoints for reproducible revalidation.
- Compare against strong baselines (A*, heuristic controller) before claiming RL wins.
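The safety-override and action-bound items can be sketched as a thin wrapper between the learned policy and the motor commands. The 30 cm clearance threshold comes from the Harbor description; the speed and yaw limits, the stop-in-place override behavior, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_speed: float              # m/s, hypothetical value set per site
    max_yaw_rate: float           # rad/s, hypothetical value set per site
    min_clearance_m: float = 0.30  # override threshold from the Harbor description


def safe_command(speed, yaw_rate, lidar_clearance_m, limits):
    """Return a bounded (speed, yaw_rate) command, or a stop if clearance is too low.

    Assumption: the override action is a full stop; the source only says the
    wrapper overrides the policy. The check does not depend on the policy output.
    """
    if lidar_clearance_m < limits.min_clearance_m:
        return 0.0, 0.0
    bounded_speed = min(max(speed, 0.0), limits.max_speed)
    bounded_yaw = min(max(yaw_rate, -limits.max_yaw_rate), limits.max_yaw_rate)
    return bounded_speed, bounded_yaw


if __name__ == "__main__":
    limits = Limits(max_speed=1.5, max_yaw_rate=1.0)
    print(safe_command(2.4, -3.0, lidar_clearance_m=0.80, limits=limits))
    print(safe_command(1.0, 0.2, lidar_clearance_m=0.25, limits=limits))
```

Because the wrapper is a pure function with no learned parameters, it can be reviewed and tested independently of any policy checkpoint.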
Key takeaways
- Policy gradients optimize the behavior rule directly via the log-probability trick — essential for continuous and stochastic control.
- Vanilla REINFORCE is a teaching tool; production stacks use baselines, critics, and clipped objectives (PPO).
- Advantage estimation and GAE are the main levers for variance reduction and stable learning.
- On-policy PPO trades sample efficiency for stability; off-policy SAC wins when real-world samples are expensive.
- Harbor-style deployments pair sim-trained actors with safety wrappers and localization systems that stay outside the RL loop.
Related reading
- Reinforcement learning explained — MDPs, exploration, Q-learning, and RLHF landscape
- Markov decision processes explained — Bellman equations, value iteration, and optimal policies
- RLHF explained — policy optimization for aligning language models with human preferences
- Monte Carlo tree search explained — planning with learned policy priors