
Policy gradient methods explained

Harbor Logistics' warehouse AMRs followed a hand-tuned Q-table for aisle selection: discretize pose into a 40×40 position grid with 8 heading bins, learn action values offline, deploy greedily. It worked until SKU layouts began changing weekly. Re-tuning took days, and the tabular policy could not express smooth trade-offs between tight turns and long straightaways. The robotics team switched to policy gradient training with PPO: a neural network maps lidar-rich observations directly to a distribution over steering and speed commands. After three simulator weeks and two days of safe on-floor fine-tuning, mean pick time dropped 19% with fewer near-miss events.

Policy gradient methods optimize a parameterized policy πθ(a|s) by ascending the gradient of expected return with respect to the parameters θ. Unlike value-based methods that learn Q(s,a) and derive actions indirectly, policy gradients learn the behavior rule itself, which is a natural fit for continuous control, stochastic policies, and high-dimensional action spaces.

This guide covers the policy gradient theorem, REINFORCE, variance-reduction tricks, actor-critic architectures, PPO, the Harbor refactor, a method decision table against Q-learning and planning, pitfalls, and a production checklist. It sits alongside our reinforcement learning overview, MDP foundations, and RLHF alignment guide.

Why optimize the policy directly?

Value-based methods like Q-learning estimate how good each action is, then act greedily or ε-greedily. That works when actions are discrete and few. Breakdowns appear when:

  • Actions are continuous (joint torques, throttle, steering angle) — argmax over an infinite set needs a separate optimization each step.
  • Stochastic policies help exploration — robotics and dialogue benefit from sampling, not deterministic argmax.
  • The policy class is constrained — you want a smooth Gaussian over velocities, not a brittle lookup table.
  • Action space is large or structured — language generation treats each token as an action; policy gradients underpin modern RLHF pipelines.

Policy gradients trade higher gradient variance for direct control over the policy shape. With baselines, critics, and trust-region updates, they scale to humanoid locomotion, StarCraft agents, and LLM fine-tuning.

The policy gradient theorem

Goal: maximize J(θ) = Eτ~πθ[ G0 ], where τ is a trajectory sampled by running πθ and G0 is its discounted return from the first step. The policy gradient theorem (Sutton et al.) gives an unbiased gradient estimator without differentiating through the environment transition model:

∇θ J(θ) = Eτ~πθ [ ∑_{t=0}^{T} ∇θ log πθ(at|st) · Gt ]

Intuition: increase log-probability of actions that preceded high returns; decrease those that preceded poor returns. The log-derivative trick moves the gradient inside the expectation over trajectories sampled from the current policy — the core of REINFORCE and its descendants.
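The log-derivative step can be written out explicitly. The following is a sketch of the standard argument; the symbols p_θ(τ) for the trajectory probability under πθ and p(s_{t+1} | s_t, a_t) for the transition model are introduced here for illustration and do not appear in the text above.

```latex
\begin{align*}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, G_0(\tau)\, d\tau
   = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G_0(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, G_0(\tau) \right], \\[4pt]
\log p_\theta(\tau)
  &= \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)
     + \underbrace{\log p(s_0) + \sum_{t} \log p(s_{t+1} \mid s_t, a_t)}_{\text{independent of } \theta}, \\[4pt]
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{align*}
```

The transition terms vanish under the gradient, which is why no model of the environment is needed. Replacing G0 with Gt inside the sum follows because rewards earned before step t cannot depend on the action at t. With discounting, that step strictly yields γ^t · Gt; the γ^t factor is commonly dropped in practical implementations.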

Episodic vs continuing tasks

In episodic tasks, Gt is return-to-go from timestep t. In continuing tasks, use discounted returns or average-reward formulations. Harbor's pick episodes terminate when the tote is full or the order list is empty (typically 90–240 seconds).

REINFORCE: Monte Carlo policy gradient

REINFORCE (Williams, 1992) is the vanilla algorithm:

  1. Sample full episodes using πθ.
  2. For each timestep t, compute return Gt.
  3. Update θ ← θ + α ∇θ log πθ(at|st) Gt.

Simple and correct, but high variance: a lucky final reward upweights every early action equally. A single collision near the end can poison gradients for an otherwise good aisle approach. Production systems rarely stop at vanilla REINFORCE; they add baselines and critics.
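The three steps above can be sketched on a toy problem. The corridor environment below (three states, "stay" or "move right", +1 on reaching the goal and −1 otherwise) is a hypothetical example chosen for illustration, as are the tabular softmax policy and the step size; none of it comes from the Harbor system.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 3, 2, 10  # hypothetical toy corridor, not Harbor's env


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def run_episode(theta, rng):
    """Step 1: sample one full episode with the current policy."""
    s, traj = 0, []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[s])
        a = 0 if rng.random() < probs[0] else 1  # 0 = stay, 1 = move right
        s_next = min(s + 1, GOAL) if a == 1 else s
        r = 1.0 if s_next == GOAL else -1.0
        traj.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return traj


def reinforce(episodes=2000, alpha=0.05, gamma=0.99, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # one logit per (state, action)
    for _ in range(episodes):
        traj = run_episode(theta, rng)
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        g = 0.0
        # Step 2: walk backwards so g is the discounted return from t onward.
        for s, a, r in reversed(traj):
            g = r + gamma * g
            probs = softmax(theta[s])
            for b in range(2):
                # For a softmax policy: d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
                grad[s][b] += ((1.0 if b == a else 0.0) - probs[b]) * g
        # Step 3: gradient ascent on the sampled estimate.
        for s in range(N_STATES):
            for b in range(2):
                theta[s][b] += alpha * grad[s][b]
    return theta


if __name__ == "__main__":
    learned = reinforce()
    print([round(softmax(row)[1], 3) for row in learned[:GOAL]])  # P(move right) per state
```

The sketch accumulates the gradient over the whole episode before applying it, so every term is evaluated at the parameters that generated the data.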

Reward-to-go variant

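Weight each step by the return from t onward only, not by the full episode return. Rewards collected before t cannot have been caused by the action at t, so dropping them leaves the gradient unbiased while removing their variance. The Gt in the update above is already this reward-to-go form, and it is standard practice.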
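A small sketch makes the difference between the two weightings concrete. The function names and the sample reward sequence are illustrative assumptions.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted return from each step t to the end of the episode."""
    out = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


def full_return_weights(rewards, gamma=0.99):
    """Naive weighting: every step is credited with the whole-episode return G0."""
    if not rewards:
        return []
    g0 = rewards_to_go(rewards, gamma)[0]
    return [g0] * len(rewards)


if __name__ == "__main__":
    episode = [5.0, 0.0, -20.0]  # hypothetical: early gain, late collision
    print(full_return_weights(episode, gamma=1.0))
    print(rewards_to_go(episode, gamma=1.0))
```

With the naive weighting, the early +5 leaks into the weight of every later action; with reward-to-go, only the first action is credited for it.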

Variance reduction: baselines and advantages

Subtract a baseline b(st) that does not depend on the sampled action — the gradient remains unbiased if b is action-independent:

∇θ J(θ) ≈ E [ ∑_t ∇θ log πθ(at|st) (Gt - b(st)) ]

A common baseline is the state-value estimate Vπ(s). Define the advantage: Aπ(s,a) = Qπ(s,a) - Vπ(s) — how much better action a is than average in state s. Policy updates use ∇ log π(a|s) · A(s,a), which lowers variance dramatically. See MDPs and value functions for the Bellman foundations behind V and Q.
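The unbiasedness claim can be checked numerically for a single state. The sketch below assumes a softmax policy over a handful of discrete actions (an illustrative choice) and computes the exact expectation of ∇ log π(a|s) · b over actions, which should vanish for any action-independent b.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_score_times_baseline(logits, baseline):
    """Exact E_{a ~ pi}[ grad_logits log pi(a) * baseline ] for a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # Score function of a softmax: d log pi(a) / d logit_k = 1[k == a] - pi(k)
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += probs[a] * score * baseline
    return grad


if __name__ == "__main__":
    # Arbitrary logits and baseline value, chosen only for the demonstration.
    print(expected_score_times_baseline([0.3, -1.2, 2.0], baseline=7.5))
```

Every component comes out at floating-point zero, so subtracting b(st) changes the variance of the estimator but not its mean.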

Generalized Advantage Estimation (GAE)

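GAE blends multi-step returns through a trace parameter λ that trades bias against variance in the advantage estimate: λ = 0 reduces to the one-step TD error (low variance, more critic bias), and λ = 1 to the Monte Carlo return minus the value baseline (no critic bias, more variance). PPO and A2C implementations almost always use GAE with λ ∈ [0.9, 0.98] rather than single-step TD errors alone.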
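A minimal sketch of the GAE recursion for a single rollout segment follows. It assumes the segment contains no episode boundary in the middle; `last_value` is a hypothetical name for the bootstrap value V(s_T), set to 0.0 when the episode terminated.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one segment.

    rewards[t] is r_t, values[t] is V(s_t), last_value is V(s_T) for bootstrapping.
    """
    n = len(rewards)
    adv = [0.0] * n
    running = 0.0
    for t in range(n - 1, -1, -1):
        next_v = values[t + 1] if t + 1 < n else last_value
        delta = rewards[t] + gamma * next_v - values[t]  # one-step TD error
        running = delta + gamma * lam * running          # exponentially weighted sum of TD errors
        adv[t] = running
    return adv


if __name__ == "__main__":
    # lam=0 gives plain TD errors; lam=1 with a zero critic gives discounted returns.
    print(gae([1.0, 2.0], [0.5, 1.0], 0.0, gamma=1.0, lam=0.0))
    print(gae([1.0, 1.0], [0.0, 0.0], 0.0, gamma=0.5, lam=1.0))
```

Vectorized implementations additionally mask the recursion at episode boundaries inside a rollout; that handling is omitted here.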

Actor-critic architectures

An actor-critic splits roles:

  • Actor — parameterized policy πθ(a|s).
  • Critic — value function Vφ(s) or Qφ(s,a) estimating returns for baseline and advantage.

The critic is trained with TD or Monte Carlo regression; the actor uses the critic's advantage signal. Updates can alternate or run in parallel.

A2C and A3C

Advantage Actor-Critic (A2C) synchronously rolls out multiple workers and averages gradients. A3C (asynchronous) uses Hogwild-style parallel actors with stale parameters — historically popular, largely superseded by PPO with vectorized envs on GPUs.

Continuous control: Gaussian policies

For continuous actions, the actor outputs mean and log-standard-deviation of a Gaussian (or squashed Gaussian via tanh for bounded actions). Log-probabilities are tractable; reparameterization is not required for REINFORCE-style updates but appears in off-policy actor-critic methods like SAC.
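The log-probability of a tanh-squashed Gaussian needs a change-of-variables correction. Below is a one-dimensional sketch; the function names are illustrative, and the actor's mean and log-standard-deviation are passed in as plain numbers rather than produced by a network.

```python
import math
import random


def gaussian_log_prob(u, mean, log_std):
    """Log-density of N(mean, exp(log_std)^2) at u."""
    std = math.exp(log_std)
    return -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)


def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_log_prob(u, mean, log_std):
    """Log-density of a = tanh(u), where u is the pre-squash Gaussian sample."""
    # Numerically stable form of log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return gaussian_log_prob(u, mean, log_std) - log_det


def sample_action(mean, log_std, rng):
    """Draw a bounded action in (-1, 1) and return it with its log-probability."""
    u = rng.gauss(mean, math.exp(log_std))
    return math.tanh(u), squashed_log_prob(u, mean, log_std)


if __name__ == "__main__":
    action, log_prob = sample_action(mean=0.2, log_std=-0.5, rng=random.Random(0))
    print(action, log_prob)
```

Rescaling the action from (-1, 1) to physical limits adds one more constant log-Jacobian term, which cancels in PPO's probability ratio.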

PPO: stable on-policy updates at scale

Proximal Policy Optimization (PPO) (Schulman et al., 2017) is the default on-policy workhorse. It maximizes a clipped surrogate objective that prevents destructively large policy updates:

LCLIP(θ) = E [ min( rt(θ) Ât, clip(rt(θ), 1-ε, 1+ε) Ât ) ]

where rt(θ) = πθ(at|st) / πθold(at|st) is the probability ratio and Ât is the advantage estimate. Clipping with ε ≈ 0.1–0.2 keeps new and old policies close while still improving reward. PPO is simpler than TRPO (trust region with KL constraints) and empirically robust across robotics, games, and LLM reasoning fine-tuning variants.
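The clipped surrogate can be sketched in a few lines of plain Python. The function name and list-based batch are illustrative assumptions; real implementations compute the same quantity on tensors and differentiate through it.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Mean clipped surrogate L^CLIP over a batch of (log-prob, advantage) samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)                 # r_t(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip(r_t, 1 - eps, 1 + eps)
        total += min(ratio * adv, clipped * adv)          # pessimistic of the two terms
    return total / len(advantages)


if __name__ == "__main__":
    up = math.log(1.5)    # new policy makes the action 1.5x more likely
    down = math.log(0.5)  # new policy makes the action half as likely
    print(ppo_clip_objective([up], [0.0], [1.0]))     # gain capped at 1 + eps
    print(ppo_clip_objective([up], [0.0], [-1.0]))    # loss is not capped
    print(ppo_clip_objective([down], [0.0], [-1.0]))  # gain capped at 1 - eps
```

The min makes the objective a lower bound: moves that would improve the surrogate are capped once the ratio leaves [1-ε, 1+ε], while moves that hurt it are counted in full.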

Practical PPO hyperparameters

  • Rollout length: 128–2048 steps per env before update.
  • Epochs per batch: 3–10 passes over the same data (on-policy reuse).
  • Entropy bonus: encourages exploration; decay as policy matures.
  • Value loss coefficient: typically 0.5; some implementations also clip the change in value predictions, mirroring the policy clip, for stability.
  • Gradient clipping: global norm 0.5–1.0 prevents critic spikes from destabilizing the actor.
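One way to keep these settings together is a small config object. The sketch below is hypothetical: the class and field names are invented for illustration, the clip ε, GAE λ, and environment count echo the Harbor values quoted elsewhere in this guide, and the rollout length, epoch count, and entropy coefficients are assumed values picked from within the ranges above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    num_envs: int = 512                # parallel sim envs
    rollout_len: int = 128             # assumption: low end of the 128-2048 range
    epochs: int = 4                    # assumption: within the 3-10 range
    clip_eps: float = 0.15
    gae_lambda: float = 0.95
    value_coef: float = 0.5
    max_grad_norm: float = 0.5         # global-norm gradient clip
    entropy_coef_start: float = 0.01   # assumption: not stated in the guide
    entropy_coef_end: float = 0.0      # assumption: decay the bonus to zero

    @property
    def steps_per_update(self):
        """Environment steps collected before each PPO update."""
        return self.num_envs * self.rollout_len

    def entropy_coef(self, progress):
        """Linear decay of the entropy bonus; progress runs from 0.0 to 1.0."""
        p = min(max(progress, 0.0), 1.0)
        return self.entropy_coef_start + p * (self.entropy_coef_end - self.entropy_coef_start)


if __name__ == "__main__":
    cfg = PPOConfig()
    print(cfg.steps_per_update, cfg.entropy_coef(0.5))
```

A linear schedule is only one option for decaying the entropy bonus; the guide does not prescribe a particular shape.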

On-policy vs off-policy policy gradients

On-policy methods (REINFORCE, A2C, PPO) learn from data generated by the current policy. Sample efficiency is lower but stability is higher. Off-policy actor-critics (DDPG, TD3, SAC) reuse replay buffers and learn from behavior older than the current policy — better sample efficiency for expensive simulators, but more hyperparameter sensitivity. Harbor chose on-policy PPO because the simulator was cheap and safety reviewers wanted a clear bound on how far each deploy could drift from the last validated checkpoint.

Connection to bandits and planning

The single-state bandit is a degenerate MDP; policy gradient reduces to weighted log-likelihood updates on arms. For lookahead planning without gradient steps, see Monte Carlo tree search and the stateless multi-armed bandits guide.
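For the bandit case the gradient has a closed form, which the sketch below uses instead of sampling. The arm rewards, step size, and function name are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def bandit_policy_gradient(arm_rewards, steps=500, alpha=0.5):
    """Exact gradient ascent on J = sum_a pi(a) * r(a) with a softmax over arms."""
    theta = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(theta)
        expected = sum(p * r for p, r in zip(probs, arm_rewards))
        # sum_a pi(a) r(a) d log pi(a)/d theta_k simplifies to pi(k) * (r(k) - E[r]).
        theta = [th + alpha * p * (r - expected)
                 for th, p, r in zip(theta, probs, arm_rewards)]
    return softmax(theta)


if __name__ == "__main__":
    print(bandit_policy_gradient([0.2, 0.5, 0.9]))  # hypothetical mean rewards per arm
```

Each arm's logit moves in proportion to how much its reward beats the policy's current average, which is the single-state version of the advantage-weighted update.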

Harbor Logistics: AMR pick-path policy refactor

Harbor's mid-size fulfillment center runs eight autonomous mobile robots on a 1.2 km² grid with dynamic pick lists. The legacy pipeline:

  1. Discretize pose (x, y, heading) into 40×40×8 bins.
  2. Offline Q-learning on a simplified simulator (no other robots).
  3. Deploy greedy Q with a static obstacle map.

Failures clustered around layout changes, bidirectional aisle sharing, and non-Markovian congestion (other robots not in state). Retraining the Q-table after each layout tweak took 2–4 engineer-days.

The replacement PPO stack:

  • Observation: 72-dim vector — local occupancy grid (48), goal-relative pose (8), remaining tote slots (4), time-to-deadline scalar (1), neighbor robot relative positions (11).
  • Action: continuous forward speed and yaw rate, squashed to warehouse limits.
  • Reward: +1 per successful pick, −0.01 per second, −5 near-collision (lidar threshold), −20 actual collision, +3 early completion bonus.
  • Training: 512 parallel Isaac-style sim envs, GAE λ=0.95, clip ε=0.15, 4M steps to convergence.
  • Deploy: ONNX actor on robot; critic dropped; two-day on-floor fine-tune with safety wrapper that overrides if lidar clearance < 30 cm.
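The observation layout and reward terms listed above can be written down as a short sketch. The dictionary keys, the `StepEvents` container, and the function name are hypothetical. The guide does not say whether the near-collision and collision penalties stack, so the code assumes only the larger one applies on a given step.

```python
from dataclasses import dataclass

# Observation blocks and their sizes, as listed in the stack description.
OBS_LAYOUT = {
    "occupancy_grid": 48,
    "goal_relative_pose": 8,
    "remaining_tote_slots": 4,
    "time_to_deadline": 1,
    "neighbor_relative_positions": 11,
}


@dataclass
class StepEvents:
    """What happened during one control step (hypothetical container)."""
    picks: int = 0
    seconds: float = 0.0
    near_collision: bool = False
    collision: bool = False
    completed_early: bool = False


def step_reward(ev):
    reward = 1.0 * ev.picks - 0.01 * ev.seconds
    if ev.collision:
        reward -= 20.0
    elif ev.near_collision:  # assumption: penalties do not stack
        reward -= 5.0
    if ev.completed_early:
        reward += 3.0
    return reward


if __name__ == "__main__":
    print(sum(OBS_LAYOUT.values()))
    print(step_reward(StepEvents(picks=1, seconds=2.0)))
```

Summing the block sizes is a cheap consistency check that the observation vector matches the actor's input dimension.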

Results over four weeks live: mean pick cycle −19%, near-miss rate −34%, layout-change revalidation dropped from days to re-running a fixed 50k-step fine-tune overnight. The team still maintains a model-based particle filter for localization; policy gradients handle where to go next, not pose estimation.

Method decision table

| Approach | Best for | Tradeoff |
| --- | --- | --- |
| REINFORCE / reward-to-go | Teaching, small discrete envs, proof-of-concept | High variance; slow convergence |
| Actor-critic (A2C) | Moderate-scale discrete/continuous control | Sensitive to critic bias; superseded by PPO in most cases |
| PPO | Robotics sim2real, games, on-policy LLM RL | Sample-inefficient vs off-policy; needs many parallel envs |
| SAC / TD3 | Expensive real-world steps, continuous control | Off-policy instability; harder safety certification |
| DQN / Q-learning | Discrete actions, Atari-style benchmarks | Awkward for continuous or stochastic policies |
| MCTS + policy prior | Perfect-information games, planning at inference | Compute-heavy online; not end-to-end learned control |

Common pitfalls

  • Ignoring reward scale: unnormalized returns explode critic loss; standardize advantages per batch.
  • Sparse rewards without shaping: policy gradient needs signal; add dense proxies (distance-to-goal) then anneal.
  • Non-stationary data in on-policy loops: PPO reuses each rollout for multiple epochs, and too many epochs overfit to stale data.
  • Critic lag: if V(s) is wrong, advantages mis-rank actions; set the critic learning rate at or above the actor's.
  • Entropy collapse: policy becomes deterministic too early; monitor entropy and use bonus decay schedules.
  • Sim-to-real gap: randomize friction, sensor noise, and delays in sim; Harbor's safety wrapper is non-negotiable on floor.
  • Violating Markov state: omitting other agents or hidden inventory states makes “optimal” policies brittle — enrich observations or use centralized training.
  • Deploying without action bounds: Gaussian tails can command illegal speeds; squash or clip at inference.
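Two of the fixes above, per-batch advantage standardization and inference-time action bounds, are short enough to sketch directly. The function names and the example limits are illustrative assumptions.

```python
import math


def standardize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and unit standard deviation."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]


def clip_action(value, low, high):
    """Hard bound applied at inference so Gaussian tails cannot exceed limits."""
    return max(low, min(value, high))


if __name__ == "__main__":
    print(standardize_advantages([1.0, 2.0, 3.0]))
    print(clip_action(3.2, low=-1.5, high=1.5))  # hypothetical speed limit in m/s
```

Standardization is applied to the advantages fed to the actor loss; it does not replace normalizing rewards or returns for the critic.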

Practitioner checklist

  • Define MDP tuple explicitly: state, action, reward, episode termination (see MDP guide).
  • Start with reward-to-go REINFORCE on a toy env to verify gradient sign before scaling.
  • Add value baseline or critic before training on real hardware.
  • Use GAE for advantage estimation in PPO/A2C implementations.
  • Log policy entropy, KL to previous policy, value loss, and episode return percentiles.
  • Vectorize environments; target 10k–100k steps per PPO update on GPU.
  • Checkpoint actor and critic separately; export actor-only for deployment.
  • Wrap live policies with hard safety overrides independent of learned actions.
  • Version sim assets with policy checkpoints for reproducible revalidation.
  • Compare against strong baselines (A*, heuristic controller) before claiming RL wins.
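For the logging item in the checklist, entropy and KL have closed forms when the policy is a diagonal Gaussian. The sketch below assumes each policy is described by a list of (mean, log_std) pairs, one per action dimension; that representation and the function names are illustrative.

```python
import math


def gaussian_entropy(log_std):
    """Entropy of a 1-D Gaussian, in nats."""
    return log_std + 0.5 * math.log(2.0 * math.pi * math.e)


def gaussian_kl(mean_old, log_std_old, mean_new, log_std_new):
    """KL(old || new) between two 1-D Gaussians."""
    var_old = math.exp(2.0 * log_std_old)
    var_new = math.exp(2.0 * log_std_new)
    return ((log_std_new - log_std_old)
            + (var_old + (mean_old - mean_new) ** 2) / (2.0 * var_new)
            - 0.5)


def policy_metrics(old_params, new_params):
    """Sum per-dimension terms; valid because the dimensions are independent."""
    entropy = sum(gaussian_entropy(ls) for _, ls in new_params)
    kl = sum(gaussian_kl(mo, lo, mn, ln)
             for (mo, lo), (mn, ln) in zip(old_params, new_params))
    return {"entropy": entropy, "kl_old_new": kl}


if __name__ == "__main__":
    old = [(0.0, 0.0), (0.1, -0.5)]   # hypothetical (mean, log_std) for speed and yaw rate
    new = [(0.05, -0.1), (0.1, -0.6)]
    print(policy_metrics(old, new))
```

These formulas describe the unsquashed Gaussian; if actions pass through tanh, the logged entropy is that of the pre-squash distribution.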

Key takeaways

  • Policy gradients optimize the behavior rule directly via the log-probability trick — essential for continuous and stochastic control.
  • Vanilla REINFORCE is a teaching tool; production stacks use baselines, critics, and clipped objectives (PPO).
  • Advantage estimation and GAE are the main levers for variance reduction and stable learning.
  • On-policy PPO trades sample efficiency for stability; off-policy SAC wins when real-world samples are expensive.
  • Harbor-style deployments pair sim-trained actors with safety wrappers and localization systems that stay outside the RL loop.

Related reading