
Policy gradient methods explained

Harbor Logistics' warehouse AMRs followed a hand-tuned Q-table for aisle selection: discretize pose into a 40×40 grid, learn action values offline, deploy greedily. It worked until SKU layouts started changing weekly. Re-tuning took days, and the tabular policy could not express smooth trade-offs between tight turns and long straightaways. The robotics team switched to policy gradient training with PPO: a neural network maps lidar-rich observations directly to a distribution over steering and speed commands. After three weeks in simulation and two days of safe on-floor fine-tuning, mean pick time dropped 19% with fewer near-miss events.

Policy gradient methods optimize a parameterized policy π_θ(a|s) by ascending the gradient of expected return with respect to the parameters θ. Unlike value-based methods that learn Q(s,a) and derive actions indirectly, policy gradients learn the behavior rule itself, which makes them a natural fit for continuous control, stochastic policies, and high-dimensional action spaces.

This guide covers the policy gradient theorem, REINFORCE, variance-reduction tricks, actor-critic architectures, PPO, the Harbor refactor, a method decision table comparing policy gradients with Q-learning and planning, pitfalls, and a production checklist. It sits alongside our reinforcement learning overview, MDP foundations, and RLHF alignment guide.

Why optimize the policy directly?

Value-based methods like Q-learning estimate how good each action is, then act greedily or ε-greedily. That works when actions are discrete and few. Breakdowns appear when:

Actions are continuous (joint torques, throttle, steering angle) — argmax over an infinite set needs a separate optimization each step.
Stochastic policies help exploration — robotics and dialogue benefit from sampling, not deterministic argmax.
The policy class is constrained — you want a smooth Gaussian over velocities, not a brittle lookup table.
Action space is large or structured — language generation treats each token as an action; policy gradients underpin modern RLHF pipelines.

Policy gradients trade higher gradient variance for direct control over the policy shape. With baselines, critics, and trust-region updates, they scale to humanoid locomotion, StarCraft agents, and LLM fine-tuning.

The policy gradient theorem

Goal: maximize J(θ) = E_{τ~π_θ}[ G₀ ] where τ is a trajectory and G₀ is discounted return. The policy gradient theorem (Sutton et al.) gives an unbiased gradient estimator without differentiating through the environment transition model:

∇_θ J(θ) = E_τ [ ∑_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t ]

Intuition: increase log-probability of actions that preceded high returns; decrease those that preceded poor returns. The log-derivative trick moves the gradient inside the expectation over trajectories sampled from the current policy — the core of REINFORCE and its descendants.
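For readers who want the intermediate step, the log-derivative trick can be written out as follows. This is a sketch of the standard derivation, not taken from the source: p_θ(τ) is notation introduced here for the trajectory density under π_θ, and ρ₀ and P stand for the start-state distribution and the transition model, neither of which depends on θ.

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, G_0(\tau)\, d\tau
   = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G_0(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, G_0(\tau) \right], \\
\log p_\theta(\tau)
  &= \log \rho_0(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right], \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{aligned}
```

Only the policy terms survive differentiation, which is why no transition model is needed. Dropping rewards collected before step t from each term (an action cannot influence the past) turns G₀ into the return-to-go G_t used in the formula above.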

Episodic vs continuing tasks

In episodic tasks, G_t is return-to-go from timestep t. In continuing tasks, use discounted returns or average-reward formulations. Harbor's pick episodes terminate when the tote is full or the order list is empty (typically 90–240 seconds).

REINFORCE: Monte Carlo policy gradient

REINFORCE (Williams, 1992) is the vanilla algorithm:

Sample full episodes using π_θ.
For each timestep t, compute return G_t.
Update θ ← θ + α ∇_θ log π_θ(a_t|s_t) G_t.
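The three steps above can be sketched for the simplest possible policy class, a tabular softmax. This is a minimal illustration, not Harbor's implementation: the function names, the tabular parameterization, and the toy numbers are all assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reinforce_update(theta, states, actions, returns, alpha):
    """One REINFORCE step for a tabular softmax policy.

    theta has shape (n_states, n_actions); theta[s] holds the logits of π_θ(·|s).
    For this policy class, ∇ log π(a|s) w.r.t. theta[s] is onehot(a) - π(·|s).
    """
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        probs = softmax(theta[s])
        score = -probs
        score[a] += 1.0          # onehot(a) - π(·|s)
        grad[s] += score * g     # ∇ log π(a_t|s_t) · G_t
    return theta + alpha * grad  # gradient ascent on expected return

# Toy check: one state, two actions, a single step with return 2.
theta = np.zeros((1, 2))
new_theta = reinforce_update(theta, [0], [0], [2.0], alpha=0.1)
```

With a uniform starting policy, the sampled action's logit rises by 0.1 and the other falls by 0.1, so the action that preceded a positive return becomes more likely.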

Simple and correct, but high variance: a lucky final reward upweights every early action equally. A single collision near the end can poison gradients for an otherwise good aisle approach. Production systems rarely stop at vanilla REINFORCE; they add baselines and critics.

Reward-to-go variant

Use return from t onward only, not the full episode return at every step. This removes variance from rewards already collected before t and is standard practice.
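A short sketch of the return-to-go computation, assuming one finished episode stored as a list of rewards (the function name and the sample rewards are illustrative):

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """Discounted return-to-go G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... for one episode."""
    out = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

g = returns_to_go([1.0, 0.0, 2.0], gamma=0.5)
```

Here `g` is `[1.5, 1.0, 2.0]`: the last step only sees its own reward, and earlier steps accumulate discounted future rewards without including anything collected before them.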



      
        Variance reduction: baselines and advantages
        
          Subtract a baseline b(s_t) that does not depend on the
          sampled action — the gradient remains unbiased if b is action-independent:
        
        
          ∇_θ J ≈ E [ ∑_t ∇_θ log π_θ(a_t|s_t) (G_t - b(s_t)) ]
        
        
          A common baseline is the state-value estimate V^π(s). Define the advantage
          A^π(s,a) = Q^π(s,a) - V^π(s): how much better action a is than average in state s.
          Policy updates use ∇ log π(a|s) · A(s,a), which lowers variance dramatically.
          See MDPs and value functions for the Bellman foundations behind V and Q.
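The unbiasedness claim can be checked exactly on a two-armed bandit, where the expectation over actions is a finite sum. The sketch below is illustrative only: the reward values and the helper name `estimator_stats` are made up for the example.

```python
import numpy as np

def estimator_stats(probs, rewards, baseline):
    """Exact mean and variance of the single-sample estimator (onehot(a) - π)(r_a - b).

    Only the gradient component for arm 0's logit is tracked.
    """
    n = len(probs)
    samples = np.array([((1.0 if a == 0 else 0.0) - probs[0]) * (rewards[a] - baseline)
                        for a in range(n)])
    mean = np.sum(probs * samples)
    var = np.sum(probs * (samples - mean) ** 2)
    return mean, var

probs = np.array([0.5, 0.5])       # uniform softmax policy over two arms
rewards = np.array([10.0, 11.0])   # deterministic reward per arm (made-up values)
value = np.sum(probs * rewards)    # V = 10.5, the state-value baseline

mean_raw, var_raw = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=value)
```

Both estimators have mean -0.25, so the baseline does not shift the expected gradient. The variance drops from about 27.6 to 0 in this toy case, because the raw estimator is dominated by the shared reward offset rather than by the difference between arms.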
        
        Generalized Advantage Estimation (GAE)
        
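          GAE blends multi-step returns with a trace parameter λ to trade bias
          against variance in the advantage estimates: λ = 0 reduces to the one-step
          TD error (lower variance, more bias), while λ = 1 recovers the Monte Carlo
          return minus the baseline (higher variance, less bias). PPO and A2C
          implementations commonly use GAE with λ ∈ [0.9, 0.98] rather than
          single-step TD errors alone.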
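A minimal GAE sketch for a single rollout segment with no episode boundary inside it. The function signature is an assumption for illustration; library implementations usually also handle per-step termination flags and batches of environments.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout segment.

    values[t] is V(s_t); last_value is V(s_T), used to bootstrap the final step.
    Pass last_value=0.0 when the segment ends in a terminal state.
    """
    T = len(rewards)
    values_ext = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]  # TD error
        gae = delta + gamma * lam * gae                                  # λ-weighted sum
        advantages[t] = gae
    return advantages

rewards = np.array([1.0, 1.0])
values = np.array([0.0, 0.0])
adv_td = compute_gae(rewards, values, 0.0, gamma=1.0, lam=0.0)  # one-step TD errors
adv_mc = compute_gae(rewards, values, 0.0, gamma=1.0, lam=1.0)  # Monte Carlo return minus V
```

The two calls show the endpoints of the trade-off: with λ = 0 the advantages are `[1.0, 1.0]`, and with λ = 1 they are `[2.0, 1.0]`.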
        
      

      
        Actor-critic architectures
        
          An actor-critic splits roles:
        
        
          Actor — parameterized policy π_θ(a|s).
          Critic — value function V_φ(s) or Q_φ(s,a) estimating returns for baseline and advantage.
        
        
          The critic is trained with TD or Monte Carlo regression; the actor uses the
          critic's advantage signal. Updates can alternate or run in parallel.
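The split of roles can be sketched as a one-step tabular actor-critic update. This assumes a softmax actor and a table-based critic purely for illustration; the names and learning rates are hypothetical, and deep implementations replace both tables with networks.

```python
import numpy as np

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.01, alpha_critic=0.1):
    """One-step tabular actor-critic update, applied in place.

    theta: (n_states, n_actions) softmax logits (actor); v: (n_states,) values (critic).
    Returns the TD error, which doubles as a one-step advantage estimate.
    """
    target = r + (0.0 if done else gamma * v[s_next])
    td_error = target - v[s]
    # Critic: move V(s) toward the TD target.
    v[s] += alpha_critic * td_error
    # Actor: ∇ log π(a|s) for softmax logits is onehot(a) - π(·|s).
    logits = theta[s] - theta[s].max()
    probs = np.exp(logits) / np.exp(logits).sum()
    score = -probs
    score[a] += 1.0
    theta[s] += alpha_actor * td_error * score
    return td_error

theta = np.zeros((2, 2))
v = np.zeros(2)
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
```

Here the critic and actor are updated from the same transition, one after the other; alternating or parallel schedules change the bookkeeping but not the two update rules.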
        
        A2C and A3C
        
          Advantage Actor-Critic (A2C) synchronously rolls out multiple
          workers and averages gradients. A3C (asynchronous) uses
          Hogwild-style parallel actors with stale parameters — historically popular,
          largely superseded by PPO with vectorized envs on GPUs.
        
        Continuous control: Gaussian policies
        
          For continuous actions, the actor outputs mean and log-standard-deviation of a
          Gaussian (or squashed Gaussian via tanh for bounded actions). Log-probabilities
          are tractable; reparameterization is not required for REINFORCE-style updates
          but appears in off-policy actor-critic methods like SAC.
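A sketch of the log-probability computations for a diagonal Gaussian policy and its tanh-squashed variant. The function names are illustrative, and the small constant inside the logarithm is a common numerical guard rather than part of the math.

```python
import numpy as np

def gaussian_log_prob(u, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    std = np.exp(log_std)
    per_dim = -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    return per_dim.sum(axis=-1)

def squashed_log_prob(u, mean, log_std):
    """Log-density of a = tanh(u) via the change-of-variables correction."""
    correction = np.log(1.0 - np.tanh(u) ** 2 + 1e-6).sum(axis=-1)
    return gaussian_log_prob(u, mean, log_std) - correction

def sample_action(rng, mean, log_std):
    """Draw a pre-squash sample u and return (tanh(u), u)."""
    u = mean + np.exp(log_std) * rng.standard_normal(mean.shape)
    return np.tanh(u), u

mean, log_std = np.zeros(2), np.zeros(2)
logp = gaussian_log_prob(np.zeros(2), mean, log_std)   # equals -log(2π) for two unit-variance dims
action, u = sample_action(np.random.default_rng(0), mean, log_std)
```

Keeping the pre-squash sample `u` alongside the action avoids inverting tanh later when the log-probability is needed for the policy update.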
        
      

      
        PPO: stable on-policy updates at scale
        
          Proximal Policy Optimization (PPO) (Schulman et al., 2017) is
          the default on-policy workhorse. It maximizes a clipped surrogate objective
          that prevents destructively large policy updates:
        
        
          L^CLIP(θ) = E [ min( r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t ) ]
        
        
          where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t)
          is the probability ratio and Â_t is the advantage estimate.
          Clipping with ε ≈ 0.1–0.2 keeps new and old policies
          close while still improving reward. PPO is simpler than TRPO (trust region with
          KL constraints) and empirically robust across robotics, games, and
          LLM reasoning fine-tuning variants.
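The clipped objective is easy to evaluate by hand on a tiny batch. The sketch below only computes the value of L^CLIP with NumPy; in training, an autodiff framework would differentiate the same expression with respect to θ. The function name and sample probabilities are illustrative.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, averaged over the batch (to be maximized)."""
    ratio = np.exp(log_prob_new - log_prob_old)              # r_t(θ)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

log_prob_old = np.log(np.array([0.2, 0.4]))
log_prob_new = np.log(np.array([0.3, 0.2]))   # ratios 1.5 and 0.5
advantages = np.array([1.0, -1.0])
objective = ppo_clip_objective(log_prob_new, log_prob_old, advantages)
```

For the first sample the ratio 1.5 is capped at 1.2, so pushing a good action's probability further earns no extra credit. For the second, the ratio 0.5 with a negative advantage is scored as 0.8 × (−1) = −0.8, the more pessimistic of the two terms. The batch objective is 0.2.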
        
        Practical PPO hyperparameters
        
          Rollout length: 128–2048 steps per env before update.
          Epochs per batch: 3–10 passes over the same data (on-policy reuse).
          Entropy bonus: encourages exploration; decay as policy matures.
          Value loss coefficient: typically 0.5; some implementations also clip the value-function update, mirroring the policy clip, for stability.
          Gradient clipping: global norm 0.5–1.0 prevents critic spikes from destabilizing the actor.
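One way to keep these settings in one place is a small frozen config object. The defaults below are illustrative picks from inside the ranges listed above, not recommended values; the entropy coefficients and the linear decay schedule are assumptions, since the text gives no numbers for them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    """Illustrative defaults picked from inside the ranges listed above."""
    rollout_steps: int = 512          # steps per env before each update (128–2048)
    epochs_per_batch: int = 4         # passes over each rollout batch (3–10)
    clip_eps: float = 0.2             # policy ratio clip (0.1–0.2)
    gae_lambda: float = 0.95          # GAE trace parameter (0.9–0.98)
    entropy_coef_start: float = 0.01  # assumed starting entropy bonus
    entropy_coef_end: float = 0.0     # assumed final entropy bonus
    value_loss_coef: float = 0.5
    max_grad_norm: float = 0.5        # global gradient-norm clip (0.5–1.0)

    def entropy_coef(self, progress: float) -> float:
        """Linearly decay the entropy bonus as training progress goes from 0 to 1."""
        p = min(max(progress, 0.0), 1.0)
        return self.entropy_coef_start + p * (self.entropy_coef_end - self.entropy_coef_start)

cfg = PPOConfig()
mid_training_bonus = cfg.entropy_coef(0.5)
```

Freezing the dataclass makes it harder to mutate hyperparameters mid-run by accident, which helps when a checkpoint has to be matched to the exact settings that produced it.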
        
      

      
        On-policy vs off-policy policy gradients
        
          On-policy methods (REINFORCE, A2C, PPO) learn from data generated
          by the current policy. Sample efficiency is lower but stability is higher.
          Off-policy actor-critics (DDPG, TD3, SAC) reuse replay buffers
          and learn from behavior older than the current policy — better sample
          efficiency for expensive simulators, but more hyperparameter sensitivity.
          Harbor chose on-policy PPO because the simulator was cheap and safety reviewers
          wanted a clear bound on how far each deploy could drift from the last validated
          checkpoint.
        
        Connection to bandits and planning
        
          The single-state bandit is a degenerate MDP; policy gradient reduces to
          weighted log-likelihood updates on arms. For lookahead planning without gradient
          steps, see Monte Carlo tree search and the stateless multi-armed bandits guide.
        
      

      
        Harbor Logistics: AMR pick-path policy refactor
        
          Harbor's mid-size fulfillment center runs eight autonomous mobile robots on
          a 1.2 km² grid with dynamic pick lists. The legacy pipeline:
        
        
          Discretize pose (x, y, heading) into 40×40×8 bins.
          Offline Q-learning on a simplified simulator (no other robots).
          Deploy greedy Q with a static obstacle map.
        
        
          Failures clustered around layout changes, bidirectional aisle sharing, and
          non-Markovian congestion (other robots not in state). Retraining the Q-table
          after each layout tweak took 2–4 engineer-days.
        
        
          The replacement PPO stack:
        
        
          Observation: 72-dim vector — local occupancy grid (48), goal-relative pose (8), remaining tote slots (4), time-to-deadline scalar (1), neighbor robot relative positions (11).
          Action: continuous forward speed and yaw rate, squashed to warehouse limits.
          Reward: +1 per successful pick, −0.01 per second, −5 near-collision (lidar threshold), −20 actual collision, +3 early completion bonus.
          Training: 512 parallel Isaac-style sim envs, GAE λ=0.95, clip ε=0.15, 4M steps to convergence.
          Deploy: ONNX actor on robot; critic dropped; two-day on-floor fine-tune with safety wrapper that overrides if lidar clearance < 30 cm.
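The reward terms and the clearance override listed above can be sketched as two small functions. The coefficients and the 30 cm threshold come from the list; everything else is hypothetical, including the `StepEvents` structure, the choice to apply the near-collision and collision penalties independently, and the choice to stop in place when the override fires (the text does not say what the wrapper commands).

```python
from dataclasses import dataclass

@dataclass
class StepEvents:
    """Hypothetical per-step event summary produced by the simulator."""
    picks: int = 0
    elapsed_s: float = 0.0
    near_collision: bool = False
    collision: bool = False
    completed_early: bool = False

def step_reward(ev: StepEvents) -> float:
    """Sum the reward terms listed in the text for one step."""
    reward = 1.0 * ev.picks - 0.01 * ev.elapsed_s
    if ev.near_collision:
        reward -= 5.0
    if ev.collision:
        reward -= 20.0
    if ev.completed_early:
        reward += 3.0
    return reward

MIN_CLEARANCE_M = 0.30  # override threshold from the text (30 cm)

def safe_command(speed, yaw_rate, lidar_clearance_m):
    """Override the learned action when clearance is too small.

    Stopping in place is an assumption made for this sketch.
    """
    if lidar_clearance_m < MIN_CLEARANCE_M:
        return 0.0, 0.0
    return speed, yaw_rate

r = step_reward(StepEvents(picks=1, elapsed_s=10.0, near_collision=True))
cmd = safe_command(1.2, 0.3, lidar_clearance_m=0.25)
```

In this example one pick over ten seconds with a near-collision nets 1 − 0.1 − 5 = −4.1, which shows how strongly the safety penalties dominate the per-pick reward. The wrapper sits outside the policy, so it behaves the same regardless of what the actor outputs.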
        
        
          Results over four weeks live: mean pick cycle −19%, near-miss rate −34%,
          and layout-change revalidation dropped from days to re-running a fixed 50k-step
          fine-tune overnight. The team still maintains a model-based particle filter
          for localization; policy gradients handle where to go next, not pose estimation.
        
      

      
        Method decision table
        
          
            
Approach	Best for	Tradeoff
REINFORCE / reward-to-go	Teaching, small discrete envs, proof-of-concept	High variance; slow convergence
Actor-critic (A2C)	Moderate-scale discrete/continuous control	Sensitive to critic bias; superseded by PPO in most cases
PPO	Robotics sim2real, games, on-policy LLM RL	Sample-inefficient vs off-policy; needs many parallel envs
SAC / TD3	Expensive real-world steps, continuous control	Off-policy instability; harder safety certification
DQN / Q-learning	Discrete actions, Atari-style benchmarks	Awkward for continuous or stochastic policies
MCTS + policy prior	Perfect-information games, planning at inference	Compute-heavy online; not end-to-end learned control
            
          
        
      

      
        Common pitfalls
        
          Ignoring reward scale: unnormalized returns can blow up the critic loss; scale rewards or normalize value targets, and standardize advantages per batch.
          Sparse rewards without shaping: policy gradient needs signal; add dense proxies (distance-to-goal) then anneal.
          Non-stationary data in on-policy loops: PPO reuses rollouts for multiple epochs — too many epochs overfits stale data.
          Critic lag: if V(s) is wrong, advantages mis-rank actions; tune critic learning rate ≥ actor rate.
          Entropy collapse: policy becomes deterministic too early; monitor entropy and use bonus decay schedules.
          Sim-to-real gap: randomize friction, sensor noise, and delays in sim; Harbor's safety wrapper is non-negotiable on floor.
          Violating Markov state: omitting other agents or hidden inventory states makes “optimal” policies brittle — enrich observations or use centralized training.
          Deploying without action bounds: Gaussian tails can command illegal speeds; squash or clip at inference.
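Two of the fixes above are one-liners in practice. The sketch below shows per-batch advantage standardization and a hard clip on sampled actions; the actuator limits are made-up numbers, not Harbor's.

```python
import numpy as np

def standardize(advantages, eps=1e-8):
    """Zero-mean, unit-variance advantages within one batch."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

def bound_action(raw_action, low, high):
    """Hard-clip a sampled action to actuator limits before sending it."""
    return np.clip(raw_action, low, high)

adv = standardize(np.array([10.0, 20.0, 30.0, 40.0]))

# Hypothetical limits: speed in [0, 2] m/s, yaw rate in [-1, 1] rad/s.
cmd = bound_action(np.array([3.7, -2.5]),
                   low=np.array([0.0, -1.0]),
                   high=np.array([2.0, 1.0]))
```

Clipping at inference is a last line of defense. If the policy was trained with a tanh-squashed output, the same squashing should be used at deployment so the executed actions match what the policy was optimized for.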
        
      

      
        Practitioner checklist
        
          Define MDP tuple explicitly: state, action, reward, episode termination (see MDP guide).
          Start with reward-to-go REINFORCE on a toy env to verify gradient sign before scaling.
          Add value baseline or critic before training on real hardware.
          Use GAE for advantage estimation in PPO/A2C implementations.
          Log policy entropy, KL to previous policy, value loss, and episode return percentiles.
          Vectorize environments; target 10k–100k steps per PPO update on GPU.
          Checkpoint actor and critic separately; export actor-only for deployment.
          Wrap live policies with hard safety overrides independent of learned actions.
          Version sim assets with policy checkpoints for reproducible revalidation.
          Compare against strong baselines (A*, heuristic controller) before claiming RL wins.
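The logging item in the checklist can be sketched with three small helpers. These are illustrative and assume a discrete (categorical) policy for the entropy term; the KL helper is the simple sample-based estimate that uses log-probabilities of actions drawn under the old policy.

```python
import numpy as np

def categorical_entropy(probs):
    """Mean entropy (nats) of a batch of categorical action distributions."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=-1)))

def approx_kl(log_prob_old, log_prob_new):
    """Sample estimate of KL(π_old || π_new) from actions drawn under π_old."""
    return float(np.mean(np.asarray(log_prob_old) - np.asarray(log_prob_new)))

def return_percentiles(episode_returns, qs=(10, 50, 90)):
    """Selected percentiles of episode return, keyed by percentile."""
    return dict(zip(qs, np.percentile(episode_returns, qs)))

entropy = categorical_entropy(np.full((3, 4), 0.25))    # uniform over 4 actions
kl = approx_kl(np.log([0.5, 0.5]), np.log([0.5, 0.5]))  # identical policies
pct = return_percentiles([1.0, 2.0, 3.0, 4.0, 5.0])
```

A uniform policy over four actions has entropy log 4 ≈ 1.386 nats, which gives a reference point for spotting entropy collapse. Tracking percentiles rather than only the mean return makes rare catastrophic episodes visible.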
        
      

      
        Key takeaways
        
          Policy gradients optimize the behavior rule directly via the log-probability trick — essential for continuous and stochastic control.
          Vanilla REINFORCE is a teaching tool; production stacks use baselines, critics, and clipped objectives (PPO).
          Advantage estimation and GAE are the main levers for variance reduction and stable learning.
          On-policy PPO trades sample efficiency for stability; off-policy SAC wins when real-world samples are expensive.
          Harbor-style deployments pair sim-trained actors with safety wrappers and localization systems that stay outside the RL loop.
        
      

      
        Related reading
        
          Reinforcement learning explained — MDPs, exploration, Q-learning, and RLHF landscape
          Markov decision processes explained — Bellman equations, value iteration, and optimal policies
          RLHF explained — policy optimization for aligning language models with human preferences
          Monte Carlo tree search explained — planning with learned policy priors
        
        
