Reinforcement learning explained
Most machine learning learns from labeled examples: this image is a cat, this email is spam. Reinforcement learning (RL) learns from consequences. An agent takes actions in an environment, receives scalar rewards, and adjusts its policy — the rule mapping situations to behavior — to maximize cumulative reward over time. RL trained AlphaGo, powers warehouse robots, tunes recommendation systems, and (through RLHF) helps align chatbots with human preferences. This guide explains the core math intuition, major algorithms, engineering pitfalls, and when RL beats supervised learning.
How RL differs from supervised learning
In supervised learning, every training example includes the correct answer. The model minimizes prediction error against fixed labels. In RL, the agent never sees an oracle action — only the reward signal that follows its own choices. A chess move might look brilliant until a counterattack three turns later reveals it was a blunder. Credit assignment — figuring out which past actions caused which outcomes — is the central difficulty.
RL also deals with sequential decision-making. Today's action changes tomorrow's state. A trading bot that maximizes today's profit might bankrupt the portfolio next week. The objective is usually the discounted cumulative reward: a reward received t steps in the future is weighted by γ^t, where the discount factor γ (gamma) lies between 0 and 1, so distant outcomes matter less than immediate ones. Our machine learning fundamentals guide contrasts supervised, unsupervised, and reinforcement paradigms at a high level; this page goes deep on the third.
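To make the discounted objective concrete, here is a minimal sketch that computes the return of one finished episode. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: three steps, reward 1.0 at each step.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With γ close to 1 the agent is far-sighted; with γ close to 0 it cares almost only about the next reward.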
The Markov decision process (MDP)
RL problems are formalized as Markov decision processes. An MDP has:
- States (S) — a description of the situation (board position, robot joint angles, game screen pixels).
- Actions (A) — what the agent can do (move piece, apply torque, click button).
- Transition dynamics — probability of landing in state s′ after taking action a in state s. The environment may be stochastic (dice rolls, sensor noise).
- Reward function R(s, a, s′) — immediate feedback, often sparse (+1 for winning, 0 otherwise).
- Discount γ — how much future reward counts today.
The Markov property says the future depends only on the current state, not the full history. When raw observations violate that (partial observability), agents use memory — recurrent networks or frame stacks — to build a sufficient state estimate.
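Frame stacking, one of the memory tricks mentioned above, can be sketched in a few lines. The `FrameStack` class below is a hypothetical helper written for this guide, assuming observations arrive one at a time; real environment wrappers differ in detail.

```python
from collections import deque


class FrameStack:
    """Keep the last k observations and expose them together as one state."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is dropped automatically

    def reset(self, first_obs):
        # At episode start there is no history, so repeat the first frame k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_obs)
        return tuple(self.frames)

    def step(self, obs):
        self.frames.append(obs)
        return tuple(self.frames)


stack = FrameStack(k=3)
print(stack.reset("f0"))  # ('f0', 'f0', 'f0')
print(stack.step("f1"))   # ('f0', 'f0', 'f1')
```

Stacking a few consecutive frames lets the agent infer quantities such as velocity that a single frame cannot show.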
Policy, value functions, and the objective
A policy π(a|s) is the agent's strategy: the probability of taking each action a in state s. Deterministic policies pick one action; stochastic policies sample (useful for exploration). The state-value function V^π(s) is the expected cumulative reward from starting in s and following π. The action-value function Q^π(s,a) is the expected cumulative reward from taking a in s and then following π.
Optimal policies maximize expected return. In tabular settings (small finite state spaces), dynamic programming can solve MDPs exactly. Real problems — Atari pixels, continuous control — need function approximation, usually neural networks. That intersection is deep reinforcement learning, covered in our deep learning guide.
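As an illustration of the tabular case, the sketch below runs value iteration (one dynamic programming method) on a made-up three-state MDP. The transition table `P`, the action names, and all numbers are assumptions chosen only to keep the example small.

```python
GAMMA = 0.9

# Hypothetical MDP: P[state][action] -> list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {},  # terminal state: no actions available
}


def q_value(V, s, a):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal states keep value 0
            best = max(q_value(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(V):
    """Pick the action with the highest backed-up value in each state."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P if P[s]}


V = value_iteration()
print(V)
print(greedy_policy(V))  # {0: 'go', 1: 'go'}
```

This only works because every state and transition can be enumerated; with pixel inputs or continuous states the table is replaced by a learned function.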
Exploration vs exploitation
An agent that only exploits, repeating known-good actions, never discovers better ones. An agent that only explores, acting randomly, never consolidates what it has learned. Balancing the two is non-trivial: the multi-armed bandit problem captures the simplest version (one state, many actions).
Common exploration strategies:
- ε-greedy — with probability ε, pick a random action; otherwise pick the best known action. Decay ε over training.
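- Boltzmann / softmax — sample actions with probability proportional to exp(Q/τ), where the temperature τ controls how greedy the choice is.
- Upper confidence bound (UCB) — pick the action with the highest optimistic estimate: the value estimate plus a bonus that grows with uncertainty.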
- Intrinsic motivation — bonus reward for visiting novel states (curiosity-driven RL).
Under-exploration leaves reward on the table; over-exploration wastes samples in a domain where data is expensive. Robotics simulators can generate millions of episodes; real hardware might allow thousands. Sample budget drives algorithm choice.
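The first three strategies in the list above can be sketched for a single state with a handful of actions. The function names are illustrative, and the UCB bonus uses the common UCB1 form sqrt(c · ln t / n), which is an assumption about which variant is meant.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature):
    """Boltzmann distribution: P(a) proportional to exp(Q(a) / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def ucb_action(q_values, counts, t, c=2.0):
    """UCB1-style choice: value estimate plus an uncertainty bonus."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every action once before trusting the bonus
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


q = [0.1, 0.9, 0.3]
print(epsilon_greedy(q, epsilon=0.0))        # 1 (pure exploitation)
print(softmax_probs(q, temperature=0.5))     # highest probability on action 1
print(ucb_action([0.5, 0.4], [100, 1], 101)) # 1 (rarely tried, large bonus)
```

Lower temperatures and smaller ε both push the agent toward exploitation; the UCB bonus shrinks automatically as an action is tried more often.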
Major algorithm families
Value-based: Q-learning and DQN
Q-learning learns Q(s,a) directly via temporal-difference updates: after observing reward r and next state s′, move Q(s,a) a small step (set by a learning rate α) toward the target r + γ max_a′ Q(s′,a′). No model of the environment is required, which makes it a model-free method.
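A single tabular Q-learning update might look like the sketch below. The learning rate `ALPHA`, the state and action names, and the `done` flag for terminal transitions are assumptions added for illustration.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)


def q_update(Q, s, a, r, s_next, actions, done):
    """Move Q[(s, a)] toward the TD target r + gamma * max_a' Q[(s_next, a')]."""
    # No bootstrapping from a terminal state: its future value is zero.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + GAMMA * best_next
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# Target is 1.0 + 0.9 * 2.0 = 2.8, so Q moves halfway from 0.0 to about 1.4.
print(q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions, done=False))
```

The `max` over next actions is what makes this Q-learning rather than an on-policy method: the target assumes the best next action regardless of what the agent actually does next.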
Deep Q-Networks (DQN) use a neural net as Q-function approximator on high-dimensional inputs (e.g. stacked Atari frames). Tricks that made it work: experience replay (random minibatches from a buffer break correlation), target networks (slow-moving copy stabilizes bootstrap targets), and reward clipping. DQN sparked the modern deep RL era but struggles with continuous action spaces and can overestimate Q-values.
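Two of the tricks named above, experience replay and reward clipping, are simple enough to sketch without a neural network. The `ReplayBuffer` class and the clipping range of [-1, 1] are illustrative assumptions; the target network is not shown here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def clip_reward(r, low=-1.0, high=1.0):
    """Clamp rewards so games with different score scales train similarly."""
    return max(low, min(high, r))


buf = ReplayBuffer(capacity=3)
for step in range(5):
    buf.push(step, 0, clip_reward(10.0 * step), step + 1, False)
print(len(buf))       # 3 (only the newest transitions are kept)
print(buf.sample(2))  # two random transitions from the buffer
```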
Policy-based: REINFORCE and PPO
Policy gradient methods optimize π directly. REINFORCE increases the probability of actions that led to high returns. Its gradient estimates have high variance; subtracting a baseline (typically a value function estimate) reduces it. Proximal Policy Optimization (PPO) is the workhorse today: its clipped objective prevents destructively large policy updates, which makes it stable enough for robotics, games, and RLHF fine-tuning at scale.
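The clipping idea in PPO can be shown on plain numbers. The sketch below evaluates the clipped surrogate objective for a batch of log-probabilities and advantage estimates; the function name, the default `clip_eps=0.2`, and the inputs are assumptions, and a real implementation would compute this on tensors and differentiate through it.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over the batch."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the min removes any gain from pushing the ratio
        # further than the clip range in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with positive advantage: the gain is capped at about 1.2 * 1.0.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
# Ratio 0.5 with negative advantage: capped at about 0.8 * -1.0.
print(ppo_clip_objective([math.log(0.5)], [0.0], [-1.0]))
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective is flat, so the optimizer has no incentive to move the policy further in one update.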
Actor-critic and model-based RL
Actor-critic combines a policy (actor) with a value estimator (critic). A3C, SAC (Soft Actor-Critic for continuous control), and TD3 are widely used variants. Model-based RL learns a transition model and plans inside it — sample-efficient when the model is accurate, brittle when it is not. Hybrid approaches (Dreamer, MuZero) remain active research frontiers.
Reward design and shaping
RL is only as good as the reward function. Sparse rewards (+1 at the goal only) make learning slow; agents can wander aimlessly for millions of steps. Dense shaping rewards (small bonuses for progress toward the goal) speed training but risk reward hacking: the agent satisfies the metric without achieving the intent. A classic example is a boat-racing agent that circles forever collecting respawning power-ups because the reward paid out for the pickups rather than for finishing the race.
Best practices:
- Align rewards with the true objective; audit trajectories manually.
- Penalize unsafe or degenerate behavior explicitly.
- Use constrained RL or Lagrangian methods when hard safety limits exist.
- Prefer inverse RL or imitation learning when expert demonstrations are available.
When labels exist, ask whether pure RL is necessary. Behavioral cloning (supervised learning on expert trajectories) is simpler and often a strong baseline before adding RL fine-tuning.
RLHF: reinforcement learning for language models
Large language models are first trained with supervised next-token prediction on text corpora. That teaches fluency, not alignment with human intent. Reinforcement Learning from Human Feedback (RLHF) adds a second stage:
- Collect human comparisons — which of two model outputs is better for a prompt.
- Train a reward model to predict human preference scores.
- Fine-tune the LLM with PPO (or similar) to maximize reward model score while staying close to the original model via a KL penalty (avoid collapse into gibberish that tricks the reward model).
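The last two steps above can be sketched numerically. This assumes a Bradley-Terry style pairwise loss for the reward model and a per-sample log-probability ratio as the KL estimate; both are common choices but are assumptions here, as are the function names and the value of `beta`.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected), in a numerically stable form."""
    diff = score_chosen - score_rejected
    # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward model score minus beta times the log-ratio to the original model."""
    return rm_score - beta * (logp_policy - logp_reference)


# The reward model is penalized when it ranks the rejected answer higher.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True

# Drifting from the reference model (higher log-ratio) lowers the reward.
print(kl_penalized_reward(1.0, logp_policy=-2.0, logp_reference=-3.0, beta=0.5))  # 0.5
```

The `beta` coefficient trades off reward model score against staying close to the original model; setting it too low invites the gibberish collapse described above.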
RLHF is a major reason assistants refuse harmful requests and follow instructions more reliably than raw pretrained models. It also introduces new failure modes: reward model exploitation, verbosity bias (longer answers score higher), and sensitivity to who labeled the data. Alternatives like DPO (Direct Preference Optimization) skip the explicit RL loop by optimizing the policy directly on preference pairs, which is cheaper and increasingly popular. See our LLM fine-tuning guide for when RLHF, DPO, or supervised fine-tuning alone makes sense.
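For comparison, the DPO objective for a single preference pair can be written without any reward model or rollout loop. The sketch below assumes sequence-level log-probabilities under the trained policy and a frozen reference model are already available; the argument names and `beta` value are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Stable softplus(-margin), equal to -log sigmoid(margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: loss is log(2), about 0.693.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Policy has shifted probability toward the chosen answer: loss is lower.
print(dpo_loss(-0.5, -2.5, -1.0, -2.0, beta=1.0))
```

The loss falls as the policy raises the chosen answer's likelihood relative to the reference more than it raises the rejected answer's.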
Engineering RL in production
Simulators and the reality gap
Training in simulation is cheap; deploying on real hardware introduces sim-to-real gap — physics mismatch, latency, wear. Domain randomization (vary friction, lighting, noise during training) and system identification help bridge it. Always budget for real-world fine-tuning.
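Domain randomization often amounts to resampling simulator parameters at the start of every episode. The parameter names and ranges below are made up for illustration; a real project would take them from its own simulator and from measurements of the hardware.

```python
import random

# Hypothetical parameter ranges: (low, high) for each randomized quantity.
RANGES = {
    "friction": (0.5, 1.5),
    "light_intensity": (0.3, 1.0),
    "sensor_noise_std": (0.0, 0.05),
}


def sample_sim_params(rng):
    """Draw one independent uniform sample per parameter for a new episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}


rng = random.Random(0)  # seeded so runs are reproducible
for episode in range(3):
    params = sample_sim_params(rng)
    # A real training loop would configure the simulator with params here.
    print(episode, params)
```

Training across the whole range encourages a policy that does not depend on any single value of friction, lighting, or noise being exactly right.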
Sample efficiency and offline RL
Online RL needs live interaction — expensive for finance, healthcare, or production systems. Offline RL learns from logged historical data without exploring online. Useful when logs exist but experimentation is risky; distribution shift between logging policy and learned policy remains a hazard.
Evaluation is harder than training
RL metrics fluctuate with seed, environment version, and opponent. Report mean and variance across many runs. Watch for policies that memorize training levels but fail on held-out scenarios. For LLM RLHF, maintain golden prompt sets and human eval — automatic reward scores lie.
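Reporting across seeds needs very little code. In the sketch below the per-seed returns are invented numbers standing in for real evaluation results, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(returns_by_seed):
    """Return (mean, sample standard deviation) of final returns across seeds."""
    values = list(returns_by_seed.values())
    return statistics.mean(values), statistics.stdev(values)


# Made-up final evaluation returns for five training seeds.
returns_by_seed = {0: 210.0, 1: 195.0, 2: 40.0, 3: 205.0, 4: 200.0}
mean, std = summarize(returns_by_seed)
print(f"{mean:.1f} ± {std:.1f} over {len(returns_by_seed)} seeds")
```

In this invented data one seed collapsed; the large standard deviation exposes that, whereas a single lucky run would hide it.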
Autonomous systems that call tools during RL-style loops share similar safety concerns with AI agents and tool use: sandbox actions, cap iteration count, and log every reward-shaping decision.
When to use RL (decision framework)
| Signal | Prefer |
|---|---|
| Labeled input-output pairs for every situation | Supervised learning |
| Expert demonstrations, no reward function | Imitation learning / behavioral cloning |
| Clear reward, sequential decisions, cheap rollouts | RL (PPO, SAC, etc.) |
| Human preference on generated text | RLHF or DPO |
| Historical logs only, no live exploration | Offline RL (e.g. conservative Q-learning) |
| Combinatorial search with known rules | Planning (MCTS, A*) may beat RL |
RL shines when the optimal behavior is emergent — strategies too complex to label — and when simulation or safe exploration is affordable. It fails when rewards are misspecified, data is scarce, or a simpler method already solves 95% of the problem.
Common pitfalls
- Misspecified rewards — agent optimizes the wrong thing brilliantly.
- Ignoring non-stationarity — environment rules change; policy goes stale.
- Single-seed hero runs — publish mean ± std over ≥5 seeds.
- Leaking test environments into training — overfitting level layouts.
- Unbounded action spaces without normalization — unstable gradients in continuous control.
- Skipping supervised pretraining — cold-start RL wastes samples.
- Reward model overoptimization in RLHF — fluent nonsense that scores high.
Key takeaways
- RL learns policies from reward signals, not fixed labels — credit assignment is hard.
- Formalize problems as MDPs: states, actions, transitions, rewards, discount γ.
- Balance exploration and exploitation; sample budget dictates how aggressive exploration can be.
- Q-learning / DQN for discrete actions; PPO / SAC for continuous control; PPO for RLHF.
- Reward design is engineering — sparse rewards need shaping; shaping invites hacking.
- RLHF aligns LLMs via a learned reward model + policy optimization (or DPO as a simpler alternative).
- Evaluate on held-out scenarios with variance across seeds; sim-to-real gaps need explicit mitigation.
Related reading
- Machine learning fundamentals — supervised vs unsupervised vs RL overview
- Deep learning explained — neural networks that power DQN, PPO, and reward models
- LLM fine-tuning — RLHF, DPO, and when alignment training pays off
- AI agents and tool use — action loops, guardrails, and reward-like scoring in agents