Guide
GRPO explained
A math tutor model must learn to reason, not just guess. Supervised fine-tuning on worked solutions teaches format, but it does not teach the model to explore better chains of thought when the first attempt fails. Reinforcement learning can fix that — but classic PPO for LLMs needs a separate value (critic) network, fragile advantage estimates, and a reward model that is expensive to train and easy to hack. Group relative policy optimization (GRPO), introduced in the DeepSeekMath work and popularized by reasoning-focused models like DeepSeek-R1, sidesteps the critic entirely. For each prompt the policy samples a group of completions, scores them with a reward function (often verifiable: did the final answer match?), normalizes rewards within that group, and applies a policy-gradient update weighted by relative advantage. No pairwise preference labels, no reward-model regression — just outcome feedback on multiple tries per question. This guide covers the GRPO training loop, reward design for math and code, how GRPO compares to PPO and DPO, integration with chain-of-thought and test-time compute, a Harbor Analytics tutor worked example, a method decision table, common pitfalls, and a production checklist.
What problem GRPO solves
Alignment methods split into two families. Preference optimization (DPO, ORPO, KTO) learns from human rankings: “answer A is better than B.” That works for tone, helpfulness, and refusal style, but it struggles when the training signal is a single scalar outcome — correct proof, passing unit tests, compiled SQL that returns the right row count. You could hire annotators to rank eight chain-of-thought traces per problem, but that does not scale for millions of STEM prompts.
Policy-gradient RL instead optimizes expected reward directly. PPO samples completions, scores them with a learned reward model, estimates advantages with a critic, and clips policy updates. The critic is typically a second transformer of comparable size to the policy, so it roughly doubles the memory needed for trainable weights and optimizer state, and it ties training stability to the quality of its advantage estimates. When rewards are sparse (0 for wrong, 1 for right), per-token value estimates are especially noisy.
GRPO’s insight: if you always sample G completions per prompt, you can define advantage relative to the other samples in the same group. A completion that solves the problem when its siblings failed gets positive advantage; a wrong answer in a group where another sample was correct gets negative advantage. Subtracting the group mean (and optionally dividing by the group standard deviation) reduces variance much like a baseline in REINFORCE, without training a separate baseline network.
When GRPO is a good fit
- Verifiable rewards — math answer checking, code execution, SQL result sets, structured JSON schema validation.
- Reasoning and exploration — you want the model to try longer chains of thought and learn from failed attempts in the same group.
- GPU memory is tight — dropping the critic frees memory for larger groups or bigger base models.
- Outcome labels exist — problem/solution datasets (GSM8K, MATH, HumanEval) rather than pairwise chat preferences.
GRPO is a weaker default for open-ended chat alignment, subjective creativity, or safety nuance where human preferences are the ground truth and no cheap verifier exists. Hybrid stacks are common: SFT on demonstrations, GRPO on reasoning subsets, DPO on conversational tone.
How GRPO works step by step
Training iterates over prompts x from a dataset (math word problems, coding tasks, logic puzzles). For each prompt:
- Sample a group: draw G completions {y1, …, yG} from the current policy πθ, usually with temperature > 0 to encourage diversity inside the group.
- Score each completion: apply the reward function ri = R(x, yi). Examples: binary correctness after answer extraction, partial credit for passing some test cases, a format bonus for valid LaTeX boxed answers.
- Compute group-relative advantages: Ai = (ri − μgroup) / (σgroup + ε), where μgroup and σgroup are the mean and standard deviation of the G rewards. Some implementations use only mean subtraction.
- Policy gradient update: increase the log-probability of completions with positive Ai and decrease it for negative ones, typically with a KL penalty to a reference SFT policy (same role as in PPO/DPO) so the model does not collapse into reward hacking.
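The scoring and normalization steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `group_advantages`, the `eps` default, and the choice of population standard deviation are assumptions (some implementations use the sample standard deviation, or skip the division entirely).

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6, normalize_std=True):
    """Return one advantage per completion, relative to its own group."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        # Mean-subtraction-only variant mentioned in the text.
        return centered
    sigma = pstdev(rewards)  # assumption: population std over the G rewards
    return [c / (sigma + eps) for c in centered]


# One prompt, G = 4 completions, binary correctness reward.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
print(advantages)  # first entry is positive, the other three are negative

# A group where every completion failed carries no signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

The second call shows why within-group diversity matters: when all rewards are equal, every advantage is zero and the prompt contributes nothing to the update.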
Why grouping matters
Absolute rewards drift with curriculum difficulty: early batches may average 0.05 correctness, later batches 0.4. A fixed global baseline mis-calibrates gradients. Group normalization makes each prompt a mini-tournament: the policy learns “what differentiated the winner here” rather than “was this globally above average.” Larger G (8–64 in published recipes) lowers gradient variance but multiplies inference cost during training.
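A toy comparison makes the mini-tournament point concrete. The reward values and prompt names below are made up for illustration; the sketch only contrasts centering on one global baseline with centering on each group's own mean.

```python
from statistics import mean

# Hypothetical binary rewards for two prompts of different difficulty, G = 4.
groups = {
    "hard_prompt": [0.0, 0.0, 1.0, 0.0],
    "easy_prompt": [1.0, 1.0, 0.0, 1.0],
}


def centered(rewards, baseline):
    return [r - baseline for r in rewards]


# One baseline shared by every prompt in the batch.
global_baseline = mean(r for rs in groups.values() for r in rs)

global_adv = {k: centered(rs, global_baseline) for k, rs in groups.items()}
group_adv = {k: centered(rs, mean(rs)) for k, rs in groups.items()}

# With a global baseline the hard prompt is pushed down on net and the easy
# prompt is pushed up, regardless of which completion won inside each group.
print(sum(global_adv["hard_prompt"]))  # -1.0
print(sum(global_adv["easy_prompt"]))  # 1.0

# With per-group baselines each prompt's advantages sum to zero, so only the
# within-group ranking drives the update.
print(sum(group_adv["hard_prompt"]))  # 0.0
print(sum(group_adv["easy_prompt"]))  # 0.0
```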
KL constraint and reference policy
Like PPO, GRPO keeps a frozen reference model from the SFT checkpoint and penalizes KL divergence from it. Without KL, models exploit verifiers — repeat the question, emit gibberish with a lucky hash collision, or pad reasoning with filler tokens that correlate with success on a narrow train set. The KL coefficient trades exploration against stability; teams often anneal it as reward variance falls.
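To show how the KL term enters the update, here is a numeric sketch of a per-token loss. It assumes a PPO-style clipped probability ratio and a non-negative per-token KL estimate of the form exp(d) − d − 1 with d = log π_ref − log π_θ; the function names, `clip_eps`, and `kl_coef` defaults are illustrative. In a real trainer these values would be tensors in an autodiff framework rather than Python floats.

```python
import math


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, kl_coef=0.04):
    """Loss (to minimize) for one sampled token under the stated assumptions."""
    # Probability ratio between the current policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)

    # Non-negative KL estimate between current policy and frozen reference.
    d = logp_ref - logp_new
    kl = math.exp(d) - d - 1.0

    return -(surrogate - kl_coef * kl)


def completion_loss(logps_new, logps_old, logps_ref, advantage, **kwargs):
    """Mean token loss; every token shares the completion's advantage."""
    losses = [
        grpo_token_loss(n, o, r, advantage, **kwargs)
        for n, o, r in zip(logps_new, logps_old, logps_ref)
    ]
    return sum(losses) / len(losses)


# No drift from the reference: the loss is just the negated advantage.
print(grpo_token_loss(-1.0, -1.0, -1.0, advantage=1.0))  # -1.0
# Drifting away from the reference adds a positive penalty to the loss.
print(grpo_token_loss(-1.0, -1.0, -2.0, advantage=1.0) > -1.0)  # True
```

Raising `kl_coef` makes the penalty dominate sooner, which is the exploration-versus-stability trade-off described above.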
Reward design for reasoning tasks
GRPO quality is bounded by reward quality. A noisy verifier teaches noise.
Math and STEM
- Extract final answer: parse \boxed{} or “####” markers; compare normalized numeric or symbolic forms (SymPy).
- Binary vs shaped rewards: pure 0/1 is simple but sparse; an optional small format reward (+0.1 for valid reasoning tags) can speed early training but may cause format hacking if left unannealed.
- Process supervision: step-level labels (PRM) are not required for GRPO but can be blended: r = 0.7 · outcome + 0.3 · mean step score.
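The bullets above could translate into a verifier along these lines. It is a sketch under simplifying assumptions: answers are plain numbers or simple fractions compared exactly with the standard library's `Fraction` (swapped in for SymPy to stay dependency-free), nested braces such as `\boxed{\frac{3}{4}}` are not handled, and all function names are hypothetical.

```python
import re
from fractions import Fraction

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")  # flat \boxed{...} only
HASHES = re.compile(r"####\s*(.+)")


def extract_answer(text):
    """Return the last boxed or '#### ...' answer string, or None."""
    matches = BOXED.findall(text) or HASHES.findall(text)
    return matches[-1].strip() if matches else None


def to_number(s):
    """Turn '1,200', '3/4' or '0.75' into an exact Fraction; None on failure."""
    try:
        return Fraction(s.replace(",", "").replace("$", "").strip())
    except (ValueError, ZeroDivisionError):
        return None


def outcome_reward(completion, gold):
    """Binary reward: 1.0 if the extracted answer equals the gold answer."""
    pred = extract_answer(completion)
    if pred is None:
        return 0.0
    a, b = to_number(pred), to_number(gold)
    if a is None or b is None:
        # Fall back to exact string match for non-numeric answers.
        return 1.0 if pred == gold.strip() else 0.0
    return 1.0 if a == b else 0.0


def blended_reward(completion, gold, step_scores, w_outcome=0.7):
    """r = w * outcome + (1 - w) * mean step score, as in the blend above."""
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return w_outcome * outcome_reward(completion, gold) + (1 - w_outcome) * process


print(outcome_reward("So the share is \\boxed{3/4}", "0.75"))  # 1.0
print(outcome_reward("Total cost\n#### 1,200", "1200"))        # 1.0
print(outcome_reward("I think it is twelve.", "12"))           # 0.0
```

Comparing as exact fractions avoids floating-point false negatives such as 0.1 + 0.2 versus 0.3; a production verifier would likely also need symbolic equivalence.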
Code generation
- Sandbox execution: run unit tests in isolated containers; reward = passed_tests / total_tests.
- Timeout and safety: cap runtime; score a timeout as zero reward instead of letting it crash the trainer or silently drop the sample from its group.
- Hidden tests: hold out tests for eval only; training on public tests alone invites memorization.
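A fractional test-pass reward with a timeout might look like the sketch below. It runs each test in a fresh Python interpreter via `subprocess.run`, which gives process separation and a time limit but is not a sandbox; container isolation, resource limits, and blocked network access are assumed to be handled elsewhere. The function names and the five-second default are illustrative.

```python
import subprocess
import sys


def run_test(solution_code, test_code, timeout_s=5.0):
    """Run the solution plus one test in a new interpreter; True if it passes."""
    program = solution_code + "\n" + test_code + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A timeout is scored as a failed test instead of crashing the trainer.
        return False
    return result.returncode == 0


def code_reward(solution_code, tests, timeout_s=5.0):
    """Reward = passed_tests / total_tests; an empty test list scores 0.0."""
    if not tests:
        return 0.0
    passed = sum(run_test(solution_code, t, timeout_s) for t in tests)
    return passed / len(tests)


candidate = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
print(code_reward(candidate, tests))
```

Because each test is scored independently, a completion that passes some tests still earns partial credit, which softens the sparsity of a pure pass/fail reward.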
Combining with chain-of-thought
Reasoning models train the policy to emit explicit intermediate steps before the final answer. GRPO rewards only the outcome, but gradients flow through the full sequence — the model learns which reasoning paths precede correct answers inside each group. Pair with test-time compute (best-of-N sampling at inference) for additional gains without more RL steps.
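Best-of-N at inference needs some way to pick among the N samples. When no verifier or scorer is available at serving time, one simple stand-in is majority voting over the extracted final answers (often called self-consistency). The sketch below shows that variant, not scored best-of-N, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most common non-empty answer among N samples.

    Ties go to the answer that appeared first; returns None if every
    sample failed answer extraction.
    """
    votes = Counter(a for a in final_answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Four samples for one question; one failed extraction.
print(majority_vote(["42", "41", "42", None]))  # 42
```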
Training workflow and practical settings
- SFT cold start: instruction-tune on chain-of-thought demonstrations so the policy already speaks in steps and respects the answer format. GRPO without SFT tends to waste group samples on unparseable outputs.
- Choose group size G: start with 8 on 7B models; scale to 16–32 if GPU memory allows. Monitor cost: generation and training forward passes both scale linearly with G.
- Sample with diversity: temperature 0.7–1.0; if all group members are identical wrong answers, every advantage is zero and the step is wasted.
- Batch prompts, not completions: run verifiers asynchronously so that CPU-bound math checking does not stall GPU generation.
- Evaluate on held-out verifiers: track pass@1 and pass@G; rising pass@G with flat pass@1 suggests the policy diversifies without improving single-sample quality.
- Checkpoint selection and merging: as in other RL stages, keep an exponential moving average of the weights or pick the checkpoint by eval pass@1, not by training reward (which overfits the verifier).
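For the pass@1 and pass@G tracking mentioned above, a common choice is the combinatorial pass@k estimator: given n samples per problem of which c are correct, pass@k = 1 − C(n − c, k) / C(n, k). The sketch below assumes that estimator; the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given n samples drawn for the problem, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 12 samples for one problem, 3 of them correct.
print(pass_at_k(12, 3, 1))   # 0.25
print(pass_at_k(12, 3, 12))  # 1.0
```

Averaging this value over all eval problems gives the reported metric; the gap between k = 1 and k = G is what separates single-sample quality from diversity.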
Frameworks implementing GRPO-style loops include OpenRLHF, veRL, TRL's GRPO trainer, and custom DeepSeek-style stacks. Hyperparameters overlap with PPO: the learning rate is often lower than for SFT, gradient clipping is essential, and mixed precision is standard on modern GPUs. Use LoRA on large bases when full fine-tuning is prohibitive; because RL updates are noisier than SFT, the adapter rank may need to be higher than in SFT-only recipes.
Worked example: Harbor Analytics math tutor
Harbor Analytics ships an internal tutor for warehouse staff learning inventory arithmetic and basic statistics. The base 7B instruct model, after SFT on 12k worked solutions, reached 54% pass@1 on a held-out word-problem set, but pilots showed brittle reasoning: the model guessed final numbers without intermediate checks.
- Dataset: 28k grade-school through early algebra prompts with SymPy-verifiable answers; held-out 2k for eval.
- GRPO setup: G = 12 samples per prompt, temperature 0.9, binary reward on the extracted answer, KL coefficient 0.04 to the SFT reference, LoRA rank 128 on attention layers, 1.5 epochs.
- Verifier: parse the FINAL: line; normalize fractions and units; 200 ms timeout per check.
- Monitoring: weekly human audit of 50 random chains for logical errors despite correct numbers (a reward hacking signal).
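The verifier bullet leaves the details open, so the following is only one plausible reading. It assumes the answer sits on a line starting with `FINAL:`, that "normalizing units" means ignoring a trailing unit word instead of converting between units, and that values compare exactly as fractions. The 200 ms timeout is omitted, and `parse_final` and `harbor_reward` are hypothetical names.

```python
import re
from fractions import Fraction

FINAL_LINE = re.compile(r"^FINAL:\s*(.+)$", re.MULTILINE)
# Integer, decimal, or simple fraction, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?(?:/\d+)?")


def parse_final(completion):
    """Return the number on the last 'FINAL:' line as a Fraction, or None."""
    lines = FINAL_LINE.findall(completion)
    if not lines:
        return None
    match = NUMBER.search(lines[-1])  # skips unit words such as 'pallets'
    if match is None:
        return None
    try:
        return Fraction(match.group().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None


def harbor_reward(completion, gold):
    """Binary reward on the extracted answer; gold is a numeric string."""
    pred = parse_final(completion)
    return 1.0 if pred is not None and pred == Fraction(gold) else 0.0


trace = "Each pallet holds 25 boxes.\n1.5 pallets short.\nFINAL: 37.5 boxes"
print(harbor_reward(trace, "75/2"))           # 1.0
print(harbor_reward("The answer is 5", "5"))  # 0.0
```

Requiring the `FINAL:` marker means a correct number buried in the reasoning earns nothing, which keeps the reward tied to the answer format the SFT stage taught.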
Results after GRPO: pass@1 rose from 54% to 71%, and pass@12 on eval reached 89%, up from the 54% SFT baseline, which suggests the policy learned diverse successful strategies. Median reasoning length grew by 34% while the incorrect-guess rate fell; the longer chains correlated with verification steps, not padding. The team did not deploy the raw RL weights: they merged the best GRPO checkpoint with the SFT checkpoint at weights 0.7 and 0.3 for stability on out-of-domain HR policy questions mixed into the same chat UI.
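The text does not say how the 0.7 / 0.3 merge was done. A common approach is linear interpolation of matching parameters, sketched here on toy checkpoints stored as dictionaries of float lists. The function name and data layout are assumptions; real checkpoints would be framework tensors.

```python
def interpolate_weights(rl_state, sft_state, rl_weight=0.7):
    """Blend two checkpoints that share parameter names and shapes."""
    if rl_state.keys() != sft_state.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, rl_values in rl_state.items():
        sft_values = sft_state[name]
        if len(rl_values) != len(sft_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            rl_weight * r + (1.0 - rl_weight) * s
            for r, s in zip(rl_values, sft_values)
        ]
    return merged


# Toy checkpoints with a single two-element parameter.
rl = {"layer.weight": [1.0, 0.0]}
sft = {"layer.weight": [0.0, 1.0]}
print(interpolate_weights(rl, sft))  # roughly {'layer.weight': [0.7, 0.3]}
```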
A failed ablation used G = 4 with greedy decoding. Greedy decoding removes sampling diversity, so completions within a group were nearly always identical; group advantages were almost always zero after step 2,000, and training stalled. Raising G and temperature restored learning, a common pattern when GRPO appears “broken” but is actually starved of within-group variance.
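A stall like the one in this ablation is easy to catch with a per-batch metric: the share of groups whose rewards are not all equal. The sketch below is one way to compute it; the function names and tolerance are illustrative.

```python
def has_signal(rewards, tol=1e-8):
    """A group yields non-zero advantages only if its rewards differ."""
    return max(rewards) - min(rewards) > tol


def signal_fraction(batch_rewards):
    """Share of groups in a batch with non-zero within-group variance."""
    if not batch_rewards:
        return 0.0
    return sum(has_signal(g) for g in batch_rewards) / len(batch_rewards)


batch = [
    [0.0, 0.0, 0.0, 0.0],  # all wrong: zero advantages
    [1.0, 1.0, 1.0, 1.0],  # all right: also zero advantages
    [0.0, 1.0, 0.0, 0.0],  # mixed: usable signal
    [1.0, 0.0, 1.0, 1.0],  # mixed: usable signal
]
print(signal_fraction(batch))  # 0.5
```

Logging this value per step shows whether a flat reward curve means the policy has converged or the groups have simply stopped disagreeing.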
Method decision table
| Method | Signal type | Critic / value net | Best for |
|---|---|---|---|
| GRPO | Scalar outcome reward | No | Math, code, verifiable reasoning; memory-efficient RL |
| PPO + reward model | Learned reward model score, with per-token values from the critic | Yes | Chat alignment, multi-objective rewards, online RLHF |
| DPO / ORPO | Pairwise preferences | No | Tone, helpfulness, style; no verifier available |
| Constitutional AI / RLAIF | AI-generated preference or critique | Optional | Safety principles, scalable oversight |
Many production pipelines chain methods: SFT → GRPO on reasoning data → DPO on user preference logs. Order matters — running DPO before GRPO can lock in verbose chat style that resists shorter reasoning traces.
Common pitfalls
- Zero-variance groups: all G samples wrong, all correct, or identical; no gradient. Fix with higher temperature, larger G, or a curriculum that starts on easier prompts.
- Verifier hacking: the model learns the answer format without the reasoning; add hidden tests, human spot checks, or process reward models.
- Reward scale drift: changing verifier strictness mid-run shifts group means; version verifiers and freeze eval suites.
- Ignoring KL: the policy diverges from SFT and general chat quality collapses; monitor KL to the reference and perplexity on non-RL prompts.
- Training on public test suites: inflated HumanEval scores that fail on renamed function signatures; hold out secret tests.
- Confusing pass@G with pass@1: deployment uses a single sample unless you budget for best-of-N at inference.
Production checklist
- SFT checkpoint with stable answer format and chain-of-thought template.
- Reward function unit-tested on edge cases (empty output, malformed JSON, division by zero).
- Group size and temperature set so that more than 30% of groups contain at least one positive reward (and at least one failure) early in training.
- KL to reference tracked per step; alert if KL spikes or collapses to zero.
- Held-out verifiers disjoint from training tests; human audit slice for logical validity.
- Checkpoint selection by pass@1 on eval, not training reward average.
- Sandboxed code execution with CPU/time limits and no network egress.
- Rollback path to SFT or DPO weights if RL degrades general instruction following.
- Log sample groups (prompt + rewards + advantages) for debugging failed runs.
- Document compute budget: GRPO training cost scales with G × dataset × epochs.
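As a worked instance of the last checklist item, the G × dataset × epochs product can be computed for the Harbor Analytics settings. The sketch assumes all 28k prompts are used for training (the text does not say whether the 2k held-out prompts come out of that total) and counts sampled completions only, not tokens or GPU hours.

```python
def completions_generated(num_prompts, group_size, epochs):
    """Total sampled completions: G x dataset x epochs."""
    return int(num_prompts * group_size * epochs)


total = completions_generated(num_prompts=28_000, group_size=12, epochs=1.5)
print(total)  # 504000
```

Each of those completions also needs a verifier call and a training pass, so doubling G doubles generation, verification, and training cost together.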
Key takeaways
- GRPO optimizes LLM policies with group-relative advantages — no critic network required.
- It shines when rewards are cheap and verifiable (math, code, structured outputs).
- Within-group normalization stabilizes training across varying prompt difficulty.
- Pair GRPO with SFT cold starts and KL constraints to avoid verifier hacking and capability loss.
- Use DPO or PPO where human preferences matter and no reliable verifier exists.