Guide

LLM GEPA prompt optimization explained

Harbor Analytics maintained 140 internal policy documents covering expense caps, travel approvals, and data-retention tiers. Their RAG assistant answered questions such as “Can I expense a client dinner in Berlin?” with hand-tuned prompts and six few-shot demos copied from a spreadsheet. Accuracy on a 240-question dev set plateaued at 71%. Edge cases failed in predictable ways: multi-jurisdiction rules, effective-date clauses, and negation (“except when pre-approved”). Engineers recompiled the same DSPy ChainOfThought module with GEPA (Genetic-Pareto, a reflective prompt-evolution optimizer). Accuracy rose to 89%, and average output tokens fell 18% because the optimizer pruned verbose chain-of-thought templates that looked helpful in demos but hurt generalization.

GEPA is a DSPy optimizer (the older DSPy term is teleprompter): a compile-time procedure that evolves instructions and demonstrations using reflective mutation and Pareto-efficient search over one or more metrics. Unlike optimizers that only shuffle few-shot examples, GEPA rewrites the natural-language rules your module ships to the LM. This guide covers the GEPA loop, reflection traces, multi-objective Pareto fronts, the Harbor Analytics refactor, a comparison with other DSPy optimizers, a technique decision table that includes BootstrapFewShot and manual prompt engineering, pitfalls, and a production checklist.

What GEPA optimizes (and what it leaves alone)

In DSPy, a Module (e.g. Predict, ChainOfThought, ReAct) wraps a Signature — typed input and output fields. At runtime the module renders a prompt from:

Instructions — natural-language task description and constraints.
Demonstrations — input/output pairs prepended as few-shot examples.
LM configuration — model name, temperature, max tokens.

GEPA searches the instruction text and demonstration set jointly. It does not fine-tune model weights (see fine-tuning when you need weight updates). It does not replace retrieval — your reranker and chunking pipeline stay upstream. GEPA answers: given this program and metric, what instructions and demos should we bake in at compile time?
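To make the split between "what gets optimized" and "what stays fixed" concrete, here is a minimal sketch of how instructions, demonstrations, and live inputs could be joined into one prompt. It is plain Python, not DSPy's actual rendering code; `Demo`, `PromptCandidate`, and `render_prompt` are hypothetical names, and the policy text is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    question: str
    answer: str


@dataclass(frozen=True)
class PromptCandidate:
    """The two pieces a compile-time optimizer may rewrite (hypothetical type)."""
    instructions: str
    demos: tuple = ()


def render_prompt(candidate: PromptCandidate, context: str, question: str) -> str:
    """Join instructions, few-shot demos, and the live inputs into one prompt."""
    parts = [candidate.instructions.strip()]
    for demo in candidate.demos:
        parts.append(f"Question: {demo.question}\nAnswer: {demo.answer}")
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


# Illustrative values only; the cap below is not taken from any real policy.
seed = PromptCandidate(
    instructions="Answer policy questions using only the provided context.",
    demos=(Demo("Is lunch reimbursable?", "YES"),),
)
prompt = render_prompt(
    seed,
    context="Client dinners are capped at EUR 80.",
    question="Can I expense a client dinner in Berlin?",
)
```

In this picture, an optimizer that edits `instructions` and `demos` produces a new `PromptCandidate`; the retrieval step that supplies `context` and the LM configuration are untouched.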

The GEPA loop: evaluate, reflect, mutate, select

GEPA treats prompt optimization as evolutionary search guided by an LM critic:

Initialize a population of candidate prompts (seeded from your hand-written instruction or a minimal default).
Evaluate each candidate on a training/dev slice with your metric (exact match, F1, LLM-judge score, latency proxy, etc.).
Reflect on failures: for mispredicted examples, an LM writes a short diagnosis (“ignored effective-date clause”, “confused EUR cap with USD cap”).
Mutate instructions using reflection traces — add constraints, reorder steps, delete redundant CoT scaffolding, propose new demos.
Select candidates on a Pareto front when multiple metrics matter (accuracy vs token cost vs refusal rate).
Repeat until budget (iterations, LM calls, wall clock) exhausts; export the winning compiled module.
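The steps above can be sketched as a toy search loop. This is not the GEPA implementation: the evaluate, reflect, and mutate steps would each call an LM in practice, while here they are deterministic stubs, candidates are just sets of instruction rules, and selection uses a single score rather than a Pareto front. All names and example data are hypothetical.

```python
import random

# Toy data: each example is answered correctly only if one specific rule
# is present in the candidate's instructions.
EXAMPLES = [
    {"id": 1, "needs": "quote the effective date"},
    {"id": 2, "needs": "check the jurisdiction tag"},
    {"id": 3, "needs": "handle 'except when' clauses"},
    {"id": 4, "needs": "quote the effective date"},
]


def evaluate(rules: frozenset) -> float:
    """Fraction of examples whose required rule appears in the candidate."""
    return sum(ex["needs"] in rules for ex in EXAMPLES) / len(EXAMPLES)


def reflect(rules: frozenset) -> list:
    """Stub critic: one diagnosis (the missing rule) per failed example."""
    return [ex["needs"] for ex in EXAMPLES if ex["needs"] not in rules]


def mutate(rules: frozenset, diagnoses: list, rng: random.Random) -> frozenset:
    """Add one rule suggested by a randomly chosen diagnosis."""
    if not diagnoses:
        return rules
    return rules | {rng.choice(diagnoses)}


def optimize(seed_rules: frozenset, generations: int, survivors: int, seed: int = 0):
    rng = random.Random(seed)
    population = [seed_rules]
    for _ in range(generations):
        children = [mutate(c, reflect(c), rng) for c in population]
        # Deduplicate while keeping order, then keep the top scorers.
        pool = list(dict.fromkeys(population + children))
        pool.sort(key=evaluate, reverse=True)
        population = pool[:survivors]
    return population[0]


best = optimize(frozenset({"answer from context only"}), generations=5, survivors=2)
```

Because the best candidate always survives and each mutation of it fixes at least one failing example, five generations are enough for this toy problem to reach a candidate that passes all four examples.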

Reflection is the differentiator from pure bootstrap methods. BootstrapFewShot copies successful trajectories into demos but rarely rewrites the instruction paragraph humans wrote on day one. GEPA uses failure analysis to edit that paragraph directly — similar in spirit to Self-Refine but at compile time across hundreds of examples, not per request at inference.

Pareto-efficient search when one metric is not enough

Production systems rarely optimize accuracy alone. Harbor Analytics tracked:

Policy accuracy — answer matches gold on structured eval.
Token cost — sum of prompt + completion tokens (CoT bloat was expensive).
Hallucination rate — citations not supported by retrieved chunks.

Single-score weighted sums hide tradeoffs: a prompt that hits 92% accuracy with 3x tokens may lose to one at 89% with 1.2x tokens under a budget cap. GEPA maintains a Pareto frontier: the set of candidates that no other candidate dominates, where one candidate dominates another if it is at least as good on every metric and strictly better on at least one. Operators pick a point on the front for deployment or set hard constraints (“accuracy ≥ 85%, minimize tokens”).
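A small sketch of frontier selection over two metrics, assuming the usual dominance rule (no worse on every metric, strictly better on at least one). The 92%/3x and 89%/1.2x candidates come from the example above; the "bloated" and "baseline" rows and all function names are made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float     # higher is better
    token_ratio: float  # tokens relative to baseline; lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both metrics and better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.token_ratio <= b.token_ratio
    better = a.accuracy > b.accuracy or a.token_ratio < b.token_ratio
    return no_worse and better


def pareto_front(cands):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def pick(front, min_accuracy):
    """Hard constraint on accuracy, then minimize tokens."""
    feasible = [c for c in front if c.accuracy >= min_accuracy]
    return min(feasible, key=lambda c: c.token_ratio) if feasible else None


cands = [
    Candidate("verbose", 0.92, 3.0),
    Candidate("lean", 0.89, 1.2),
    Candidate("bloated", 0.88, 2.5),   # hypothetical; dominated by "lean"
    Candidate("baseline", 0.71, 1.0),  # hypothetical token ratio
]
front = pareto_front(cands)
choice = pick(front, min_accuracy=0.85)
```

Here the front keeps "verbose", "lean", and "baseline", and the constraint "accuracy ≥ 85%, minimize tokens" selects "lean".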

Reserve LLM-as-judge scoring for a held-out slice and keep it out of tight compile loops, where judges are noisy and expensive. Use deterministic checks (regex on dates, JSON schema match) wherever possible.
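Two deterministic checks of the kind mentioned above, sketched with the standard library only. The ISO date format and the flat key-to-type "schema" are simplifying assumptions, and the function names are hypothetical; a real harness might use a full JSON Schema validator instead.

```python
import json
import re

# Assumes dates are written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def cites_effective_date(answer: str) -> bool:
    """True if the answer contains at least one ISO-style date."""
    return ISO_DATE.search(answer) is not None


def matches_schema(raw: str, required: dict) -> bool:
    """True if raw parses as a JSON object with the required keys and value types."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # A missing key yields None, which fails every isinstance check below.
    return all(isinstance(obj.get(key), typ) for key, typ in required.items())


SCHEMA = {"answer": str, "citations": list}
```

Checks like these are cheap enough to run on every candidate in every generation, which is what makes them suitable inside the compile loop.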

GEPA vs other DSPy teleprompters

Optimizer	What it changes	Best when
BootstrapFewShot	Demonstrations only	Instructions are already strong; you need labeled trajectories as demos
BootstrapFewShotWithRandomSearch	Demo subsets + light randomness	Small dev sets; cheap compile budget
MIPROv2	Instructions + demos via Bayesian-style search	Medium programs; multi-module pipelines
GEPA	Instructions + demos via reflective mutation + Pareto selection	Instructions are wrong or stale; multi-metric tradeoffs; failure modes are interpretable

Harbor Analytics tried BootstrapFewShot first, which added 6 points of accuracy (71% to 77%). GEPA added another 12 points (77% to 89%) by rewriting the instruction block to require jurisdiction tags and effective-date quotes from context, patterns engineers had not spelled out manually.

Harbor Analytics refactor: compile setup and results

The team’s DSPy program (simplified):

Retrieve top-8 chunks from hybrid search.
ChainOfThought with Signature: context, question -> reasoning, answer, citations.
Metric: exact match on normalized answer + penalty if citations missing from context.
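A minimal sketch of a metric in the shape described above: exact match on a normalized answer, with a penalty when a citation cannot be found in the retrieved context. The normalization rules, the 0.5 penalty, and the substring test for "supported" are assumptions, not the team's actual implementation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def policy_metric(gold_answer, pred_answer, citations, context_chunks, penalty=0.5):
    """Exact match on the normalized answer, minus a penalty for unsupported citations.

    A citation counts as supported if it appears verbatim in any chunk
    (an assumption; real systems may match on chunk IDs instead).
    """
    score = 1.0 if normalize(gold_answer) == normalize(pred_answer) else 0.0
    unsupported = [c for c in citations if not any(c in chunk for chunk in context_chunks)]
    if unsupported:
        score -= penalty
    return max(score, 0.0)


# Illustrative context; the cap and date are made up.
chunks = ["Client dinners are capped at EUR 80 per person.", "Policy effective 2026-01-01."]
```

As written, an answer with no citations at all is not penalized; whether that should count as a failure is a design choice to settle before compiling.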

GEPA compile configuration:

Train slice: 180 examples; validation: 60 held out.
Reflection LM: same family as production, lower temperature for critique.
Budget: 40 candidate generations, 8 Pareto survivors per generation.
Constraints: hallucination rate must not exceed baseline; maximize accuracy.
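The configuration above could be captured in code roughly as follows. `CompileConfig` is a hypothetical container, not a DSPy class, and the seeded shuffle is one assumed way to get a reproducible 180/60 split; the hallucination constraint is shown as a simple pre-filter applied before maximizing accuracy.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CompileConfig:
    """Hypothetical container for the settings listed above."""
    train_size: int = 180
    val_size: int = 60
    generations: int = 40
    survivors_per_generation: int = 8


def split_examples(examples, cfg: CompileConfig, seed: int = 0):
    """Shuffle once with a fixed seed, then carve out disjoint train and validation slices."""
    shuffled = list(examples)
    if len(shuffled) < cfg.train_size + cfg.val_size:
        raise ValueError("not enough examples for the requested split")
    random.Random(seed).shuffle(shuffled)
    train = shuffled[: cfg.train_size]
    val = shuffled[cfg.train_size : cfg.train_size + cfg.val_size]
    return train, val


def passes_constraint(candidate_halluc_rate: float, baseline_halluc_rate: float) -> bool:
    """Hard constraint: hallucination rate must not exceed the baseline."""
    return candidate_halluc_rate <= baseline_halluc_rate


cfg = CompileConfig()
train, val = split_examples(range(240), cfg)  # integers stand in for labeled examples
```

Fixing the seed keeps the validation slice identical across compile runs, so scores from different runs stay comparable.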

Winning instruction diff (conceptual): removed “think step by step about everything”; added “quote the effective date if multiple policy versions appear” and “answer NO if jurisdiction tag does not match user locale field.” The demo set shrank from six generic examples to four targeted failure recoveries that GEPA discovered. The team deployed the compiled module to production under the versioned prompt registry hash gepa-2026-06-v3.

Technique decision table

Approach	Best when	Skip when
GEPA compile	DSPy program exists; 100+ labeled examples; instructions likely suboptimal; multi-metric goals	No eval set; metric is vibes-only; compile LM budget unavailable
Manual prompt engineering	<20 eval cases; one-shot prototype; compliance forbids automated mutation logs	Quality plateaus across model upgrades; team lacks time for regression sweeps
BootstrapFewShot only	Instructions expert-written; only demos missing	Systematic failure themes (dates, negation, units) persist after demo boost
Fine-tuning	Thousands of labels; frozen prompt; latency-critical single-pass generation	Policies change weekly; retrieval context must drive answers
Per-request Self-Refine	High-stakes single queries; compile-time generalization insufficient	p95 latency budget <2s; cost per query dominates

Metrics that make GEPA work

GEPA is only as honest as your metric. Strong patterns:

Decompose — separate scores for retrieval hit, answer correctness, citation faithfulness.
Stratify — report negation, date, and unit buckets; optimizers overfit to head topics.
Hold out — never reflect on validation examples; leakage inflates compile scores.
Version lock — freeze retrieval index and LM vendor for compile vs A/B in prod.
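The "Stratify" pattern can be as simple as grouping results by failure tag before averaging. A minimal sketch, assuming each result is a `(bucket, is_correct)` pair; the sample results are made up.

```python
from collections import defaultdict


def accuracy_by_bucket(results):
    """results: iterable of (bucket, is_correct) pairs. Returns {bucket: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for bucket, ok in results:
        totals[bucket] += 1
        correct[bucket] += int(ok)
    return {bucket: correct[bucket] / totals[bucket] for bucket in totals}


# Illustrative results only.
results = [
    ("negation", True), ("negation", False),
    ("date", True), ("date", True),
    ("unit", False),
]
report = accuracy_by_bucket(results)
overall = sum(ok for _, ok in results) / len(results)
```

In this sample the overall accuracy of 0.6 hides a "unit" bucket at 0.0, which is the kind of gap a single aggregate score would miss.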

For RAG programs, add a retrieval recall gate: if gold chunk is outside top-k, mark example as retrieval failure so GEPA does not blame the generator for bad context. See RAG evaluation for end-to-end harness design.
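A sketch of the recall gate, assuming each example carries the ID of its gold chunk and the ranked list of retrieved chunk IDs. The function name and label strings are hypothetical.

```python
def classify_example(gold_chunk_id, retrieved_ids, k, answer_correct):
    """Label an example so retrieval errors and generator errors are counted separately."""
    if gold_chunk_id not in retrieved_ids[:k]:
        # The generator never saw the gold chunk, so do not score it on this example.
        return "retrieval_failure"
    return "correct" if answer_correct else "generator_failure"


def generator_examples(labeled):
    """Keep only the examples that should feed reflection on the generator."""
    return [ex_id for ex_id, label in labeled if label != "retrieval_failure"]
```

Only examples labeled "generator_failure" should feed reflection; "retrieval_failure" examples belong in the retrieval team's queue.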

Common pitfalls

Tiny train sets — reflection overfits anecdotes; aim for 100+ diverse failures.
Judge-only metrics — GEPA games LLM judges; anchor with deterministic checks.
Compile on production LM, deploy on mini — instruction tricks may not transfer.
Unbounded CoT growth — mutations add “think more” steps; cap reasoning tokens in metric.
Ignoring retrieval — 89% generator accuracy with 40% recall@8 is still a broken product.
No prompt registry — compiled artifacts drift from git; cannot reproduce incidents.
Single-metric selection from Pareto front — document why you picked accuracy over cost.
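One way to guard against the unbounded CoT growth pitfall is to fold a reasoning-token cap into the score. This is a sketch under assumed numbers: the 200-token cap and the 0.1 penalty per 100 overflow tokens are placeholders to tune, not recommended values.

```python
def capped_score(correct: bool, reasoning_tokens: int,
                 cap: int = 200, penalty_per_100: float = 0.1) -> float:
    """Correctness score minus a linear penalty for reasoning tokens above the cap."""
    base = 1.0 if correct else 0.0
    overflow = max(0, reasoning_tokens - cap)
    return max(0.0, base - penalty_per_100 * overflow / 100)
```

With these defaults, a correct answer that uses 400 reasoning tokens scores 0.8, so a mutation that adds "think more" steps has to buy enough accuracy to pay for its own length.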

Production checklist

Express pipeline as DSPy Modules with explicit Signatures before compiling.
Build 100+ labeled examples with stratified failure tags (negation, dates, units).
Define 2–3 metrics for Pareto search (accuracy, cost, faithfulness).
Hold out 20–30% validation never seen during reflection.
Run BootstrapFewShot baseline; record lift before GEPA spend.
Cap reflection/mutation LM calls per compile job; log total token spend.
Export compiled module + instruction hash to prompt registry.
Re-compile on LM vendor upgrades; treat as regression gate.
Monitor production slices matching train buckets weekly.
Pair compile-time GEPA with runtime Corrective RAG only if retrieval errors dominate post-deploy.
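For the registry item in the checklist, a content hash over the compiled instructions and demos gives a reproducible identifier. A minimal sketch using the standard library; the payload layout is an assumption, and `prompt_hash` is a hypothetical helper rather than part of DSPy.

```python
import hashlib
import json


def prompt_hash(instructions: str, demos: list) -> str:
    """Stable SHA-256 over the compiled instruction text and demos."""
    payload = json.dumps(
        {"instructions": instructions, "demos": demos},
        sort_keys=True,          # key order must not change the hash
        ensure_ascii=False,
        separators=(",", ":"),   # no whitespace differences between runs
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


demos = [{"question": "Is lunch reimbursable?", "answer": "YES"}]
h1 = prompt_hash("Quote the effective date if multiple policy versions appear.", demos)
h2 = prompt_hash("Quote the effective date if multiple policy versions appear.", demos)
h3 = prompt_hash("Think step by step about everything.", demos)
```

Storing the hash next to a human-readable tag lets an incident review confirm that the prompt in production is byte-for-byte the one that was evaluated.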

Key takeaways

GEPA evolves instructions and demos via reflective mutation, not just few-shot bootstrapping.
Pareto search handles accuracy vs cost vs faithfulness without hiding tradeoffs in one weighted score.
Harbor Analytics raised policy Q&A accuracy from 71% to 89% while cutting output tokens 18%.
GEPA complements DSPy Modules; it does not replace retrieval tuning or weight fine-tuning.
Honest metrics and held-out validation determine whether compile gains survive production.