LLM GEPA prompt optimization explained
Harbor Analytics maintained 140 internal policy documents covering expense caps, travel approvals, and data-retention tiers. Their RAG assistant answered questions such as “Can I expense a client dinner in Berlin?” with hand-tuned prompts and six few-shot demos copied from a spreadsheet. Accuracy on a 240-question dev set plateaued at 71%. Edge cases failed in predictable ways: multi-jurisdiction rules, effective-date clauses, and negation (“except when pre-approved”). Engineers recompiled the same DSPy ChainOfThought module with GEPA (Genetic-Pareto). Accuracy rose to 89%, and average output tokens fell 18% because the optimizer pruned verbose chain-of-thought templates that looked helpful in demos but hurt generalization.
GEPA is a DSPy optimizer (historically called a teleprompter) that runs at compile time and evolves instructions and demonstrations using reflective mutation and Pareto-efficient search over your metric(s). Unlike optimizers that only shuffle few-shot examples, GEPA rewrites the natural-language rules your module ships to the LM. This guide covers the GEPA loop, reflection traces, multi-objective Pareto fronts, the Harbor Analytics refactor, a technique decision table versus BootstrapFewShot and manual prompt engineering, pitfalls, and a production checklist.
What GEPA optimizes (and what it leaves alone)
In DSPy, a Module (e.g. Predict, ChainOfThought, ReAct) wraps a Signature: typed input and output fields. At runtime the module renders a prompt from:
- Instructions — natural-language task description and constraints.
- Demonstrations — input/output pairs prepended as few-shot examples.
- LM configuration — model name, temperature, max tokens.
GEPA searches the instruction text and demonstration set jointly. It does not fine-tune model weights (see fine-tuning when you need weight updates). It does not replace retrieval — your reranker and chunking pipeline stay upstream. GEPA answers: given this program and metric, what instructions and demos should we bake in at compile time?
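To make the searchable pieces concrete, here is a minimal sketch of a prompt assembled from instructions and demonstrations. It is not DSPy's actual rendering code; `Demo`, `PromptCandidate`, and `render_prompt` are hypothetical names used only for illustration, and the prompt layout is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    question: str
    answer: str


@dataclass(frozen=True)
class PromptCandidate:
    """Hypothetical container for the two pieces the optimizer edits."""
    instructions: str
    demos: tuple = ()


def render_prompt(candidate: PromptCandidate, context: str, question: str) -> str:
    """Join instructions, few-shot demos, and the live input into one prompt."""
    parts = [candidate.instructions.strip(), ""]
    for demo in candidate.demos:
        parts += [f"Question: {demo.question}", f"Answer: {demo.answer}", ""]
    parts += [f"Context: {context}", f"Question: {question}", "Answer:"]
    return "\n".join(parts)


candidate = PromptCandidate(
    instructions="Answer from the policy context only.",
    demos=(Demo("Q1", "A1"),),
)
prompt = render_prompt(candidate, "ctx", "Can I expense dinner?")
```

An optimizer that works on this representation changes only `instructions` and `demos`; the retrieval step that produces `context` and the LM configuration stay outside the search.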
The GEPA loop: evaluate, reflect, mutate, select
GEPA treats prompt optimization as evolutionary search guided by an LM critic:
- Initialize a population of candidate prompts (seeded from your hand-written instruction or a minimal default).
- Evaluate each candidate on a training/dev slice with your metric (exact match, F1, LLM-judge score, latency proxy, etc.).
- Reflect on failures: for mispredicted examples, an LM writes a short diagnosis (“ignored effective-date clause”, “confused EUR cap with USD cap”).
- Mutate instructions using reflection traces — add constraints, reorder steps, delete redundant CoT scaffolding, propose new demos.
- Select candidates on a Pareto front when multiple metrics matter (accuracy vs token cost vs refusal rate).
- Repeat until budget (iterations, LM calls, wall clock) exhausts; export the winning compiled module.
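The steps above can be sketched as a small loop. This is a toy under stated assumptions, not GEPA's implementation: the LM calls are replaced by hypothetical stub functions (`toy_program`, `toy_reflect`, `toy_mutate`), only instructions are mutated, and selection keeps the top candidates on a single accuracy score instead of a Pareto front.

```python
from typing import Callable


def evolve_instruction(
    seed: str,
    trainset: list,
    run_program: Callable[[str, dict], str],
    reflect: Callable[[str, list], str],
    mutate: Callable[[str, str], str],
    generations: int = 5,
    survivors: int = 2,
) -> str:
    def failures(instruction):
        return [ex for ex in trainset if run_program(instruction, ex) != ex["gold"]]

    def accuracy(instruction):
        return 1.0 - len(failures(instruction)) / len(trainset)

    population = [seed]                                    # initialize
    for _ in range(generations):                           # repeat until budget
        children = []
        for instruction in population:
            failed = failures(instruction)                 # evaluate
            if failed:
                diagnosis = reflect(instruction, failed)   # reflect
                children.append(mutate(instruction, diagnosis))  # mutate
        pool = list(dict.fromkeys(population + children))  # de-duplicate, keep order
        population = sorted(pool, key=accuracy, reverse=True)[:survivors]  # select
    return population[0]


# Toy stand-ins so the sketch runs without an LM (all hypothetical).
def toy_program(instruction, ex):
    # Succeeds only if the instruction mentions the rule this example needs.
    return ex["gold"] if ex["needs"] in instruction else "WRONG"


def toy_reflect(instruction, failed):
    return failed[0]["needs"]  # the "diagnosis" is simply the missing rule


def toy_mutate(instruction, diagnosis):
    return f"{instruction} Always check the {diagnosis}."


trainset = [
    {"question": "q1", "gold": "NO", "needs": "effective date"},
    {"question": "q2", "gold": "YES", "needs": "jurisdiction tag"},
]
best = evolve_instruction(
    "Answer the policy question.", trainset, toy_program, toy_reflect, toy_mutate
)
```

In a real compile job the three callables would wrap LM calls, and the budget would also count those calls, not just generations.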
Reflection is the differentiator from pure bootstrap methods. BootstrapFewShot copies successful trajectories into demos but does not rewrite the instruction paragraph humans wrote on day one. GEPA uses failure analysis to edit that paragraph directly. This is similar in spirit to Self-Refine, but it happens at compile time across hundreds of examples rather than per request at inference.
Pareto-efficient search when one metric is not enough
Production systems rarely optimize accuracy alone. Harbor Analytics tracked:
- Policy accuracy — answer matches gold on structured eval.
- Token cost — sum of prompt + completion tokens (CoT bloat was expensive).
- Hallucination rate — citations not supported by retrieved chunks.
Single-score weighted sums hide tradeoffs: a prompt that hits 92% accuracy with 3x tokens may lose to 89% at 1.2x tokens under a budget cap. GEPA maintains a Pareto frontier: the set of candidates that no other candidate dominates, where a dominating candidate is at least as good on every metric and strictly better on at least one. Operators pick a point on the front for deployment or set hard constraints (“accuracy ≥ 85%, minimize tokens”).
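A minimal sketch of Pareto filtering plus a hard-constraint pick may help here. The `Candidate` class and helper names are hypothetical, and apart from the 92%/3x and 89%/1.2x pair quoted above, the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float       # higher is better
    tokens: float         # relative token cost, lower is better
    hallucination: float  # lower is better


def _as_maximize(c):
    # Flip the "lower is better" metrics so every axis is maximized.
    return (c.accuracy, -c.tokens, -c.hallucination)


def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    va, vb = _as_maximize(a), _as_maximize(b)
    return all(x >= y for x, y in zip(va, vb)) and any(x > y for x, y in zip(va, vb))


def pareto_front(candidates):
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def pick_under_constraint(front, min_accuracy):
    """Cheapest candidate that clears the accuracy floor, or None."""
    eligible = [c for c in front if c.accuracy >= min_accuracy]
    return min(eligible, key=lambda c: c.tokens) if eligible else None


candidates = [
    Candidate("verbose", 0.92, 3.0, 0.04),
    Candidate("targeted", 0.89, 1.2, 0.03),
    Candidate("middling", 0.85, 1.5, 0.05),  # dominated by "targeted"
    Candidate("terse", 0.80, 1.0, 0.06),
]
front = pareto_front(candidates)
choice = pick_under_constraint(front, min_accuracy=0.85)
```

With these illustrative values, "middling" drops off the front because "targeted" beats it on every axis, and the constraint pick selects "targeted" as the cheapest candidate at or above 85% accuracy.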
Pair scalar metrics with LLM-as-judge only on a held-out slice; judges are noisy and expensive inside tight compile loops. Use deterministic checks (regex on dates, JSON schema match) wherever possible.
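Two deterministic checks of the kind mentioned above might look like the following sketch. The function names, the ISO date format, and the required-keys schema shape are assumptions for illustration; a real policy corpus may use other date formats.

```python
import json
import re

ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def quotes_effective_date(answer: str, context: str) -> bool:
    """True if the answer contains an ISO date that also appears in the context."""
    context_dates = set(ISO_DATE.findall(context))
    return any(d in context_dates for d in ISO_DATE.findall(answer))


def matches_schema(raw: str, required: dict) -> bool:
    """True if raw is a JSON object with every required key of the expected type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(isinstance(obj.get(key), typ) for key, typ in required.items())


context = "Policy v2 effective 2025-03-01 replaces v1 (2024-01-15)."
schema = {"answer": str, "citations": list}
```

Checks like these cost no LM calls, so they can run on every candidate in every generation, leaving the judge for the held-out slice.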
GEPA vs other DSPy teleprompters
| Optimizer | What it changes | Best when |
|---|---|---|
| BootstrapFewShot | Demonstrations only | Instructions are already strong; you need labeled trajectories as demos |
| BootstrapFewShotWithRandomSearch | Demo subsets + light randomness | Small dev sets; cheap compile budget |
| MIPROv2 | Instructions + demos via Bayesian-style search | Medium programs; multi-module pipelines |
| GEPA | Instructions + demos via reflective mutation + Pareto selection | Instructions are wrong or stale; multi-metric tradeoffs; failure modes are interpretable |
Harbor Analytics tried BootstrapFewShot first (+6 points accuracy). GEPA added another +12 by rewriting the instruction block to require jurisdiction tags and effective-date quotes from context — patterns engineers had not spelled out manually.
Harbor Analytics refactor: compile setup and results
The team’s DSPy program (simplified):
- Retrieve top-8 chunks from hybrid search.
- ChainOfThought with Signature: context, question -> reasoning, answer, citations.
- Metric: exact match on normalized answer + penalty if citations missing from context.
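One way such a metric could be written is sketched below. The normalization rules, the 0.5 penalty, and the choice to penalize an empty citation list are all assumptions; the source only states exact match plus a citation penalty.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def policy_metric(gold_answer, pred_answer, pred_citations, context_chunk_ids,
                  citation_penalty=0.5):
    """Exact match on normalized answers, reduced if citations are unsupported."""
    if normalize(gold_answer) != normalize(pred_answer):
        return 0.0
    unsupported = [c for c in pred_citations if c not in context_chunk_ids]
    if not pred_citations or unsupported:
        # Assumed policy: missing or unsupported citations cost a fixed penalty.
        return 1.0 - citation_penalty
    return 1.0
```

Keeping the answer check and the citation check as separate branches makes it easy to report them as separate scores later.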
GEPA compile configuration:
- Train slice: 180 examples; validation: 60 held out.
- Reflection LM: same family as production, lower temperature for critique.
- Budget: 40 candidate generations, 8 Pareto survivors per generation.
- Constraints: hallucination rate must not exceed baseline; maximize accuracy.
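The 180/60 split of the 240-question set can be reproduced with a seeded shuffle, as in this sketch. The function name and the placeholder records are hypothetical; a stratified split by failure tag would be a reasonable refinement.

```python
import random


def split_train_val(examples, val_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then carve off the validation slice."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


examples = [{"id": i} for i in range(240)]  # placeholder records, not real data
train, val = split_train_val(examples)
```

Fixing the seed keeps the held-out slice identical across compile runs, so validation scores stay comparable.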
Winning instruction diff (conceptual): removed “think step by step about everything”; added “quote the effective date if multiple policy versions appear” and “answer NO if jurisdiction tag does not match user locale field.” Demos dropped from six generic examples to four targeted failure recoveries that GEPA discovered. The team deployed the compiled module to production and recorded it in a versioned prompt registry under hash gepa-2026-06-v3.
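A registry entry can pair a human-readable label with a content fingerprint so that any change to the compiled prompt is detectable. The sketch below is an assumption about how such a registry might work, not Harbor Analytics' actual tooling; `prompt_fingerprint`, the registry dict, and the LM name are hypothetical.

```python
import hashlib
import json


def prompt_fingerprint(instructions: str, demos: list, lm_name: str) -> str:
    """Short, stable SHA-256 fingerprint of everything that defines the prompt."""
    payload = json.dumps(
        {"instructions": instructions, "demos": demos, "lm": lm_name},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


demos = [{"question": "x", "answer": "y"}]
fp_a = prompt_fingerprint("Quote the effective date.", demos, "some-lm")
fp_b = prompt_fingerprint("Quote the effective date.", demos, "some-lm")
fp_c = prompt_fingerprint("Quote the effective date!", demos, "some-lm")

# Hypothetical in-memory registry keyed by release label.
registry = {"gepa-2026-06-v3": fp_a}
```

Serializing with `sort_keys=True` keeps the fingerprint independent of dictionary key order, so two identical prompts always hash the same.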
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| GEPA compile | DSPy program exists; 100+ labeled examples; instructions likely suboptimal; multi-metric goals | No eval set; metric is vibes-only; compile LM budget unavailable |
| Manual prompt engineering | <20 eval cases; one-shot prototype; compliance forbids automated mutation logs | Quality plateaus across model upgrades; team lacks time for regression sweeps |
| BootstrapFewShot only | Instructions expert-written; only demos missing | Systematic failure themes (dates, negation, units) persist after demo boost |
| Fine-tuning | Thousands of labels; frozen prompt; latency-critical single-pass generation | Policies change weekly; retrieval context must drive answers |
| Per-request Self-Refine | High-stakes single queries; compile-time generalization insufficient | p95 latency budget <2s; cost per query dominates |
Metrics that make GEPA work
GEPA is only as honest as your metric. Strong patterns:
- Decompose — separate scores for retrieval hit, answer correctness, citation faithfulness.
- Stratify — report negation, date, and unit buckets; optimizers overfit to head topics.
- Hold out — never reflect on validation examples; leakage inflates compile scores.
- Version lock — freeze retrieval index and LM vendor for compile vs A/B in prod.
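The stratified reporting pattern can be sketched in a few lines. The bucket names follow the guide; the function name and the sample results are illustrative.

```python
from collections import defaultdict


def bucket_accuracy(results):
    """results: iterable of (bucket, is_correct). Returns {bucket: (accuracy, n)}."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [hits, count]
    for bucket, ok in results:
        totals[bucket][0] += int(ok)
        totals[bucket][1] += 1
    return {b: (hits / n, n) for b, (hits, n) in totals.items()}


results = [
    ("negation", True),
    ("negation", False),
    ("date", True),
    ("unit", False),
]
report = bucket_accuracy(results)
```

Reporting the count next to each accuracy makes thin buckets visible, which helps when a headline number hides a weak tail.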
For RAG programs, add a retrieval recall gate: if gold chunk is outside top-k, mark example as retrieval failure so GEPA does not blame the generator for bad context. See RAG evaluation for end-to-end harness design.
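A retrieval recall gate of this kind could be sketched as follows. The field names (`gold_chunk`, `retrieved`, `correct`) and the three labels are assumptions chosen for the example.

```python
def label_example(gold_chunk_id, retrieved_ids, answer_correct, k=8):
    """Attribute a failure to retrieval before blaming the generator."""
    if gold_chunk_id not in retrieved_ids[:k]:
        return "retrieval_failure"
    return "ok" if answer_correct else "generator_failure"


def generator_examples(examples, k=8):
    """Keep only examples where the gold chunk was actually retrieved."""
    return [
        ex for ex in examples
        if label_example(ex["gold_chunk"], ex["retrieved"], ex["correct"], k)
        != "retrieval_failure"
    ]


examples = [
    {"gold_chunk": "c1", "retrieved": ["c1", "c2"], "correct": True},
    {"gold_chunk": "c3", "retrieved": ["c1", "c2"], "correct": False},
    {"gold_chunk": "c2", "retrieved": ["c1", "c2"], "correct": False},
]
kept = generator_examples(examples)
```

Only the third example counts as a generator failure here; the second is excluded from reflection because no instruction change could have fixed missing context.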
Common pitfalls
- Tiny train sets — reflection overfits anecdotes; aim for 100+ diverse failures.
- Judge-only metrics — GEPA games LLM judges; anchor with deterministic checks.
- Compile on production LM, deploy on mini — instruction tricks may not transfer.
- Unbounded CoT growth — mutations add “think more” steps; cap reasoning tokens in metric.
- Ignoring retrieval — 89% generator accuracy with 40% recall@8 is still a broken product.
- No prompt registry — compiled artifacts drift from git; cannot reproduce incidents.
- Single-metric selection from Pareto front — document why you picked accuracy over cost.
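For the unbounded CoT growth pitfall, one option is to fold a reasoning-length cap into the score. This sketch assumes whitespace-separated words as a rough token proxy and a linear penalty; the cap of 150 and the function name are illustrative choices, not values from the Harbor Analytics setup.

```python
def length_capped_score(correct: bool, reasoning: str, cap: int = 150) -> float:
    """Full credit within the cap, linearly reduced credit beyond it."""
    if not correct:
        return 0.0
    n_tokens = len(reasoning.split())  # crude proxy for LM tokens
    if n_tokens <= cap:
        return 1.0
    overflow = (n_tokens - cap) / cap
    return max(0.0, 1.0 - overflow)


short = " ".join(["w"] * 100)
longer = " ".join(["w"] * 225)
bloated = " ".join(["w"] * 400)
```

With this shape, a correct answer that runs 50% over the cap earns half credit, so mutations that only add "think more" steps lose ground during selection.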
Production checklist
- Express pipeline as DSPy Modules with explicit Signatures before compiling.
- Build 100+ labeled examples with stratified failure tags (negation, dates, units).
- Define 2–3 metrics for Pareto search (accuracy, cost, faithfulness).
- Hold out 20–30% validation never seen during reflection.
- Run BootstrapFewShot baseline; record lift before GEPA spend.
- Cap reflection/mutation LM calls per compile job; log total token spend.
- Export compiled module + instruction hash to prompt registry.
- Re-compile on LM vendor upgrades; treat as regression gate.
- Monitor production slices matching train buckets weekly.
- Pair compile-time GEPA with runtime Corrective RAG only if retrieval errors dominate post-deploy.
Key takeaways
- GEPA evolves instructions and demos via reflective mutation, not just few-shot bootstrapping.
- Pareto search handles accuracy vs cost vs faithfulness without hiding tradeoffs in one weighted score.
- Harbor Analytics rose from 71% to 89% policy Q&A accuracy while cutting output tokens 18%.
- GEPA complements DSPy Modules; it does not replace retrieval tuning or weight fine-tuning.
- Honest metrics and held-out validation determine whether compile gains survive production.
Related reading
- DSPy fundamentals explained — Signatures, Modules, and teleprompter basics
- LLM Self-Refine explained — per-request reflection loops at inference time
- LLM prompt versioning registry explained — ship compiled artifacts safely
- RAG evaluation explained — metrics harnesses for compile and prod