LLM self-refine explained
Harbor Analytics' natural-language-to-SQL assistant shipped a query that double-counted refunds: the model joined orders to refunds without deduplicating partial returns. The first draft looked plausible, with correct table names and readable aliases, but failed on a hidden test suite. A single chain-of-thought pass did not catch the join bug. Adding a self-refine loop (Madaan et al., 2023), which generates SQL, runs it against a sandbox, feeds errors and row-count mismatches back as feedback, and revises, lifted the pass rate on 180 held-out questions from 61% to 79% at roughly 2.4× token cost. Self-refine is not fact-checking of policy claims; it is iterative quality improvement on a single artifact until objective or rubric-based critique says stop.
Unlike self-consistency, which samples independent answers and votes, self-refine edits one draft in place using structured feedback. Unlike chain-of-verification, which decomposes factual claims into isolated Q&A, self-refine optimizes holistic properties: runnable code, tone, length, schema compliance, or aggregate metrics. This guide covers the draft–feedback–revise pattern, Reflexion-style episodic memory for agents, feedback sources (model, tests, humans), the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.
What self-refine is
Self-refine is a multi-turn inference pattern with three roles played by the same or different models:
- Generate: produce an initial output (code, SQL, email, plan, summary) from the user prompt and context.
- Feedback: critique the draft against explicit criteria: unit tests, linters, execution traces, rubric scores, or a natural-language review prompt listing defects.
- Refine: revise the draft conditioned on feedback, without discarding useful parts of the prior version unless the critique demands it.
The loop repeats until feedback is empty, a quality threshold is met, or a max iteration cap (typically 2–4 rounds) is hit. No gradient updates occur; improvement comes entirely from additional forward passes and external validators. The pattern generalizes across modalities as long as you can produce machine-readable feedback the model can act on.
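The loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation from Madaan et al.: `self_refine`, `toy_generate`, `toy_feedback`, and `toy_refine` are hypothetical names, and the toy functions stand in for model calls so the example runs without an LLM.

```python
from typing import Callable, List, Tuple


def self_refine(
    prompt: str,
    generate: Callable[[str], str],
    feedback: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    max_iters: int = 3,
) -> Tuple[str, int]:
    """Run generate -> feedback -> refine until feedback is empty or the cap is hit.

    Returns the final draft and the number of refine rounds used.
    """
    draft = generate(prompt)
    for rounds in range(max_iters):
        issues = feedback(draft)
        if not issues:  # empty feedback: stop early
            return draft, rounds
        draft = refine(draft, issues)
    # Cap reached: the last revision is returned without a further feedback pass.
    return draft, max_iters


# Toy stand-ins (assumptions for illustration only, no model involved).
def toy_generate(prompt: str) -> str:
    return "select * from orders"


def toy_feedback(draft: str) -> List[str]:
    issues = []
    if draft != draft.upper():
        issues.append("use uppercase keywords")
    if not draft.endswith(";"):
        issues.append("terminate statement with ';'")
    return issues


def toy_refine(draft: str, issues: List[str]) -> str:
    # Address only the first issue per round and keep the rest of the draft.
    if issues[0] == "use uppercase keywords":
        return draft.upper()
    return draft + ";"


final, rounds = self_refine("list orders", toy_generate, toy_feedback, toy_refine)
print(final, rounds)
```

In a real system the three callables would wrap model calls or validators; keeping them as plain function arguments makes it easy to swap a model critic for an executable one.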
Feedback sources: who critiques the draft?
Model-as-critic (self-critique)
The same LLM receives the draft plus a rubric (“List concrete errors in logic, missing edge cases, and style violations”). This is cheap to wire, but the critic and generator may share blind spots: both may miss the same join bug. Mitigate with temperature diversity (a higher temperature on the feedback pass) or a separate judge model.
Executable feedback (strongest for code and SQL)
Run the artifact: pytest, TypeScript compiler, SQL engine against a scratch schema, JSON schema validator. Failures become verbatim feedback strings (“column refund_id ambiguous”). Deterministic and auditable; requires sandboxing and timeout guards.
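As a rough sketch of executable feedback for SQL, the example below runs a candidate query against a throwaway in-memory SQLite database and turns failures into feedback strings. The schema, the fixture rows, and the `execution_feedback` helper are all assumptions made for illustration; a production sandbox would also need the timeouts and resource limits mentioned above, which this sketch omits.

```python
import sqlite3
from typing import List

# Hypothetical scratch schema: order 1 has two partial refunds.
SCHEMA = """
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE refunds (refund_id INTEGER PRIMARY KEY, order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO refunds VALUES (10, 1, 20.0), (11, 1, 30.0);
"""


def execution_feedback(sql: str, expected_rows: int) -> List[str]:
    """Run sql on a fresh in-memory database and return feedback strings.

    An empty list means the query executed and matched the expected row count.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Pass the engine's message through verbatim as feedback.
            return [f"execution error: {exc}"]
        if len(rows) != expected_rows:
            return [f"row count {len(rows)} != expected {expected_rows}"]
        return []
    finally:
        conn.close()


# Joining without collapsing partial refunds lists order 1 twice.
naive = (
    "SELECT o.order_id, o.amount FROM orders o "
    "JOIN refunds r ON r.order_id = o.order_id"
)
# Deduplicating refunds per order before the join returns one row per order.
fixed = (
    "SELECT o.order_id, o.amount FROM orders o "
    "JOIN (SELECT DISTINCT order_id FROM refunds) r ON r.order_id = o.order_id"
)

print(execution_feedback(naive, expected_rows=1))
print(execution_feedback(fixed, expected_rows=1))
```

The row-count check only works when a fixture with a known answer exists; without one, only the execution-error branch applies.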
Reward and outcome signals
Scalar scores from outcome reward models or business metrics (click-through on generated titles). Use when no single test defines correctness but you can rank versions.
Human-in-the-loop refinement
Editors mark up drafts; the model revises from comments. Highest quality for brand-sensitive copy; does not scale to high-volume automation without sampling.
Self-refine vs Reflexion and agent memory
Reflexion (Shinn et al., 2023) extends self-refine across episodes in an agent loop: after a failed web-navigation or coding task, the model writes a short verbal reflection (“I should check the cart total before checkout”) stored in episodic memory and prepended to the next attempt's prompt. Self-refine operates within one user request; Reflexion carries lessons to the next trial of a similar task.
Pair them in production agents: self-refine polishes the current tool call or code block; Reflexion memory reduces repeated mistakes across sessions. Cap memory size and summarize old reflections to avoid context bloat. See agent memory for vector vs scratchpad storage tradeoffs.
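One way to picture capped episodic memory is a small bounded store whose contents are prepended to the next attempt's prompt. The sketch below is an assumption-laden illustration rather than the Reflexion authors' code: `ReflectionMemory` is a hypothetical class, and it simply evicts the oldest reflection when full instead of summarizing old ones.

```python
from collections import deque
from typing import Deque


class ReflectionMemory:
    """Bounded store of short verbal reflections (hypothetical helper)."""

    def __init__(self, max_items: int = 3) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._items: Deque[str] = deque(maxlen=max_items)

    def add(self, reflection: str) -> None:
        self._items.append(reflection.strip())

    def build_prompt(self, task: str) -> str:
        """Prepend stored lessons to the task text for the next attempt."""
        if not self._items:
            return task
        lessons = "\n".join(f"- {r}" for r in self._items)
        return f"Lessons from earlier attempts:\n{lessons}\n\nTask: {task}"


memory = ReflectionMemory(max_items=2)
memory.add("I should check the cart total before checkout")
print(memory.build_prompt("Buy two notebooks"))
```

Eviction keeps the context size bounded but loses old lessons; replacing it with a summarization step is the tradeoff the paragraph above points at.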
Designing effective feedback prompts
Vague feedback (“make it better”) produces vague revisions. Strong feedback prompts are:
- Specific: cite line numbers, failing test names, or rubric dimensions (accuracy, brevity, tone).
- Actionable: each bullet maps to one edit (“Add DISTINCT on order_id before join”).
- Prioritized: blockers first (syntax errors), polish last (word choice).
- Bounded: cap feedback items (5–8) so refine prompts stay focused.
Use structured outputs for feedback JSON: {"issues": [{"severity": "error", "location": "JOIN", "fix": "..."}]}. The refine step consumes the same schema so orchestration code can halt on zero error-severity items.
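A minimal sketch of how orchestration code might consume that schema follows. The `issues`, `severity`, `location`, and `fix` fields come from the example above; the `warning` and `style` severity levels, the ordering, and the helper names are assumptions for illustration.

```python
import json
from typing import Dict, List

# Assumed severity levels; only "error" appears in the schema example.
SEVERITY_ORDER = {"error": 0, "warning": 1, "style": 2}
MAX_ISSUES = 8  # upper end of the 5-8 cap on feedback items


def parse_feedback(raw: str) -> List[Dict[str, str]]:
    """Parse critic output, put blockers first, and cap the list length."""
    issues = json.loads(raw)["issues"]
    # Stable sort: unknown severities go last, original order kept within a level.
    issues.sort(key=lambda i: SEVERITY_ORDER.get(i["severity"], len(SEVERITY_ORDER)))
    return issues[:MAX_ISSUES]


def should_halt(issues: List[Dict[str, str]]) -> bool:
    """Stop refining once no error-severity items remain."""
    return not any(i["severity"] == "error" for i in issues)


raw = json.dumps({
    "issues": [
        {"severity": "style", "location": "SELECT", "fix": "rename alias"},
        {"severity": "error", "location": "JOIN", "fix": "deduplicate refunds before join"},
    ]
})
issues = parse_feedback(raw)
print(issues[0]["location"], should_halt(issues))
```

Sorting before truncating means that if the critic returns more than the cap, it is the low-severity items that get dropped.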
Self-refine vs other inference patterns
| Approach | Mechanism | Best when | Weak when |
|---|---|---|---|
| Self-refine | Sequential edit from critique | Artifact has testable quality (code, SQL, structured prose) | No objective feedback signal |
| Chain-of-verification | Atomic fact Q&A, then revise claims | Multi-claim factual answers | Holistic style or algorithmic correctness |
| Self-consistency | N independent samples, vote | Discrete answers, diverse reasoning helps | Long artifacts where voting is ill-defined |
| Best-of-N + reward model | Sample N, pick highest score | Cheap scoring, no sequential dependency | Revisions must build on prior partial work |
| Test-time compute search | Tree/beam over partial solutions | Combinatorial planning, math proofs | Simple single-shot fixes suffice |
Stacks combine layers: self-consistency picks among final self-refined candidates; CoVe runs on user-facing prose after SQL passes execution tests. Read test-time compute for the broader inference-scaling menu.
Harbor Analytics SQL assistant refactor
The NL→SQL path before refactor: retrieve schema snippets, one CoT completion, return SQL to the analyst. After refactor:
- Draft — model outputs SQL plus brief rationale in JSON.
- Execute — read-only replica sandbox with 5 s timeout and row-limit guard.
- Feedback — compiler errors, row-count vs expected fixture, or model critic on result shape (“returns 40k rows for a single-SKU question”).
- Refine — up to 3 iterations; each refine sees prior SQL and cumulative feedback, not the full chat history.
- Escalate — if still failing, return best attempt plus error log to the analyst instead of silent wrong answers.
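The refine step's input can be sketched as a small prompt builder that carries only the prior SQL and the accumulated feedback. This is an illustrative guess at the shape of such a prompt, not Harbor Analytics' actual template; `build_refine_prompt` and the prompt wording are hypothetical.

```python
from typing import List


def build_refine_prompt(question: str, prior_sql: str, feedback_log: List[str]) -> str:
    """Assemble a refine prompt from the latest SQL and cumulative feedback only.

    Earlier drafts and the full chat history are deliberately left out.
    """
    # Drop repeated feedback while preserving first-seen order.
    unique = list(dict.fromkeys(feedback_log))
    numbered = "\n".join(f"{n}. {item}" for n, item in enumerate(unique, 1))
    return (
        f"Question: {question}\n\n"
        f"Previous SQL:\n{prior_sql}\n\n"
        f"Feedback so far:\n{numbered}\n\n"
        "Revise the SQL to address the feedback. Keep parts that already work."
    )


prompt = build_refine_prompt(
    "Total refunds per order?",
    "SELECT o.order_id FROM orders o JOIN refunds r ON r.order_id = o.order_id",
    ["row count 2 != expected 1", "row count 2 != expected 1", "alias r is unused in SELECT"],
)
print(prompt)
```

Deduplicating the feedback log keeps the prompt from growing when the same failure recurs across iterations.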
Pass@1 on the 180-question holdout rose 61% → 79%; median latency 1.8 s → 4.2 s. Analyst override rate on auto-generated queries fell from 34% to 19%. Token spend per successful query increased 2.4× but stayed below the cost of a human rewrite for routine reporting questions.
Stopping rules and cost control
Unbounded refine loops burn budget and can oscillate: iteration 2 fixes a join but breaks an alias, and iteration 3 repairs the alias but reintroduces the join bug. Production guardrails:
- Max iterations — hard cap (2–4) regardless of feedback.
- Monotonic quality gate — only accept a revision if the reward score or pass count strictly improves; else return the previous best.
- Diminishing returns — stop when feedback severity drops below threshold (only “style” nitpicks remain).
- Route by difficulty — simple intents skip refine; reserve loops for flagged complex schemas or low first-pass confidence.
- Cache refined artifacts — identical prompts with same schema version reuse prior SQL from a versioned cache.
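Two of these guardrails, the monotonic quality gate with an iteration cap and the versioned cache, can be sketched together. The function names, the toy scores, and the in-memory dict cache are assumptions for illustration; a real deployment would likely score with tests or a reward model and use a persistent cache.

```python
from typing import Callable, Dict, Tuple


def refine_with_gate(
    draft: str,
    score: Callable[[str], float],
    refine_once: Callable[[str], str],
    max_iters: int = 3,
) -> Tuple[str, float]:
    """Accept a revision only if its score strictly improves; keep best-so-far."""
    best, best_score = draft, score(draft)
    for _ in range(max_iters):  # hard cap regardless of feedback
        candidate = refine_once(best)
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break  # no strict improvement: return the previous best
        best, best_score = candidate, candidate_score
    return best, best_score


_cache: Dict[Tuple[str, str], str] = {}


def cached_refine(prompt: str, schema_version: str, run: Callable[[str], str]) -> str:
    """Reuse a prior result for an identical prompt and schema version."""
    key = (prompt, schema_version)
    if key not in _cache:
        _cache[key] = run(prompt)
    return _cache[key]


# Toy data: v1 improves on v0, v2 regresses, so v1 should be returned.
SCORES = {"v0": 1, "v1": 2, "v2": 1}
NEXT = {"v0": "v1", "v1": "v2", "v2": "v2"}
best, best_score = refine_with_gate("v0", SCORES.__getitem__, NEXT.__getitem__)
print(best, best_score)
```

Including the schema version in the cache key means a schema change naturally invalidates old entries rather than serving stale SQL.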
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Single-shot generation | High volume, low stakes, strong base model | Executable artifacts with silent failure cost |
| Self-refine (2 rounds) | Code, SQL, JSON, API payloads with validators | Pure creative writing with no rubric |
| Self-refine + separate critic model | Shared blind spots between draft and self-critique | Latency budget under 2 s |
| Reflexion memory | Multi-step agents repeating similar failure modes | One-off stateless Q&A |
| Self-refine + CoVe on prose | User-facing explanations atop generated artifacts | Internal-only machine outputs |
Common pitfalls
- Feedback echoes the draft — critic praises instead of listing defects; force JSON issue lists and forbid empty passes when tests fail.
- Refine sees only feedback, not the draft — model rewrites from scratch and loses working parts; always pass prior artifact.
- No sandbox isolation — generated SQL or code runs on production; use read-only replicas and resource limits.
- Oscillating edits — fix A breaks B, fix B breaks A; keep best-so-far by score and cap iterations.
- Self-critique on facts alone — model rationalizes wrong numbers; pair with CoVe or retrieval for factual tiers.
- Unbounded context growth — stuffing all prior drafts into refine prompts; pass latest draft + cumulative issue summary only.
- Refine on every request — latency and cost explode; route by confidence, intent, or failure history.
- No regression suite — prompt tweaks break refine templates; maintain labeled before/after pairs per iteration.
Production checklist
- Define artifact types and validators (tests, linters, schema, execution).
- Implement generate → feedback → refine with structured issue JSON.
- Set max iterations and monotonic quality gate (keep best-so-far).
- Sandbox all executable feedback with timeouts and row/token limits.
- Separate critic model or higher temperature on feedback pass when needed.
- Log each iteration: draft hash, feedback, revised artifact, scores.
- Route simple queries to single-shot; refine only on low confidence or complexity.
- Benchmark pass rate, p95 latency, and token cost on a held-out set.
- Add Reflexion memory for multi-episode agents with summarization caps.
- Re-run evals after schema, rubric, or model version changes.
Key takeaways
- Self-refine improves a single draft through critique-and-revise loops — best when feedback is objective (tests, execution, rubrics), not when quality is purely subjective.
- Harbor Analytics lifted SQL pass rate from 61% to 79% with up to three sandboxed refine rounds at 2.4× token cost.
- Reflexion adds episodic verbal memory across agent episodes; self-refine polishes within one request — use both in long-horizon agents.
- Stopping rules and best-so-far tracking prevent oscillation and runaway latency; route easy intents away from refine paths.
- Pair self-refine with chain-of-verification for user-facing factual prose; self-consistency for discrete choices among final candidates.
Related reading
- LLM chain-of-verification explained — factual claim decomposition vs holistic artifact polish
- LLM-as-judge explained — separate evaluators for rubric scoring
- LLM test-time compute explained — inference scaling beyond refine loops
- LLM agent memory explained — storing Reflexion-style lessons across sessions