Guide

LLM self-refine explained

Harbor Analytics' natural-language-to-SQL assistant shipped a query that double-counted refunds: the model joined orders to refunds without deduplicating partial returns. The first draft looked plausible (correct table names, readable aliases) but failed on a hidden test suite, and a single chain-of-thought pass did not catch the join bug. Adding a self-refine loop (Madaan et al., 2023), which generates SQL, runs it against a sandbox, feeds errors and row-count mismatches back as feedback, and revises, lifted the pass rate on 180 held-out questions from 61% to 79% at roughly 2.4× token cost. Self-refine is not a method for fact-checking policy claims; it is iterative quality improvement on a single artifact until objective or rubric-based critique says stop.

Unlike self-consistency, which samples independent answers and votes, self-refine edits one draft in place using structured feedback. Unlike chain-of-verification, which decomposes factual claims into isolated Q&A, self-refine optimizes holistic properties: runnable code, tone, length, schema compliance, or aggregate metrics. This guide covers the draft–feedback–revise pattern, Reflexion-style episodic memory for agents, feedback sources (model, tests, humans), the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

What self-refine is

Self-refine is a multi-turn inference pattern with three roles played by the same or different models:

Generate — produce an initial output (code, SQL, email, plan, summary) from the user prompt and context.
Feedback — critique the draft against explicit criteria: unit tests, linters, execution traces, rubric scores, or a natural-language review prompt listing defects.
Refine — revise the draft conditioned on feedback, without discarding useful parts of the prior version unless the critique demands it.

The loop repeats until feedback is empty, a quality threshold is met, or a max iteration cap (typically 2–4 rounds) is hit. No gradient updates occur; improvement comes entirely from additional forward passes and external validators. The pattern generalizes across modalities as long as you can produce machine-readable feedback the model can act on.
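The generate, feedback, refine loop described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation from Madaan et al.: the function names, the RefineResult container, and the toy stand-ins for model calls are all hypothetical, and feedback is assumed to arrive as a list of strings where an empty list means "stop".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RefineResult:
    artifact: str
    rounds: int = 0                  # number of refine calls actually made
    converged: bool = False          # True if the last feedback pass was empty
    history: List[List[str]] = field(default_factory=list)  # feedback per pass


def self_refine(
    prompt: str,
    generate: Callable[[str], str],
    feedback: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> RefineResult:
    """Draft once, then critique and revise until feedback is empty or the cap is hit."""
    result = RefineResult(artifact=generate(prompt))
    # max_rounds + 1 feedback passes, so the final revision is also checked.
    for round_no in range(max_rounds + 1):
        issues = feedback(result.artifact)
        result.history.append(issues)
        if not issues:
            result.converged = True
            break
        if round_no == max_rounds:   # cap reached: return the last draft unconverged
            break
        # Revise the existing draft instead of regenerating from the prompt.
        result.artifact = refine(result.artifact, issues)
        result.rounds += 1
    return result


# Toy stand-ins for model calls (hypothetical): the critic flags a missing DISTINCT.
def toy_generate(prompt: str) -> str:
    return "SELECT COUNT(order_id) FROM orders"


def toy_feedback(draft: str) -> List[str]:
    return [] if "DISTINCT" in draft else ["Add DISTINCT on order_id"]


def toy_refine(draft: str, issues: List[str]) -> str:
    return draft.replace("COUNT(order_id)", "COUNT(DISTINCT order_id)")


out = self_refine("count orders", toy_generate, toy_feedback, toy_refine)
print(out.artifact, out.rounds, out.converged)
```

In a real system the three callables would wrap model calls or validators; keeping them as plain functions makes the loop easy to test with stubs like the ones here.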

Feedback sources: who critiques the draft?

Model-as-critic (self-critique)

The same LLM receives the draft plus a rubric (“List concrete errors in logic, missing edge cases, and style violations”). This is cheap to wire, but the critic and generator share blind spots: both may miss the same join bug. Mitigate with temperature diversity (a higher temperature on the feedback pass) or a separate judge model.

Executable feedback (strongest for code and SQL)

Run the artifact: pytest, TypeScript compiler, SQL engine against a scratch schema, JSON schema validator. Failures become verbatim feedback strings (“column refund_id ambiguous”). Deterministic and auditable; requires sandboxing and timeout guards.
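As a rough sketch of executable feedback for SQL, the snippet below runs a draft against a throwaway in-memory SQLite database and returns engine errors verbatim. SQLite stands in for whatever engine is actually in use, and the two-table schema, the row limit, and the function name execution_feedback are assumptions made for illustration.

```python
import sqlite3
import time
from typing import List

# Hypothetical scratch schema; a real setup would mirror the production schema.
SCRATCH_SCHEMA = """
CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE refunds (refund_id INTEGER PRIMARY KEY, order_id INTEGER, amount REAL);
"""


def execution_feedback(sql: str, timeout_s: float = 5.0, max_rows: int = 1000) -> List[str]:
    """Run sql on a throwaway database and return errors as feedback strings."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCRATCH_SCHEMA)
        # Timeout guard: the handler runs every 10,000 VM instructions,
        # and a nonzero return value aborts the running query.
        deadline = time.monotonic() + timeout_s
        conn.set_progress_handler(
            lambda: 1 if time.monotonic() > deadline else 0, 10_000
        )
        rows = conn.execute(sql).fetchmany(max_rows + 1)
        if len(rows) > max_rows:
            return [f"query returned more than {max_rows} rows"]
        return []
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return [str(exc)]            # verbatim engine message becomes feedback
    finally:
        conn.close()


draft = "SELECT order_id FROM orders JOIN refunds ON orders.order_id = refunds.order_id"
for message in execution_feedback(draft):
    print(message)
```

An in-memory database is discarded after each call, which gives isolation for free in this sketch; a shared replica would additionally need read-only credentials.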

Reward and outcome signals

Scalar scores from outcome reward models or business metrics (click-through on generated titles). Use when no single test defines correctness but you can rank versions.

Human-in-the-loop refinement

Editors mark up drafts; the model revises from comments. Highest quality for brand-sensitive copy; does not scale to high-volume automation unless editors review only a sample of outputs.

Self-refine vs Reflexion and agent memory

Reflexion (Shinn et al., 2023) extends self-refine across episodes in an agent loop: after a failed web-navigation or coding task, the model writes a short verbal reflection (“I should check the cart total before checkout”) stored in episodic memory and prepended to the next attempt's prompt. Self-refine operates within one user request; Reflexion carries lessons to the next trial of a similar task.

Pair them in production agents: self-refine polishes the current tool call or code block; Reflexion memory reduces repeated mistakes across sessions. Cap memory size and summarize old reflections to avoid context bloat. See agent memory for vector vs scratchpad storage tradeoffs.
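A capped reflection store might look like the sketch below. It is not the Reflexion authors' code: the ReflectionMemory class and its prompt layout are hypothetical, and it simply drops the oldest reflection when full, where a production system would more likely summarize old entries as the text suggests.

```python
from collections import deque
from typing import Deque


class ReflectionMemory:
    """Bounded store of verbal reflections; the oldest entry is dropped first."""

    def __init__(self, max_items: int = 5) -> None:
        self._items: Deque[str] = deque(maxlen=max_items)

    def add(self, reflection: str) -> None:
        reflection = reflection.strip()
        # Skip blanks and exact repeats so the cap is spent on distinct lessons.
        if reflection and reflection not in self._items:
            self._items.append(reflection)

    def build_prompt(self, task: str) -> str:
        """Prepend stored lessons to the next attempt's task description."""
        if not self._items:
            return task
        lessons = "\n".join(f"- {item}" for item in self._items)
        return f"Lessons from earlier attempts:\n{lessons}\n\nTask: {task}"


memory = ReflectionMemory(max_items=2)
memory.add("I should check the cart total before checkout")
memory.add("Confirm the shipping address form has saved")
print(memory.build_prompt("Buy one unit of the item in the cart"))
```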

Designing effective feedback prompts

Vague feedback (“make it better”) produces vague revisions. Strong feedback prompts are:

Specific — cite line numbers, failing test names, or rubric dimensions (accuracy, brevity, tone).
Actionable — each bullet maps to one edit (“Add DISTINCT on order_id before join”).
Prioritized — blockers first (syntax errors), polish last (word choice).
Bounded — cap feedback items (5–8) so refine prompts stay focused.

Use structured outputs for feedback JSON: {"issues": [{"severity": "error", "location": "JOIN", "fix": "..."}]}. The refine step consumes the same schema so orchestration code can halt on zero error-severity items.
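The orchestration side of that schema can be sketched as follows. The severity levels other than "error" ("warning", "style") and the helper names are assumptions; the example shows blockers sorted first, the item cap applied, and a halt check on zero error-severity items.

```python
import json
from typing import Dict, List

# Assumed severity levels; only "error" appears in the schema example above.
SEVERITY_ORDER = {"error": 0, "warning": 1, "style": 2}


def parse_feedback(raw: str, max_items: int = 8) -> List[Dict[str, str]]:
    """Parse critic JSON, put blockers first, and cap the number of issues."""
    issues = json.loads(raw).get("issues", [])
    ranked = sorted(
        issues,
        # Unknown or missing severities sort after all known levels.
        key=lambda issue: SEVERITY_ORDER.get(issue.get("severity"), len(SEVERITY_ORDER)),
    )
    return ranked[:max_items]


def should_halt(issues: List[Dict[str, str]]) -> bool:
    """Stop refining once no error-severity issue remains."""
    return not any(issue.get("severity") == "error" for issue in issues)


raw = json.dumps({
    "issues": [
        {"severity": "style", "location": "SELECT", "fix": "Use a clearer alias"},
        {"severity": "error", "location": "JOIN", "fix": "Add DISTINCT on order_id before join"},
    ]
})
issues = parse_feedback(raw)
print(issues[0]["location"], should_halt(issues))
```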

Self-refine vs other inference patterns

Approach	Mechanism	Best when	Weak when
Self-refine	Sequential edit from critique	Artifact has testable quality (code, SQL, structured prose)	No objective feedback signal
Chain-of-verification	Atomic fact Q&A, then revise claims	Multi-claim factual answers	Holistic style or algorithmic correctness
Self-consistency	N independent samples, vote	Discrete answers, diverse reasoning helps	Long artifacts where voting is ill-defined
Best-of-N + reward model	Sample N, pick highest score	Cheap scoring, no sequential dependency	Revisions must build on prior partial work
Test-time compute search	Tree/beam over partial solutions	Combinatorial planning, math proofs	Simple single-shot fixes suffice

Stacks combine layers: self-consistency picks among final self-refined candidates; CoVe runs on user-facing prose after SQL passes execution tests. Read test-time compute for the broader inference-scaling menu.

Harbor Analytics SQL assistant refactor

Before the refactor, the NL→SQL path was: retrieve schema snippets, run one CoT completion, return SQL to the analyst. After the refactor:

Draft — model outputs SQL plus brief rationale in JSON.
Execute — read-only replica sandbox with 5 s timeout and row-limit guard.
Feedback — compiler errors, row-count vs expected fixture, or model critic on result shape (“returns 40k rows for a single-SKU question”).
Refine — up to 3 iterations; each refine sees prior SQL and cumulative feedback, not the full chat history.
Escalate — if still failing, return best attempt plus error log to the analyst instead of silent wrong answers.
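To make the fixture-comparison part of the Feedback step concrete, here is a small sketch that reproduces a fan-out join of the kind described in the introduction. The tables, rows, and queries are invented for illustration and are not Harbor Analytics' schema; SQLite stands in for the sandbox engine.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical fixture: order 1 has two partial refunds, order 2 has none.
FIXTURE_SQL = """
CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE refunds (refund_id INTEGER PRIMARY KEY, order_id INTEGER, amount REAL);
INSERT INTO orders  VALUES (1, 100.0), (2, 50.0);
INSERT INTO refunds VALUES (10, 1, 10.0), (11, 1, 20.0);
"""

# Question: "How many orders were refunded?" The expected answer is one order.
EXPECTED: List[Tuple[int]] = [(1,)]
DRAFT_SQL = "SELECT COUNT(o.order_id) FROM orders o JOIN refunds r ON r.order_id = o.order_id"
REVISED_SQL = "SELECT COUNT(DISTINCT o.order_id) FROM orders o JOIN refunds r ON r.order_id = o.order_id"


def fixture_feedback(sql: str, expected: List[Tuple[int]]) -> List[str]:
    """Run sql against the fixture and describe any mismatch with the expected rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(FIXTURE_SQL)
        actual = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return [f"execution error: {exc}"]
    finally:
        conn.close()
    if actual != expected:
        return [f"result mismatch: expected {expected}, got {actual}"]
    return []


print(fixture_feedback(DRAFT_SQL, EXPECTED))    # the join counts order 1 twice
print(fixture_feedback(REVISED_SQL, EXPECTED))  # DISTINCT removes the duplicate
```

The draft runs without any engine error, which is why execution alone would not have caught it; only the comparison against a known answer turns the silent double count into feedback the refine step can act on.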

Pass@1 on the 180-question holdout rose from 61% to 79%, while median latency rose from 1.8 s to 4.2 s. Analyst override rate on auto-generated queries fell from 34% to 19%. Token spend per successful query increased 2.4× but stayed below the cost of a human rewrite for routine reporting questions.
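Metrics like these can be computed from per-question run records. The sketch below is one plausible way to do it, with made-up sample data rather than Harbor Analytics' results; the QueryRun fields are hypothetical, and "tokens per success" is assumed to mean total tokens spent divided by the number of passing queries.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class QueryRun:
    passed: bool        # did the final SQL pass the held-out test?
    latency_s: float    # end-to-end latency including refine rounds
    tokens: int         # total tokens across all rounds


def summarize(runs: List[QueryRun]) -> Dict[str, float]:
    """Aggregate pass rate, median latency, and token spend per successful query."""
    if not runs:
        raise ValueError("no runs to summarize")
    successes = sum(run.passed for run in runs)
    total_tokens = sum(run.tokens for run in runs)
    return {
        "pass_rate": successes / len(runs),
        "median_latency_s": median(run.latency_s for run in runs),
        # Failed attempts still cost tokens, so they count in the numerator.
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
    }


# Illustrative data only.
sample = [
    QueryRun(True, 1.0, 100),
    QueryRun(False, 3.0, 300),
    QueryRun(True, 2.0, 200),
    QueryRun(True, 4.0, 400),
]
stats = summarize(sample)
print(stats)
```

Running the same summary on the single-shot and self-refine paths over an identical question set gives the before/after ratios directly.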

Stopping rules and cost control

Unbounded refine loops burn budget and can oscillate: iteration 2 fixes a join but breaks an alias, and iteration 3 repairs the alias but reintroduces the join bug. Production guardrails:

Max iterations — hard cap (2–4) regardless of feedback.
Monotonic quality gate — only accept a revision if the reward score or pass count strictly improves; else return the previous best.
Diminishing returns — stop when feedback severity drops below threshold (only “style” nitpicks remain).
Route by difficulty — simple intents skip refine; reserve loops for flagged complex schemas or low first-pass confidence.
Cache refined artifacts — identical prompts with same schema version reuse prior SQL from a versioned cache.
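The iteration cap and the monotonic quality gate from the list above combine naturally into one loop. The following is a minimal sketch under the assumption that a numeric score (for example, the count of passing tests) is available for every candidate; refine_with_gate and the toy version table are hypothetical names.

```python
from typing import Callable, Tuple


def refine_with_gate(
    draft: str,
    score: Callable[[str], float],
    revise: Callable[[str], str],
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Accept a revision only if its score strictly improves; otherwise keep the best so far."""
    best, best_score = draft, score(draft)
    for _ in range(max_rounds):          # hard cap regardless of feedback
        candidate = revise(best)         # always revise from the best-known artifact
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break                        # plateau or regression: stop and keep previous best
        best, best_score = candidate, candidate_score
    return best, best_score


# Toy scripted run: v1 improves on v0, then v2 regresses (fix A breaks B).
SCORES = {"v0": 1, "v1": 3, "v2": 2}
NEXT_VERSION = {"v0": "v1", "v1": "v2"}

result = refine_with_gate("v0", SCORES.__getitem__, NEXT_VERSION.__getitem__)
print(result)
```

Stopping at the first non-improving revision is the strict reading of the gate; a more lenient variant could keep iterating from the best-so-far artifact until the cap is reached.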

Technique decision table

Approach	Best when	Skip when
Single-shot generation	High volume, low stakes, strong base model	Executable artifacts with silent failure cost
Self-refine (2 rounds)	Code, SQL, JSON, API payloads with validators	Pure creative writing with no rubric
Self-refine + separate critic model	Shared blind spots between draft and self-critique	Latency budget under 2 s
Reflexion memory	Multi-step agents repeating similar failure modes	One-off stateless Q&A
Self-refine + CoVe on prose	User-facing explanations atop generated artifacts	Internal-only machine outputs

Common pitfalls

Feedback echoes the draft — critic praises instead of listing defects; force JSON issue lists and forbid empty passes when tests fail.
Refine sees only feedback, not the draft — model rewrites from scratch and loses working parts; always pass prior artifact.
No sandbox isolation — generated SQL or code runs on production; use read-only replicas and resource limits.
Oscillating edits — fix A breaks B, fix B breaks A; keep best-so-far by score and cap iterations.
Self-critique on facts alone — model rationalizes wrong numbers; pair with CoVe or retrieval for factual tiers.
Unbounded context growth — stuffing all prior drafts into refine prompts; pass latest draft + cumulative issue summary only.
Refine on every request — latency and cost explode; route by confidence, intent, or failure history.
No regression suite — prompt tweaks break refine templates; maintain labeled before/after pairs per iteration.
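Two of the pitfalls above, losing the draft and unbounded context growth, both come down to how the refine prompt is assembled. One possible shape is sketched below; the function name, the prompt wording, and the case-insensitive deduplication rule are assumptions, not a prescribed template.

```python
from typing import List


def build_refine_prompt(
    task: str,
    latest_draft: str,
    feedback_rounds: List[List[str]],
    max_issues: int = 8,
) -> str:
    """Build a refine prompt from the latest draft plus a deduplicated issue summary."""
    seen = set()
    issues: List[str] = []
    for round_issues in feedback_rounds:
        for issue in round_issues:
            key = issue.strip().lower()
            if key and key not in seen:       # drop repeats across rounds
                seen.add(key)
                issues.append(issue.strip())
    issues = issues[-max_issues:]             # keep only the most recent distinct issues
    issue_lines = "\n".join(f"{n}. {text}" for n, text in enumerate(issues, start=1))
    # Earlier drafts are deliberately left out; only the latest one is passed.
    return (
        f"Task: {task}\n\n"
        f"Current draft:\n{latest_draft}\n\n"
        f"Issues to fix:\n{issue_lines}\n\n"
        "Revise the draft to fix these issues and keep everything that already works."
    )


prompt = build_refine_prompt(
    "Count refunded orders",
    "SELECT COUNT(o.order_id) FROM orders o JOIN refunds r ON r.order_id = o.order_id",
    [["Add DISTINCT on order_id"], ["add distinct on order_id", "Alias o is undefined"]],
)
print(prompt)
```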

Production checklist

Define artifact types and validators (tests, linters, schema, execution).
Implement generate → feedback → refine with structured issue JSON.
Set max iterations and monotonic quality gate (keep best-so-far).
Sandbox all executable feedback with timeouts and row/token limits.
Separate critic model or higher temperature on feedback pass when needed.
Log each iteration: draft hash, feedback, revised artifact, scores.
Route simple queries to single-shot; refine only on low confidence or complexity.
Benchmark pass rate, p95 latency, and token cost on a held-out set.
Add Reflexion memory for multi-episode agents with summarization caps.
Re-run evals after schema, rubric, or model version changes.
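For the per-iteration logging item in the checklist, a record could look like the sketch below. The field names and the JSON-lines format are assumptions; the draft is hashed with SHA-256 so iterations can be matched without storing the draft twice, while the revised artifact is kept in full.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class IterationLog:
    request_id: str
    iteration: int
    draft_sha256: str       # hash of the draft that was critiqued
    feedback: List[str]     # issues raised against that draft
    revised_artifact: str   # output of the refine step
    score: float            # validator or reward score of the revision


def log_iteration(
    request_id: str,
    iteration: int,
    draft: str,
    feedback: List[str],
    revised: str,
    score: float,
) -> str:
    """Serialize one refine iteration as a single JSON line."""
    record = IterationLog(
        request_id=request_id,
        iteration=iteration,
        draft_sha256=hashlib.sha256(draft.encode("utf-8")).hexdigest(),
        feedback=feedback,
        revised_artifact=revised,
        score=score,
    )
    return json.dumps(asdict(record), sort_keys=True)


line = log_iteration(
    "req-1", 1, "SELECT 1", ["missing DISTINCT"], "SELECT DISTINCT 1", 0.5
)
print(line)
```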

Key takeaways

Self-refine improves a single draft through critique-and-revise loops — best when feedback is objective (tests, execution, rubrics), not when quality is purely subjective.
Harbor Analytics lifted SQL pass rate from 61% to 79% with up to three sandboxed refine rounds at 2.4× token cost.
Reflexion adds episodic verbal memory across agent episodes; self-refine polishes within one request — use both in long-horizon agents.
Stopping rules and best-so-far tracking prevent oscillation and runaway latency; route easy intents away from refine paths.
Pair self-refine with chain-of-verification for user-facing factual prose; self-consistency for discrete choices among final candidates.