Guide

LLM Reflexion explained

Harbor Support's internal deployment agent kept failing the same staging checklist: it ran database migrations before verifying that the feature flag was enabled, rolled back cleanly on the first error, and then repeated the same ordering mistake on the next ticket. Fine-tuning on 400 past runbooks improved tone but not procedure. Adding a Reflexion loop (Shinn et al., 2023) raised first-try success on 52 recurring playbooks from 38% to 71% without any gradient update. In that loop, after each failed run the agent writes a short verbal reflection (“Always confirm flag state in /admin/flags before migrate up”), stores it in episodic memory, and prepends the accumulated lessons to the next attempt.

Reflexion is verbal reinforcement learning: the model learns from trial outcomes by generating natural-language critiques of its own behavior and reusing them as context on subsequent trials. Unlike self-refine, which polishes one artifact inside a single user request, Reflexion operates across episodes: failed web navigation, coding tasks, or tool chains where the environment returns a binary or scalar success signal. This guide covers the actor–evaluator–self-reflection loop, episodic memory design, pairing with tool-using agents, a comparison with self-refine and RLHF, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist. It builds on agent memory fundamentals.

What Reflexion is

Classical reinforcement learning updates neural weights from numeric rewards. Reflexion keeps weights frozen and instead treats language as the learning signal. Each trial runs through five steps:

  1. Act — the agent (actor LLM) executes a plan: ReAct tool calls, browser actions, or multi-step code edits against an environment with observable state.
  2. Evaluate — the environment or a separate evaluator returns success/failure, test results, or a score. No gradient flows from this signal.
  3. Reflect — on failure (and optionally on partial success), the model writes a concise self-reflection: what went wrong, which assumption was false, what to try differently.
  4. Remember — reflections append to an episodic memory buffer (often a scratchpad list, sometimes embedded for retrieval).
  5. Retry — the next trial on the same or similar task prepends memory to the prompt so the actor starts with prior lessons.
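
The five steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `actor`, `evaluator`, and `reflector` are hypothetical callables standing in for LLM calls and environment checks, and the toy stand-ins exist only so the sketch runs without a model.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: in a real agent these wrap LLM calls and an environment.
Actor = Callable[[str, List[str]], str]          # (task, memory) -> trajectory
Evaluator = Callable[[str], Tuple[bool, str]]    # trajectory -> (success, feedback)
Reflector = Callable[[str, str], str]            # (trajectory, feedback) -> lesson


def reflexion_loop(task: str, actor: Actor, evaluator: Evaluator,
                   reflector: Reflector, max_trials: int = 3) -> Tuple[bool, List[str]]:
    """Run act -> evaluate -> reflect -> remember -> retry with frozen weights."""
    memory: List[str] = []                            # episodic memory buffer
    for _ in range(max_trials):
        trajectory = actor(task, memory)              # 1. Act, with memory as context
        success, feedback = evaluator(trajectory)     # 2. Evaluate
        if success:
            return True, memory
        lesson = reflector(trajectory, feedback)      # 3. Reflect
        memory.append(lesson)                         # 4. Remember
    return False, memory                              # 5. Retries exhausted


# Toy stand-ins so the sketch runs without a model.
def toy_actor(task: str, memory: List[str]) -> str:
    return "check_flag; migrate" if memory else "migrate"


def toy_evaluator(trajectory: str) -> Tuple[bool, str]:
    ok = trajectory.startswith("check_flag")
    return ok, ("" if ok else "migrate failed: flag disabled")


def toy_reflector(trajectory: str, feedback: str) -> str:
    return "Confirm flag state before running migrate."


ok, lessons = reflexion_loop("deploy", toy_actor, toy_evaluator, toy_reflector)
```

In the toy run the first trial fails, one lesson is stored, and the second trial succeeds because the actor sees that lesson.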

The insight from the original paper: LLMs already encode procedural knowledge; explicit verbal feedback nudges sampling toward better action sequences without expensive fine-tuning. Gains compound when tasks share structure (navigation, API orchestration, unit-test-driven coding) even if surface details differ.

The Reflexion loop in detail

Actor policy

The actor is typically the same chat model used for agentic tool use, prompted with task description, tool schemas, current observations, and the episodic memory block. Temperature is moderate (0.2–0.7): low enough for reliable tool JSON, high enough to explore alternate plans after reflection.

Environment feedback

Reflexion needs ground-truth or executable feedback, not vibes. Examples: unit tests pass/fail, web benchmark success bit, SQL row count match, CI pipeline green/red. Scalar rewards work but binary failure triggers are easier to wire. The evaluator can be code; the reflector is always language.
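
A code-based evaluator of the kind described above might look like the following sketch. The `Verdict` type, the `evaluate` function, and the two check names are hypothetical; the point is only that the success bit comes from executable checks and that the raw failure text is kept for the reflector.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Verdict:
    success: bool
    feedback: str   # raw evidence handed to the reflection step


def evaluate(observation: Dict[str, object],
             checks: Dict[str, Callable[[Dict[str, object]], bool]]) -> Verdict:
    """Binary evaluator: every named check must pass; failed checks are listed by name."""
    failed: List[str] = [name for name, check in checks.items() if not check(observation)]
    if failed:
        return Verdict(False, "failed checks: " + ", ".join(failed))
    return Verdict(True, "all checks passed")


# Hypothetical checks for a SQL task followed by a CI run.
checks = {
    "row_count_matches": lambda obs: obs["row_count"] == obs["expected_rows"],
    "ci_green": lambda obs: obs["ci_status"] == "green",
}
verdict = evaluate({"row_count": 10, "expected_rows": 12, "ci_status": "green"}, checks)
```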

Self-reflection prompt

Effective reflection prompts ask for structured prose, not apologies:

  • What was the intended outcome vs what happened?
  • Which specific action or assumption caused failure?
  • What concrete rule should guide the next attempt?

Cap reflection length (50–150 tokens) to control context growth. Ban generic advice (“be more careful”) in the prompt template.
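
One way to encode these constraints is a prompt template plus a cheap acceptance check on the model's answer. The sketch below is an assumption-laden example: the template wording, the `GENERIC_PHRASES` list, and the use of a word count as a rough proxy for the token cap are all illustrative choices, not part of the original method.

```python
import re

MAX_WORDS = 100  # assumption: word count as a rough proxy for the 50-150 token cap

REFLECTION_PROMPT = """You attempted the task below and it failed.
Task: {task}
Evaluator output: {feedback}
In at most {max_words} words, answer:
1. What was the intended outcome vs what happened?
2. Which specific action or assumption caused failure?
3. What concrete rule should guide the next attempt?
Do not apologize. Do not give generic advice such as "be more careful"."""

# Illustrative ban list; extend it with phrases seen in your own logs.
GENERIC_PHRASES = ("be more careful", "try harder", "double-check everything")


def build_reflection_prompt(task: str, feedback: str) -> str:
    return REFLECTION_PROMPT.format(task=task, feedback=feedback, max_words=MAX_WORDS)


def accept_reflection(text: str) -> bool:
    """Reject reflections that are empty, too long, or contain banned generic advice."""
    words = re.findall(r"\S+", text)
    if not words or len(words) > MAX_WORDS:
        return False
    lowered = text.lower()
    return not any(phrase in lowered for phrase in GENERIC_PHRASES)
```

A rejected reflection can be regenerated once or simply dropped; storing nothing is safer than storing a vague lesson.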

Episode limits

Papers often allow 3–5 trials per task. Production agents need hard caps plus backoff to humans. Track diminishing returns: if trial 3 repeats trial 2's mistake, escalate rather than loop forever.
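
A hard cap with a repeated-mistake check can be a few lines. In this sketch the "failure signature" is a hypothetical convention (any stable string per failed trial, for example tool name plus error text), and `MAX_TRIALS = 4` is an assumed value inside the 3–5 range mentioned above.

```python
from typing import List

MAX_TRIALS = 4  # assumption: a value within the 3-5 range cited above


def next_step(failure_signatures: List[str], max_trials: int = MAX_TRIALS) -> str:
    """Decide whether to retry or hand off, given one signature per failed trial."""
    if len(failure_signatures) >= max_trials:
        return "escalate: trial cap reached"
    # Two identical failures in a row mean the last reflection did not help.
    if len(failure_signatures) >= 2 and failure_signatures[-1] == failure_signatures[-2]:
        return "escalate: same mistake repeated"
    return "retry"
```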

Episodic memory design

Memory is the persistence layer that makes Reflexion more than repeated sampling. Three common patterns:

Scratchpad list (simplest)

Append each reflection as a bullet in fixed order. Prepend the full list to every new trial on the same task ID. Works for short episodes; context grows linearly. Summarize when bullets exceed N items or M tokens.
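
A scratchpad of this kind could be sketched as below. The `Scratchpad` class and its `summarize` hook are hypothetical; in practice the summarizer would be a model call, while the toy version here just joins the bullets so the example runs on its own.

```python
from typing import Callable, Dict, List


class Scratchpad:
    """Per-task list of reflections, collapsed into one summary when it grows too long."""

    def __init__(self, max_items: int, summarize: Callable[[List[str]], str]):
        self.max_items = max_items
        self.summarize = summarize          # hypothetical: an LLM call in practice
        self._notes: Dict[str, List[str]] = {}

    def append(self, task_id: str, reflection: str) -> None:
        notes = self._notes.setdefault(task_id, [])
        notes.append(reflection)
        if len(notes) > self.max_items:
            # Replace the whole list with a single summary bullet.
            self._notes[task_id] = [self.summarize(notes)]

    def render(self, task_id: str) -> str:
        """Text block to prepend to the next trial's prompt; empty if no lessons yet."""
        notes = self._notes.get(task_id, [])
        if not notes:
            return ""
        return "Lessons from earlier attempts:\n" + "\n".join(f"- {n}" for n in notes)


# Toy summarizer: joins bullets; a real one would ask the model to compress them.
pad = Scratchpad(max_items=2, summarize=lambda notes: " / ".join(notes))
pad.append("playbook-7", "Check flag before migrate.")
pad.append("playbook-7", "Wait for health endpoint before posting to Slack.")
block = pad.render("playbook-7")
```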

Task-conditioned retrieval

Embed reflections; on a new task, retrieve top-k memories by similarity to task description or tool subgraph. Reduces noise when the agent handles diverse tickets. See agent memory for vector store tradeoffs.
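
The retrieval step can be illustrated without a vector store. In this sketch `embed` is a deliberate stand-in (lowercase word counts) for a real embedding model, and `retrieve` is a hypothetical helper; only the top-k-by-cosine-similarity logic is the point.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts. Swap in a real embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(task: str, reflections: List[str], k: int = 2) -> List[str]:
    """Return up to k stored reflections most similar to the task description."""
    query = embed(task)
    scored: List[Tuple[float, str]] = [(cosine(query, embed(r)), r) for r in reflections]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored[:k] if score > 0]   # drop zero-overlap memories


memory = [
    "Confirm feature flag state before running a database migration.",
    "Post the Slack summary only after the deploy health check passes.",
    "Quote table names in generated SQL.",
]
top = retrieve("Run database migration for the billing feature flag", memory, k=1)
```

Dropping zero-similarity results means an unrelated ticket gets an empty memory block rather than noise.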

Hierarchical summaries

After K failures on related tasks, compress bullets into a single “playbook addendum” paragraph. Harbor Support uses this nightly to merge deployment reflections into a versioned runbook snippet engineers can audit.

What not to store: full action traces unless debugging. Reflections should be distilled lessons, not replay logs. Strip secrets and PII from memory before cross-ticket reuse.
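
Redaction before append can start as a small set of regular expressions. The patterns below are assumptions chosen for illustration (credential-like `key=value` pairs and email addresses); they will not catch every secret format and should be extended for the logs actually in use.

```python
import re

# Assumed patterns; extend to match the credential formats in your own logs.
_PATTERNS = [
    (re.compile(r"(password|token|secret|api[_-]?key)\s*[=:]\s*[^\s;,]+", re.IGNORECASE),
     r"\1=[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
]


def redact(reflection: str) -> str:
    """Mask credential-like assignments and email addresses before storing a reflection."""
    for pattern, replacement in _PATTERNS:
        reflection = pattern.sub(replacement, reflection)
    return reflection


clean = redact("migrate failed with DB_PASSWORD=hunter2; notify ops@example.com")
```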

Reflexion vs self-refine vs RLHF

Teams conflate these because all improve behavior without a user rewriting prompts each time. The boundaries matter for architecture:

  • Self-refine — same request, same artifact, multiple draft–critique–revise turns until tests pass. No cross-episode memory required.
  • Reflexion — multiple trials, environment reset between attempts, verbal memory carries forward. Best for sequential decision problems.
  • RLHF / DPO — offline weight updates from preference data. Amortizes learning across all users but needs datasets and GPU training. Reflexion is inference-time only.

Production stacks often combine them: Reflexion memory reduces repeated mistakes across tickets; self-refine polishes the current SQL or config file before submission; periodic DPO on logged preferences slowly shifts base behavior.

Harbor Support deployment agent refactor

Harbor's agent orchestrates five tools: fetch ticket context, read feature flags, run migrations, deploy container revision, post Slack summary. Before Reflexion, failures clustered on ordering and missing prechecks.

Changes shipped

  • Binary evaluator — staging smoke test must return HTTP 200 on health endpoints within 120s; anything else is failure.
  • Reflection template — three sentences max, must cite tool name and observed error string.
  • Per-playbook memory — scratchpad keyed by playbook_id; cleared after success or human takeover.
  • Self-refine on artifacts — generated shell scripts still pass through a sandbox refine loop before execution (orthogonal layer).
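
The binary evaluator in the first bullet could be approximated as below. This is a sketch under assumptions, not Harbor's code: `fetch_status` is a hypothetical injected function that returns an HTTP status code for a URL, the 5-second poll interval is invented, and the clock and sleep functions are parameters so the logic can be tested without waiting.

```python
import time
from typing import Callable, Iterable


def smoke_test(endpoints: Iterable[str],
               fetch_status: Callable[[str], int],
               deadline_s: float = 120.0,
               poll_s: float = 5.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Binary evaluator: True only if every endpoint returns 200 before the deadline."""
    pending = set(endpoints)
    start = clock()
    while pending:
        # Keep only the endpoints that are still unhealthy.
        pending = {url for url in pending if fetch_status(url) != 200}
        if not pending:
            return True
        if clock() - start + poll_s > deadline_s:
            return False   # anything other than all-200 in time counts as failure
        sleep(poll_s)
    return True            # no endpoints given: nothing to fail
```

Injecting `fetch_status` keeps the evaluator independent of any particular HTTP client and makes the pass/fail rule easy to unit test.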

Results

First-try success rose from 38% to 71% on 52 playbooks over two weeks. Median trials dropped from 2.8 to 1.4. Token cost per successful deploy rose ~1.9× because of reflection passes on failures, but engineer intervention time fell 44%. The false-success rate (green smoke test, broken prod) was unchanged: Reflexion does not replace monitoring.

Technique decision table

| Approach | Best when | Skip when |
| --- | --- | --- |
| Reflexion episodic memory | Multi-step agents with resettable envs; repeated failure modes across similar tasks | One-shot Q&A; no reliable success signal |
| Self-refine within episode | Single artifact must pass tests (SQL, code, config) before commit | Pure planning with no editable draft |
| ReAct without memory | Low-latency tool chains; failures are rare and random | Agents retry the same mistake on every ticket |
| RLHF / DPO fine-tune | Stable preference signal at scale; budget for training pipelines | Rapidly changing tools/APIs; need same-day behavior change |
| Process reward models | Step-level scoring for reasoning chains | Sparse env feedback already sufficient for reflect trigger |
| Human-written playbooks only | Regulated procedures with legal sign-off | Long tail of edge cases engineers cannot enumerate |

Common pitfalls

  • Reflecting without ground truth — model rationalizes failures incorrectly. Always anchor reflection to evaluator output, not self-judgment alone.
  • Memory pollution — wrong lessons persist and bias future trials. Expire memory on playbook version bumps; let humans delete bad reflections.
  • Context bloat — unbounded scratchpads crowd out tool schemas. Summarize or retrieve top-k reflections.
  • Overfitting to one task — hyper-specific reflections (“click button 7”) do not generalize. Prompt for transferable rules.
  • Infinite retry loops — cap trials; detect oscillation (A → B → A action patterns).
  • Skipping self-refine — Reflexion fixes strategy; executable artifacts may still need in-episode polish before run.
  • Storing secrets in memory — reflections echo env vars or tokens from logs. Redact before append.
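
The oscillation check mentioned in the list above (A → B → A) can be sketched as a test on the most recent trial plans. The function name and the idea of comparing plans as plain strings are assumptions; a real agent might compare normalized tool-call sequences instead.

```python
from typing import Sequence


def is_oscillating(plans: Sequence[str], window: int = 3) -> bool:
    """True if the last `window` plans alternate between exactly two values (A, B, A)."""
    recent = list(plans[-window:])
    if len(recent) < window or len(set(recent)) != 2:
        return False
    # Alternation means no two neighbouring plans are equal.
    return all(recent[i] != recent[i + 1] for i in range(len(recent) - 1))
```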

Production checklist

  • Define a reliable binary or scalar evaluator tied to business outcome.
  • Implement actor loop with tool schemas and observation formatting.
  • Write reflection prompt template with length cap and specificity requirements.
  • Choose scratchpad vs vector memory keyed by task or domain.
  • Set max trials per task and escalation path to humans.
  • Summarize or prune memory when token budget exceeds threshold.
  • Log trials, reflections, and outcomes for offline review.
  • Pair with self-refine on generated code or config before execution.
  • Version playbook memory when tools or APIs change.
  • Benchmark success rate, trials-to-success, and token cost on held-out tasks.
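
The benchmark metrics in the last item can be computed from a simple per-task log. The `TaskRun` record and `summarize_runs` helper are hypothetical, the sample numbers are made up for illustration, and the sketch assumes a non-empty list of runs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class TaskRun:
    trials_to_success: Optional[int]   # None if the task never succeeded
    tokens: int                        # total tokens across all trials and reflections


def summarize_runs(runs: List[TaskRun]) -> dict:
    """Aggregate held-out results: success rate, trials-to-success, token cost."""
    solved = [r for r in runs if r.trials_to_success is not None]
    first_try = [r for r in solved if r.trials_to_success == 1]
    return {
        "success_rate": len(solved) / len(runs),
        "first_try_rate": len(first_try) / len(runs),
        "mean_trials_to_success": mean(r.trials_to_success for r in solved) if solved else None,
        # Tokens spent on failed tasks are charged to the successes.
        "tokens_per_success": sum(r.tokens for r in runs) / len(solved) if solved else None,
    }


# Made-up sample data for illustration only.
report = summarize_runs([TaskRun(1, 1000), TaskRun(3, 4000), TaskRun(None, 5000), TaskRun(1, 2000)])
```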

Key takeaways

  • Reflexion is verbal RL at inference time — agents learn from failures by writing and reusing short self-reflections, not gradient updates.
  • Episodic memory is the product feature: distill lessons, cap size, and key memory to task families so reflections generalize.
  • Harbor Support raised first-try deploy success from 38% to 71% with binary smoke-test evaluators and per-playbook scratchpads.
  • Reflexion (cross-episode strategy) complements self-refine (in-episode artifact polish) and RLHF (offline preference training). Use the right layer for the failure mode.
  • Ground reflections in evaluator output; unanchored self-critique invents false lessons that haunt future trials.
