
LLM plan-and-execute explained

Harbor Legal's acquisition due-diligence agent was drowning in context. A single ReAct loop tried to reason about indemnity caps, IP assignments, and change-of-control clauses in one rolling transcript — tool outputs piled up, the model forgot which exhibits it had already parsed, and partners lost trust in the summary memo. Refactoring to plan-and-execute (popularized by LangChain's Plan-and-Execute agent and related research) split roles: a planner LLM drafts a numbered checklist of subtasks (“retrieve MSA section 8”, “compare liability caps across schedules”, “flag missing reps”), and a leaner executor runs one step at a time with fresh context. When retrieval returned an unexpected amendment, a replanning pass inserted a new step instead of derailing the whole run. Mean time to partner-ready memo dropped from 47 minutes to 19, with fewer hallucinated cross-references.

Plan-and-execute is an agent control pattern: separate strategic decomposition from tactical tool use. The planner sees the user goal and available capabilities; the executor sees only the current step plus minimal state. This guide covers planner and executor prompts, structured plan formats, replanning triggers, pairing with agentic RAG and tool-using agents, the Harbor Legal refactor, a decision table comparing the pattern with ReAct, Tree of Thoughts, and other approaches, common pitfalls, and a production checklist.

What plan-and-execute is

In a monolithic ReAct agent, the same model alternates “thought” and “action” tokens until the task completes. That works for short tool chains but scales poorly: every observation stays in context, planning noise mixes with execution errors, and long horizons tempt the model to skip steps.

Plan-and-execute introduces explicit phases:

Plan — given the objective and tool catalog, the planner outputs an ordered list of steps. Each step is atomic enough for one executor invocation.
Execute — for step k, the executor receives the step text, prior step summaries (not full raw logs), and tool schemas. It calls tools until the step is marked done.
Aggregate — step outputs compress into a running state object (facts extracted, files touched, open questions).
Replan (optional) — if a step fails, returns empty, or surfaces new information, the planner revises remaining steps without restarting from scratch.
Synthesize — a final pass turns accumulated state into the user-facing answer or artifact.

The pattern mirrors classical AI planning (STRIPS, HTN) but plans are natural language or JSON step lists rather than formal operators. You can use one model for both roles with different system prompts, or a larger model to plan and a smaller/faster model to execute.
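
The five phases above can be sketched as a single control loop. This is a minimal illustration, not a reference implementation: the function name run_plan_and_execute, the callable signatures, and the demo stubs are all assumptions made for this example, and a real system would call an LLM where the stubs return canned strings.

```python
from typing import Callable

def run_plan_and_execute(
    goal: str,
    plan: Callable[[str], list[str]],
    execute: Callable[[str, list[str]], str],
    replan: Callable[[str, list[str], list[str]], list[str]],
    synthesize: Callable[[str, list[str]], str],
    max_replans: int = 2,
) -> str:
    """Plan, execute one step at a time, aggregate, optionally replan, synthesize."""
    remaining = plan(goal)                 # Plan
    state: list[str] = []                  # Aggregate: short summaries, not raw logs
    replans = 0
    while remaining:
        step = remaining.pop(0)
        result = execute(step, state)      # Execute: sees the step plus summaries
        if not result and replans < max_replans:
            replans += 1
            # Replan: revise only the pending steps; completed state is kept.
            remaining = replan(goal, state, [step] + remaining)
            continue
        state.append(f"{step}: {result or 'no result'}")
    return synthesize(goal, state)         # Synthesize

# Hypothetical stubs standing in for LLM calls.
def demo_plan(goal: str) -> list[str]:
    return ["retrieve MSA section 8", "compare liability caps"]

def demo_execute(step: str, state: list[str]) -> str:
    # Pretend the comparison comes back empty until the amendment is fetched.
    if step == "compare liability caps" and not any("amendment" in s for s in state):
        return ""
    return "ok"

def demo_replan(goal: str, state: list[str], pending: list[str]) -> list[str]:
    return ["retrieve amendment"] + pending

def demo_synthesize(goal: str, state: list[str]) -> str:
    return " | ".join(state)

memo = run_plan_and_execute(
    "review deal", demo_plan, demo_execute, demo_replan, demo_synthesize
)
```

In this run the empty comparison triggers one replan, which inserts the amendment step ahead of the retried comparison rather than restarting the run.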

Planner design

Inputs the planner needs

User goal and success criteria (what “done” looks like).
Tool names and one-line descriptions — not full JSON schemas unless steps reference specific parameters.
Hard constraints: budget caps, forbidden actions, compliance rules.
Prior plan and completed steps when replanning.

Plan format

Structured output improves reliability. Common choices:

JSON array of { "id": 1, "step": "...", "depends_on": [] } — easy to parse and mark complete.
Markdown checklist — human-readable in logs; pair with regex or a small parser.
DAG steps — when steps parallelize (independent document reviews), declare dependencies so the runtime can fan out.

Good steps are verb-first and verifiable: “Search CRM for Acme contracts signed after 2024-01-01” beats “Look into Acme.” The planner should not embed final answers — only actions the executor can perform.
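
As a rough sketch of the JSON and DAG options above, the following parses a step list in the { "id", "step", "depends_on" } shape, rejects malformed plans, and groups step IDs into batches that could run in parallel. The helper names parse_plan and ready_batches and the sample plan are assumptions for illustration only.

```python
import json

def parse_plan(raw: str) -> list[dict]:
    """Parse a JSON step list and check IDs and dependency references."""
    steps = json.loads(raw)
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must be a non-empty JSON array")
    ids = set()
    for s in steps:
        if not isinstance(s, dict) or not {"id", "step", "depends_on"} <= s.keys():
            raise ValueError(f"malformed step: {s!r}")
        if s["id"] in ids:
            raise ValueError(f"duplicate step id: {s['id']}")
        ids.add(s["id"])
    for s in steps:
        unknown = set(s["depends_on"]) - ids
        if unknown:
            raise ValueError(f"step {s['id']} depends on unknown ids {sorted(unknown)}")
    return steps

def ready_batches(steps: list[dict]) -> list[list[int]]:
    """Group step IDs into batches whose dependencies are already complete."""
    pending = {s["id"]: set(s["depends_on"]) for s in steps}
    done: set = set()
    batches = []
    while pending:
        ready = sorted(i for i, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle in plan")
        batches.append(ready)
        done.update(ready)
        for i in ready:
            del pending[i]
    return batches

RAW_PLAN = """[
  {"id": 1, "step": "Retrieve MSA section 8", "depends_on": []},
  {"id": 2, "step": "Retrieve employment agreements", "depends_on": []},
  {"id": 3, "step": "Compare liability caps across schedules", "depends_on": [1, 2]}
]"""

batches = ready_batches(parse_plan(RAW_PLAN))  # [[1, 2], [3]]
```

Steps 1 and 2 have no dependencies, so a runtime could fan them out together before running step 3.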

Planning depth

Full upfront planning suits stable environments (batch analytics, document pipelines). Rolling horizon planning generates the next 3–5 steps, executes, then replans — better for web navigation or live APIs where state changes unpredictably. Harbor Legal uses rolling horizons of four steps with mandatory replan after any empty retrieval.
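
A rolling horizon with a mandatory replan on empty results might look like the sketch below. The function run_rolling_horizon, its parameters, and the scripted demo planner are hypothetical; the four-step default only mirrors the horizon described above.

```python
from typing import Callable

def run_rolling_horizon(
    goal: str,
    plan_next: Callable[[str, list[str], int], list[str]],
    execute: Callable[[str], str],
    horizon: int = 4,
    max_steps: int = 20,
) -> list[str]:
    """Plan a short window, execute it, then replan from the updated log."""
    log: list[str] = []
    while len(log) < max_steps:
        window = plan_next(goal, log, horizon)[:horizon]
        if not window:
            break  # planner has nothing left to propose
        for step in window:
            if len(log) >= max_steps:
                break
            result = execute(step)
            log.append(f"{step} -> {result or 'EMPTY'}")
            if not result:
                break  # empty result: drop the rest of the window and replan
    return log

# Hypothetical stand-ins for the planner and executor.
SCRIPT = [
    "index data room", "retrieve MSA", "retrieve amendment",
    "extract caps", "compare caps", "draft findings",
]
planner_calls = []

def demo_plan_next(goal: str, log: list[str], horizon: int) -> list[str]:
    planner_calls.append(len(log))
    return SCRIPT[len(log):len(log) + horizon]

def demo_execute(step: str) -> str:
    return "" if step == "retrieve amendment" else "ok"

log = run_rolling_horizon("review deal", demo_plan_next, demo_execute)
```

The stub planner simply moves on after the empty step; a real planner would use the logged EMPTY outcome to broaden the search or insert a recovery step.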

Executor design

The executor is a standard tool-using agent scoped to a single step. Keep its system prompt narrow: “Complete only this step; do not solve the full user request.”

Context packaging

Current step text and step ID.
State summary — bullet list of facts from completed steps (not full tool transcripts). Summarize with a cheap model or template.
Relevant retrieved chunks for this step only when using RAG.
Remaining plan (optional) so the executor avoids work that belongs to later steps.
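
One way to assemble that context is a small prompt builder that takes only summaries, step-scoped chunks, and the optional remaining plan. The function build_executor_prompt, its layout, and the sample facts are illustrative assumptions, not a prescribed template.

```python
def build_executor_prompt(step_id, step_text, facts, chunks=(), remaining=(), max_facts=10):
    """Package the executor context for one step: step, summaries, chunks, later steps."""
    lines = [
        "Complete only this step; do not solve the full user request.",
        f"Current step {step_id}: {step_text}",
        "Facts from completed steps:",
    ]
    recent = list(facts)[-max_facts:]  # cap the summary so context stays lean
    lines += [f"- {fact}" for fact in recent] if recent else ["- (none yet)"]
    if chunks:
        lines.append("Retrieved context for this step:")
        lines += [f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1)]
    if remaining:
        lines.append("Later steps (do not perform these):")
        lines += [f"- {s}" for s in remaining]
    return "\n".join(lines)

# Made-up facts and steps for illustration.
prompt = build_executor_prompt(
    3,
    "compare liability caps across schedules",
    facts=["MSA section 8 retrieved", "Amendment listed in data-room index"],
    remaining=["flag missing reps"],
)
```

Raw tool transcripts never enter this function, which keeps the forwarding of full logs from creeping back in.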

Step completion signals

Define explicit done conditions: tool returned rows, validator passed, executor emits a structured step_result block. Without a stop rule, executors drift into full-task completion and duplicate later steps.
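
A structured completion signal can be checked mechanically. The sketch below assumes the executor wraps a JSON object in step_result tags with a status of "done" or "blocked"; that tag format and the field names are assumptions made for this example, not a standard.

```python
import json
import re
from typing import Optional

# Assumed convention: the executor ends its turn with
# <step_result>{"status": "...", "summary": "..."}</step_result>
_STEP_RESULT = re.compile(r"<step_result>(.*?)</step_result>", re.DOTALL)

def parse_step_result(text: str) -> Optional[dict]:
    """Return the step_result payload, or None if the step is not finished."""
    match = _STEP_RESULT.search(text)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("status") not in {"done", "blocked"}:
        return None
    return data

reply = (
    "Searched the index.\n"
    '<step_result>{"status": "done", "summary": "2 clauses found"}</step_result>'
)
result = parse_step_result(reply)
```

The runtime keeps looping on the current step while this returns None and moves on, or replans, once a valid block appears.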

Failure handling

Classify failures before replanning: transient (retry with backoff), data missing (replan with broader search), tool error (escalate to human). Pair with Reflexion at the executor level for repeated tool mistakes within a step; replan at the planner level when the strategy itself is wrong.
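
The three failure classes can be mapped to actions with a small classifier. In this sketch the Action enum, the choice of TimeoutError and ConnectionError as the transient errors, and the backoff base are all assumptions; a real deployment would map its own tool exceptions.

```python
from enum import Enum
from typing import Optional

class Action(Enum):
    RETRY = "retry with backoff"
    REPLAN = "replan with broader search"
    ESCALATE = "escalate to human"

# Assumed to be transient; adjust to the errors your tools actually raise.
TRANSIENT = (TimeoutError, ConnectionError)

def classify_failure(error: Optional[Exception], rows_returned: int) -> Optional[Action]:
    """Map a step outcome to a recovery action; None means the step succeeded."""
    if error is None:
        # No exception but nothing found: the data is missing, so replan.
        return Action.REPLAN if rows_returned == 0 else None
    if isinstance(error, TRANSIENT):
        return Action.RETRY
    return Action.ESCALATE

def backoff_delays(attempts: int, base: float = 0.5) -> list[float]:
    """Exponential backoff schedule in seconds for transient retries."""
    return [base * 2 ** i for i in range(attempts)]
```

For example, classify_failure(TimeoutError(), 0) yields Action.RETRY, while an empty result with no error yields Action.REPLAN.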

Replanning and state management

Replanning is the main advantage over a frozen script. Trigger replans when:

A step returns empty or contradictory results.
The user injects a mid-run correction.
The estimated cost of the remaining steps exceeds a token or latency budget.
An executor reports blocked with a reason the planner can address.

The replanner prompt includes the original goal, completed steps with their outcomes, the failed step, and the remaining old plan. Instruct it to preserve completed work and to change only what is pending: insert new steps or remove obsolete ones, but never rewrite history.
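
A minimal sketch of both halves of a replan follows: building the replanner prompt from the four inputs listed above, and merging the revised steps so completed work is untouched. The names build_replan_prompt and apply_replan, the prompt wording, and the sample plan are hypothetical.

```python
def build_replan_prompt(goal, completed, failed_step, remaining):
    """completed is a list of (step_text, outcome) pairs."""
    done = "\n".join(f"- {step} => {outcome}" for step, outcome in completed) or "- (none)"
    todo = "\n".join(f"- {step}" for step in remaining) or "- (none)"
    return (
        f"Goal: {goal}\n"
        f"Completed steps (do not change):\n{done}\n"
        f"Failed step: {failed_step}\n"
        f"Remaining old plan:\n{todo}\n"
        "Return only the revised remaining steps."
    )

def apply_replan(plan, completed_ids, revised_steps):
    """Keep completed steps as-is; replace pending steps with the revised list."""
    kept = [s for s in plan if s["id"] in completed_ids]
    # Fresh IDs start above every old ID so logs never reuse a step number.
    next_id = max((s["id"] for s in plan), default=0) + 1
    added = [
        {"id": next_id + i, "step": text, "depends_on": []}
        for i, text in enumerate(revised_steps)
    ]
    return kept + added

old_plan = [
    {"id": 1, "step": "retrieve MSA section 8", "depends_on": []},
    {"id": 2, "step": "compare liability caps across schedules", "depends_on": [1]},
    {"id": 3, "step": "flag missing reps", "depends_on": [2]},
]
new_plan = apply_replan(
    old_plan,
    completed_ids={1},
    revised_steps=["retrieve amendment", "compare liability caps across schedules",
                   "flag missing reps"],
)
```

Because the merge only ever appends after the completed steps, a diff between old_plan and new_plan shows exactly what the replanner changed.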

Maintain a state store outside the chat transcript: key-value facts, citation IDs, file paths. Executors write to the store; the final synthesizer reads from it. This prevents “lost in the middle” when step 9 needs a date extracted in step 2.
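
An in-memory version of such a store is sketched below. The StateStore class, its method names, and the sample fact are assumptions for illustration; a production system would more likely back this with a database table.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    value: str
    step_id: int                       # which step produced the fact
    citation_id: Optional[str] = None  # pointer back to the source document

class StateStore:
    """Key-value facts kept outside the chat transcript."""

    def __init__(self) -> None:
        self._facts: dict[str, Fact] = {}

    def write(self, key: str, value: str, step_id: int,
              citation_id: Optional[str] = None) -> None:
        self._facts[key] = Fact(value, step_id, citation_id)

    def read(self, key: str) -> Optional[Fact]:
        return self._facts.get(key)

    def summary(self) -> list[str]:
        """Compact bullet lines suitable for an executor or synthesizer prompt."""
        return [f"{k}: {f.value} (step {f.step_id})" for k, f in self._facts.items()]

store = StateStore()
# Made-up example: step 2 extracts a date that step 9 later needs.
store.write("signing_date", "2024-03-01", step_id=2, citation_id="c-7")
fact = store.read("signing_date")
```

Step 9 looks the date up by key instead of hoping it survived in a long transcript, and the citation ID travels with the value.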

Harbor Legal due-diligence refactor

Before plan-and-execute, Harbor's M&A agent used one ReAct thread per deal room. Problems: 120k-token contexts, duplicated clause lookups, and summaries that cited the wrong exhibit number.

After refactor:

Planner (GPT-4 class) ingests the deal brief and data-room index; outputs JSON steps per document class (MSA, employment, IP).
Executor (smaller model) runs vector search + clause extractor per step; writes structured fields to a deal state table.
Replan fires when the index lists an amendment not in the initial plan.
Synthesizer generates the partner memo from state only — every bullet links to stored citation IDs.

Human reviewers reported 40% fewer “wrong section” flags. Token spend fell because executors no longer re-read the entire thread each step.

Technique decision table

Approach	Best when	Skip when
Plan-and-execute	Multi-step workflows with clear subgoals; long tool chains; need audit trail of steps	Single-tool lookup; sub-second latency requirements
ReAct (interleaved)	Short horizons; highly dynamic environments where every action should inform the very next thought	10+ step pipelines where context collapse causes omissions
Tree of Thoughts	Reasoning puzzles with parallel hypotheses to score and prune	Production tool orchestration with side effects
Agentic RAG loop only	Answer quality hinges on retrieval iterations, not procedural decomposition	Workflow spans CRM + docs + calendar with ordering dependencies
Fixed DAG / workflow engine	Stable SOPs with rare exceptions; compliance needs deterministic paths	User goals vary too much to hard-code steps
Human-written playbooks	Regulated procedures with legal sign-off on exact steps	Long tail of deal structures or ticket types

Common pitfalls

Plans that are really answers — planner leaks conclusions instead of actions. Enforce “no final recommendation in plan steps.”
Oversized steps — “Analyze entire data room” is one ReAct thread in disguise. Split until each step fits one executor context window.
Full log forwarding — passing raw tool JSON from every prior step recreates context bloat. Summarize aggressively.
No replan budget — infinite replan loops on impossible tasks. Cap replans and escalate.
Planner without tool awareness — steps reference tools that do not exist or omit required auth. Feed accurate tool metadata.
Skipping synthesis — dumping step outputs to the user without a final coherence pass produces fragmented answers.
Same model, same prompt — planner and executor need distinct instructions or the executor re-plans instead of executing.

Production checklist

Define success criteria and forbidden actions before planning.
Publish a tool catalog with names, descriptions, and side-effect flags.
Choose plan format (JSON steps recommended) and validate with a schema.
Implement executor scoped to one step with explicit completion signal.
Maintain external state store for facts and citations across steps.
Summarize completed steps before each new executor call.
Wire replan triggers for empty results, failures, and user corrections.
Cap total steps, replans, and token budget per user request.
Add final synthesizer that reads state store, not raw logs.
Log plans, step outcomes, and replan diffs for debugging and compliance.
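
The caps on steps, replans, and tokens can be enforced with one small guard object. This is a sketch under assumed names (RunBudget, BudgetExceeded) and arbitrary default limits; pick limits from your own cost and latency targets.

```python
from dataclasses import dataclass

class BudgetExceeded(RuntimeError):
    """Raised when a run should stop and escalate instead of continuing."""

@dataclass
class RunBudget:
    max_steps: int = 30        # illustrative defaults, not recommendations
    max_replans: int = 3
    max_tokens: int = 200_000
    steps: int = 0
    replans: int = 0
    tokens: int = 0

    def charge(self, *, steps: int = 0, replans: int = 0, tokens: int = 0) -> None:
        """Record usage and raise as soon as any cap is exceeded."""
        self.steps += steps
        self.replans += replans
        self.tokens += tokens
        for name, used, cap in (
            ("step", self.steps, self.max_steps),
            ("replan", self.replans, self.max_replans),
            ("token", self.tokens, self.max_tokens),
        ):
            if used > cap:
                raise BudgetExceeded(f"{name} budget exceeded: {used} > {cap}")

budget = RunBudget(max_replans=1)
budget.charge(steps=1, tokens=1_200)
budget.charge(replans=1)  # first replan is within budget
```

Calling budget.charge(replans=1) a second time would raise BudgetExceeded, which the runtime can catch to stop the loop and hand the request to a human.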

Key takeaways

Plan-and-execute splits strategy from tactics — planners draft verifiable steps, executors run one at a time with lean context.
Replanning is the resilience layer — insert, skip, or rewrite remaining steps when the world does not match the initial plan.
Harbor Legal cut due-diligence memo time from 47 to 19 minutes by replacing one growing ReAct transcript with planned steps whose outputs are summarized into a state store.
Use rolling horizons for dynamic environments; full upfront plans for stable batch pipelines.
Pair with agentic RAG inside individual steps and Reflexion for repeated executor mistakes — not as a substitute for planner-level replanning.