Guide
LLM plan-and-execute explained
Harbor Legal's acquisition due-diligence agent was drowning in context. A single ReAct loop tried to reason about indemnity caps, IP assignments, and change-of-control clauses in one rolling transcript — tool outputs piled up, the model forgot which exhibits it had already parsed, and partners lost trust in the summary memo. Refactoring to plan-and-execute (popularized by LangChain's Plan-and-Execute agent and related research) split roles: a planner LLM drafts a numbered checklist of subtasks (“retrieve MSA section 8”, “compare liability caps across schedules”, “flag missing reps”), and a leaner executor runs one step at a time with fresh context. When retrieval returned an unexpected amendment, a replanning pass inserted a new step instead of derailing the whole run. Mean time to partner-ready memo dropped from 47 minutes to 19, with fewer hallucinated cross-references.
Plan-and-execute is an agent control pattern that separates strategic decomposition from tactical tool use. The planner sees the user goal and available capabilities; the executor sees only the current step plus minimal state. This guide covers planner and executor prompts, structured plan formats, replanning triggers, pairing with agentic RAG and tool-using agents, the Harbor Legal refactor, a decision table comparing plan-and-execute with ReAct, Tree of Thoughts, and other approaches, common pitfalls, and a production checklist.
What plan-and-execute is
In a monolithic ReAct agent, the same model alternates “thought” and “action” tokens until the task completes. That works for short tool chains but scales poorly: every observation stays in context, planning noise mixes with execution errors, and long horizons tempt the model to skip steps or stop early.
Plan-and-execute introduces explicit phases:
- Plan — given the objective and tool catalog, the planner outputs an ordered list of steps. Each step is atomic enough for one executor invocation.
- Execute — for step k, the executor receives the step text, prior step summaries (not full raw logs), and tool schemas. It calls tools until the step is marked done.
- Aggregate — step outputs compress into a running state object (facts extracted, files touched, open questions).
- Replan (optional) — if a step fails, returns empty, or surfaces new information, the planner revises remaining steps without restarting from scratch.
- Synthesize — a final pass turns accumulated state into the user-facing answer or artifact.
The pattern mirrors classical AI planning (STRIPS, HTN) but plans are natural language or JSON step lists rather than formal operators. You can use one model for both roles with different system prompts, or a larger model to plan and a smaller/faster model to execute.
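The five phases can be sketched as a small control loop. This is a minimal illustration, not a reference implementation: `run_plan_and_execute`, `RunState`, and the `demo_*` callables are hypothetical names, and the planner, executor, and synthesizer are plain Python functions standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunState:
    facts: List[str] = field(default_factory=list)      # compressed step outputs
    completed: List[str] = field(default_factory=list)  # steps already finished


def run_plan_and_execute(
    goal: str,
    plan: Callable[[str, RunState], List[str]],
    execute: Callable[[str, RunState], Optional[str]],
    synthesize: Callable[[str, RunState], str],
    max_replans: int = 2,
) -> str:
    state = RunState()
    steps = plan(goal, state)                  # Plan
    replans = 0
    while steps:
        step = steps.pop(0)
        result = execute(step, state)          # Execute exactly one step
        if result is not None:
            state.facts.append(result)         # Aggregate into running state
            state.completed.append(step)
        elif replans < max_replans:
            replans += 1
            steps = plan(goal, state)          # Replan only the remaining work
        else:
            state.facts.append(f"UNRESOLVED: {step}")  # replan cap hit; record and move on
    return synthesize(goal, state)             # Synthesize from state, not raw logs


# Stand-ins for LLM calls (assumptions for illustration only).
def demo_plan(goal: str, state: RunState) -> List[str]:
    all_steps = ["retrieve MSA section 8", "compare liability caps", "flag missing reps"]
    return [s for s in all_steps if s not in state.completed]


def demo_execute(step: str, state: RunState) -> Optional[str]:
    return f"done: {step}"


def demo_synthesize(goal: str, state: RunState) -> str:
    return " | ".join(state.facts)


print(run_plan_and_execute("review deal", demo_plan, demo_execute, demo_synthesize))
```

Returning `None` from the executor is the failure signal in this sketch; a real system would likely carry a richer status object.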
Planner design
Inputs the planner needs
- User goal and success criteria (what “done” looks like).
- Tool names and one-line descriptions — not full JSON schemas unless steps reference specific parameters.
- Hard constraints: budget caps, forbidden actions, compliance rules.
- Prior plan and completed steps when replanning.
Plan format
Structured output improves reliability. Common choices:
- JSON array of `{ "id": 1, "step": "...", "depends_on": [] }` objects: easy to parse and mark complete.
- Markdown checklist: human-readable in logs; pair with regex or a small parser.
- DAG steps: when steps parallelize (independent document reviews), declare dependencies so the runtime can fan out.
Good steps are verb-first and verifiable: “Search CRM for Acme contracts signed after 2024-01-01” beats “Look into Acme.” The planner should not embed final answers — only actions the executor can perform.
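One way to make the JSON format dependable is to validate the planner's output before any step runs. The sketch below assumes the `id` / `step` / `depends_on` shape shown above; `parse_plan` and `ready_steps` are hypothetical helpers, and requiring dependencies to point at earlier steps is a simplifying assumption that keeps the plan acyclic.

```python
import json
from typing import Dict, List, Set


def parse_plan(raw: str) -> List[Dict]:
    """Parse planner output and reject malformed or out-of-order plans."""
    steps = json.loads(raw)
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must be a non-empty JSON array")
    seen: Set[int] = set()
    for step in steps:
        if not isinstance(step, dict):
            raise ValueError("each plan entry must be a JSON object")
        missing = {"id", "step", "depends_on"} - step.keys()
        if missing:
            raise ValueError(f"step is missing fields: {sorted(missing)}")
        if step["id"] in seen:
            raise ValueError(f"duplicate step id: {step['id']}")
        unknown = set(step["depends_on"]) - seen
        if unknown:
            # Dependencies may only point at earlier steps (assumption).
            raise ValueError(f"step {step['id']} depends on unknown steps: {sorted(unknown)}")
        seen.add(step["id"])
    return steps


def ready_steps(steps: List[Dict], done: Set[int]) -> List[Dict]:
    """Steps whose dependencies are all complete; these can fan out in parallel."""
    return [s for s in steps if s["id"] not in done and set(s["depends_on"]) <= done]


raw_plan = """[
  {"id": 1, "step": "Retrieve MSA section 8", "depends_on": []},
  {"id": 2, "step": "Retrieve liability schedules", "depends_on": []},
  {"id": 3, "step": "Compare liability caps across schedules", "depends_on": [1, 2]}
]"""
plan = parse_plan(raw_plan)
print([s["id"] for s in ready_steps(plan, done=set())])
```

With nothing completed, steps 1 and 2 are ready together; step 3 becomes ready only once both are done.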
Planning depth
Full upfront planning suits stable environments (batch analytics, document pipelines). Rolling horizon planning generates the next 3–5 steps, executes, then replans — better for web navigation or live APIs where state changes unpredictably. Harbor Legal uses rolling horizons of four steps with mandatory replan after any empty retrieval.
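A rolling horizon can be sketched as a loop that asks the planner for a short window, executes it, and replans early when a step comes back empty. `run_rolling` and its callables are hypothetical; the default window of four steps and the replan-on-empty rule follow the Harbor Legal description above, and `max_rounds` is an added safety cap.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (step text, result) pairs


def run_rolling(
    goal: str,
    plan_next: Callable[[str, History], List[str]],
    execute: Callable[[str], str],
    horizon: int = 4,
    max_rounds: int = 10,
) -> History:
    history: History = []
    for _ in range(max_rounds):
        window = plan_next(goal, history)[:horizon]  # plan only the next few steps
        if not window:
            break                                    # planner reports nothing left to do
        for step in window:
            result = execute(step)
            history.append((step, result))
            if not result:
                break                                # empty result: drop the window and replan
    return history
```

The planner sees the full history on every round, so it can insert a broader search or drop steps that the empty result made pointless.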
Executor design
The executor is a standard tool-using agent scoped to a single step. Keep its system prompt narrow: “Complete only this step; do not solve the full user request.”
Context packaging
- Current step text and step ID.
- State summary — bullet list of facts from completed steps (not full tool transcripts). Summarize with a cheap model or template.
- Relevant retrieved chunks for this step only when using RAG.
- Remaining plan (optional) so the executor avoids work that belongs to later steps.
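The packaging rules above can be sketched as a function that assembles the executor's prompt from the four inputs. `build_executor_context` is a hypothetical helper, the section labels are arbitrary, and `max_facts` is an assumed cap that keeps the state summary short.

```python
from typing import Sequence


def build_executor_context(
    step_id: int,
    step_text: str,
    facts: Sequence[str],
    chunks: Sequence[str] = (),
    remaining: Sequence[str] = (),
    max_facts: int = 8,
) -> str:
    lines = [
        "Complete only this step; do not solve the full user request.",
        f"Current step {step_id}: {step_text}",
        "Facts from completed steps:",
    ]
    recent = list(facts)[-max_facts:]          # summaries only, never raw tool transcripts
    lines += [f"- {fact}" for fact in recent] or ["- (none yet)"]
    if chunks:                                 # retrieval scoped to this step
        lines.append("Retrieved passages for this step:")
        lines += [f"- {chunk}" for chunk in chunks]
    if remaining:                              # optional: shows what belongs to later steps
        lines.append("Later steps (do not perform):")
        lines += [f"- {later}" for later in remaining]
    return "\n".join(lines)


print(build_executor_context(
    2,
    "Compare liability caps across schedules",
    facts=["MSA section 8 retrieved"],
    remaining=["Flag missing reps"],
))
```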
Step completion signals
Define explicit done conditions: tool returned rows, validator passed, executor emits a structured `step_result` block. Without a stop rule, executors drift into full-task completion and duplicate later steps.
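A structured completion signal is easiest to enforce when the runtime parses it rather than trusting free text. The wire format here is an assumption: a JSON object wrapped in `<step_result>` tags with `step_id` and `status` fields. `parse_step_result` is a hypothetical helper.

```python
import json
import re
from typing import Dict, Optional

STEP_RESULT = re.compile(r"<step_result>(.*?)</step_result>", re.DOTALL)
VALID_STATUSES = {"done", "blocked"}


def parse_step_result(text: str, expected_id: int) -> Optional[Dict]:
    """Return the parsed result, or None if the executor gave no valid stop signal."""
    match = STEP_RESULT.search(text)
    if match is None:
        return None                            # no block: the step is not finished
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None                            # malformed block counts as no signal
    if not isinstance(payload, dict):
        return None
    if payload.get("step_id") != expected_id:
        return None                            # executor drifted onto another step
    if payload.get("status") not in VALID_STATUSES:
        return None
    return payload


reply = 'Found the clause. <step_result>{"step_id": 3, "status": "done", "summary": "cap located"}</step_result>'
print(parse_step_result(reply, expected_id=3))
```

Checking `step_id` against the step that was dispatched catches an executor that wandered into work belonging to a later step.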
Failure handling
Classify failures before replanning: transient (retry with backoff), data missing (replan with broader search), tool error (escalate to human). Pair with Reflexion at the executor level for repeated tool mistakes within a step; replan at the planner level when the strategy itself is wrong.
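The three failure classes map naturally onto a small routing function. In this sketch the mapping from Python exception types to classes is an assumption chosen for illustration (real tools usually need their own error taxonomy), and `classify` and `next_action` are hypothetical names.

```python
from enum import Enum
from typing import Tuple


class Failure(Enum):
    TRANSIENT = "transient"
    DATA_MISSING = "data_missing"
    TOOL_ERROR = "tool_error"


def classify(exc: Exception) -> Failure:
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Failure.TRANSIENT               # network blips and timeouts
    if isinstance(exc, LookupError):
        return Failure.DATA_MISSING            # KeyError / IndexError: nothing found
    return Failure.TOOL_ERROR                  # anything else is treated as a tool fault


def next_action(
    failure: Failure, attempt: int, max_retries: int = 3, base_delay: float = 1.0
) -> Tuple[str, float]:
    """Return (action, delay_seconds) for the runtime to act on."""
    if failure is Failure.TRANSIENT and attempt < max_retries:
        return ("retry", base_delay * 2 ** attempt)   # exponential backoff
    if failure is Failure.DATA_MISSING:
        return ("replan", 0.0)                        # ask the planner for a broader search
    return ("escalate", 0.0)                          # tool errors and exhausted retries go to a human


print(next_action(classify(TimeoutError()), attempt=2))
```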
Replanning and state management
Replanning is the main advantage over a frozen script. Trigger replans when:
- A step returns empty or contradictory results.
- The user injects a mid-run correction.
- The estimated cost of the remaining steps exceeds a token or latency budget.
- An executor reports `blocked` with a reason the planner can address.
The replanner prompt includes: original goal, completed steps with outcomes, failed step, and remaining old plan. Instruct it to preserve completed work, insert new steps, or remove obsolete steps — not to rewrite history.
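The "do not rewrite history" rule can also be enforced in code when the replanner's output is merged back into the plan. This sketch assumes the replanner returns step texts for the remaining work only; `apply_replan` is a hypothetical helper, and matching old pending steps by exact text is a simplification.

```python
from typing import Dict, List, Set


def apply_replan(old_plan: List[Dict], completed_ids: Set[int], proposed_texts: List[str]) -> List[Dict]:
    """Merge a replan: completed steps stay untouched, remaining steps follow the new order."""
    kept = [s for s in old_plan if s["id"] in completed_ids]          # history is preserved as-is
    pending_by_text = {s["step"]: s for s in old_plan if s["id"] not in completed_ids}
    next_id = max((s["id"] for s in old_plan), default=0) + 1
    merged = list(kept)
    for text in proposed_texts:
        if text in pending_by_text:
            merged.append(pending_by_text[text])                      # reuse the old step and its id
        else:
            merged.append({"id": next_id, "step": text, "depends_on": []})  # inserted step
            next_id += 1
    return merged                                                      # old pending steps not proposed are dropped


old_plan = [
    {"id": 1, "step": "Retrieve MSA section 8", "depends_on": []},
    {"id": 2, "step": "Compare liability caps across schedules", "depends_on": [1]},
    {"id": 3, "step": "Flag missing reps", "depends_on": [2]},
]
new_plan = apply_replan(old_plan, {1}, ["Review newly found amendment", "Compare liability caps across schedules"])
print([s["id"] for s in new_plan])
```

New steps get fresh ids rather than reusing old ones, which keeps logged replan diffs unambiguous.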
Maintain a state store outside the chat transcript: key-value facts, citation IDs, file paths. Executors write to the store; the final synthesizer reads from it. This prevents “lost in the middle” when step 9 needs a date extracted in step 2.
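A state store can start as a keyed dictionary that records which step wrote each fact and which citation backs it. The sketch below is an in-memory stand-in (a production system would more likely use a database table); `StateStore`, `Fact`, and the sample key and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Fact:
    value: str
    step_id: int                       # which step produced the fact
    citation_id: Optional[str] = None  # pointer back to the source document


class StateStore:
    def __init__(self) -> None:
        self._facts: Dict[str, Fact] = {}

    def write(self, key: str, value: str, step_id: int, citation_id: Optional[str] = None) -> None:
        self._facts[key] = Fact(value, step_id, citation_id)

    def read(self, key: str) -> Optional[Fact]:
        return self._facts.get(key)

    def summary(self) -> List[str]:
        """Bullet-ready lines for executor context or the final synthesizer."""
        return [
            f"{key}: {fact.value}" + (f" [{fact.citation_id}]" if fact.citation_id else "")
            for key, fact in self._facts.items()
        ]


store = StateStore()
store.write("msa_effective_date", "2024-03-01", step_id=2, citation_id="EX-4")  # illustrative values
print(store.summary())
```

Because step 9 reads `msa_effective_date` by key, it does not matter how many tokens of transcript separate it from step 2.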
Harbor Legal due-diligence refactor
Before plan-and-execute, Harbor's M&A agent used one ReAct thread per deal room. Problems: 120k-token contexts, duplicated clause lookups, and summaries that cited the wrong exhibit number.
After refactor:
- Planner (GPT-4 class) ingests the deal brief and data-room index; outputs JSON steps per document class (MSA, employment, IP).
- Executor (smaller model) runs vector search + clause extractor per step; writes structured fields to a deal state table.
- Replan fires when the index lists an amendment not in the initial plan.
- Synthesizer generates the partner memo from state only — every bullet links to stored citation IDs.
Human reviewers reported 40% fewer “wrong section” flags. Token spend fell because executors no longer re-read the entire thread each step.
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Plan-and-execute | Multi-step workflows with clear subgoals; long tool chains; need audit trail of steps | Single-tool lookup; sub-second latency requirements |
| ReAct (interleaved) | Short horizons; highly dynamic env where every action should inform the very next thought | 10+ step pipelines where context collapse causes omissions |
| Tree of Thoughts | Reasoning puzzles with parallel hypotheses to score and prune | Production tool orchestration with side effects |
| Agentic RAG loop only | Answer quality hinges on retrieval iterations, not procedural decomposition | Workflow spans CRM + docs + calendar with ordering dependencies |
| Fixed DAG / workflow engine | Stable SOPs with rare exceptions; compliance needs deterministic paths | User goals vary too much to hard-code steps |
| Human-written playbooks | Regulated procedures with legal sign-off on exact steps | Long tail of deal structures or ticket types |
Common pitfalls
- Plans that are really answers — planner leaks conclusions instead of actions. Enforce “no final recommendation in plan steps.”
- Oversized steps — “Analyze entire data room” is one ReAct thread in disguise. Split until each step fits one executor context window.
- Full log forwarding — passing raw tool JSON from every prior step recreates context bloat. Summarize aggressively.
- No replan budget — infinite replan loops on impossible tasks. Cap replans and escalate.
- Planner without tool awareness — steps reference tools that do not exist or omit required auth. Feed accurate tool metadata.
- Skipping synthesis — dumping step outputs to the user without a final coherence pass produces fragmented answers.
- Same model, same prompt — planner and executor need distinct instructions or the executor re-plans instead of executing.
Production checklist
- Define success criteria and forbidden actions before planning.
- Publish a tool catalog with names, descriptions, and side-effect flags.
- Choose plan format (JSON steps recommended) and validate with a schema.
- Implement executor scoped to one step with explicit completion signal.
- Maintain external state store for facts and citations across steps.
- Summarize completed steps before each new executor call.
- Wire replan triggers for empty results, failures, and user corrections.
- Cap total steps, replans, and token budget per user request.
- Add final synthesizer that reads state store, not raw logs.
- Log plans, step outcomes, and replan diffs for debugging and compliance.
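The caps in the checklist can be enforced by a single budget object that the runtime charges before each step, replan, or model call. This is a sketch under assumed defaults; `RunBudget`, `BudgetExceeded`, and the limit values are hypothetical and should be tuned per workload.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses any configured cap; the caller should escalate."""


@dataclass
class RunBudget:
    max_steps: int = 20
    max_replans: int = 3
    max_tokens: int = 200_000
    steps: int = 0
    replans: int = 0
    tokens: int = 0

    def charge(self, steps: int = 0, replans: int = 0, tokens: int = 0) -> None:
        self.steps += steps
        self.replans += replans
        self.tokens += tokens
        for name in ("steps", "replans", "tokens"):
            used, limit = getattr(self, name), getattr(self, f"max_{name}")
            if used > limit:
                raise BudgetExceeded(f"{name} budget exceeded: {used} > {limit}")


budget = RunBudget(max_replans=1)
budget.charge(steps=1, tokens=500)   # one executor call
budget.charge(replans=1)             # first replan is allowed
```

A second `budget.charge(replans=1)` would raise `BudgetExceeded`, which is the point at which the run should stop and hand off to a human.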
Key takeaways
- Plan-and-execute splits strategy from tactics — planners draft verifiable steps, executors run one at a time with lean context.
- Replanning is the resilience layer — insert, skip, or rewrite remaining steps when the world does not match the initial plan.
- Harbor Legal cut due-diligence memo time from 47 to 19 minutes by summarizing step outputs into a state store instead of one growing ReAct transcript.
- Use rolling horizons for dynamic environments; full upfront plans for stable batch pipelines.
- Pair with agentic RAG inside individual steps and Reflexion for repeated executor mistakes — not as a substitute for planner-level replanning.
Related reading
- AI agents and tool use explained — ReAct loops, tool schemas, and the executor baseline
- Agentic RAG explained — iterative retrieval inside individual plan steps
- Tree of Thoughts reasoning explained — when to search reasoning branches instead of procedural steps
- LLM Reflexion explained — cross-trial memory when executors repeat mistakes