LLM ReWOO explained

Harbor Procurement's vendor-comparison agent had to pull pricing from three SaaS catalogs, read SOC 2 attestations from a document store, and check payment terms in NetSuite — then produce a one-page recommendation memo. A standard ReAct loop interleaved reasoning with each tool result. Every observation bloated the planner's context, serial tool calls stretched p95 latency to 41 s, and the model sometimes “reasoned around” a missing price instead of scheduling another lookup. Engineers adopted ReWOO (Reasoning Without Observation): the planner drafts a complete script of tool calls and evidence placeholders before any tool runs; workers execute calls in parallel; a separate solver stitches evidence into the final answer. End-to-end latency fell to 25 s; input tokens per request dropped 44%; buyers reported fewer memos that cited the wrong vendor SKU.

ReWOO is a three-role agent pattern introduced by Xu et al. (2023) in “ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models.” The planner reasons over the user goal and tool catalog only, never over noisy JSON payloads. Workers are thin executors. The solver sees the plan skeleton plus filled evidence slots. This guide covers evidence variables, planner prompts, parallel worker pools, solver synthesis, pairing with parallel tool calling and plan-and-execute, the Harbor Procurement refactor, a technique decision table, pitfalls, and a production checklist.

The planner–worker–solver pipeline

ReWOO splits a tool-using task into three phases with clean interfaces:

Planner — given the user question and tool descriptions, output a reasoning script: natural-language steps that reference evidence variables (e.g. #E1, #E2) instead of live data, plus explicit tool invocations that will populate those variables. The planner never sees tool outputs.
Workers — parse tool calls from the plan, run them (often in parallel when independent), and map results into an evidence dictionary { "#E1": "...", "#E2": "..." }.
Solver — receive the original question, the plan text with variables, and the evidence map; produce the user-facing answer by substituting facts for placeholders and following the planned reasoning chain.
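The three phases above can be wired together in a few lines. The following is a minimal sketch, not a reference implementation: `plan`, `run_tool`, and `solve` are hypothetical callables standing in for the planner LLM call, the tool layer, and the solver LLM call, and the `(evidence_id, tool, args)` tuple shape is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# (evidence_id, tool_name, args): one planned tool call.
# This shape is an assumption; the guide does not prescribe a schema.
ToolCall = Tuple[str, str, dict]


def rewoo(
    question: str,
    plan: Callable[[str], Tuple[str, List[ToolCall]]],
    run_tool: Callable[[str, dict], str],
    solve: Callable[[str, str, Dict[str, str]], str],
) -> str:
    # Planner: sees only the question (and, in practice, tool descriptions).
    script, calls = plan(question)

    # Workers: fill evidence slots without reasoning about the user goal.
    # Sequential here for brevity; independent calls could run concurrently.
    evidence: Dict[str, str] = {}
    for evidence_id, tool, args in calls:
        evidence[evidence_id] = run_tool(tool, args)

    # Solver: the only stage where tool observations reach an LLM.
    return solve(question, script, evidence)
```

Passing the three roles in as callables keeps the interfaces explicit, so each role can be tested or swapped on its own.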

The name “without observation” means the planning LLM call is observation-free: it cannot be distracted by a 4 KB CRM payload mid-thought. Observations enter only at the solver stage, after the strategic structure is fixed. That differs from ReAct, where every observation reshapes the next thought, and from rolling plan-and-execute, where the planner may replan after each step's observation.

Evidence variables and plan format

Evidence variables are named slots the planner assigns to future tool results. A typical plan line looks like:

To compare annual cost I need list prices #E1 from catalog_search(vendor="Acme")
and contract discount tiers #E2 from netsuite_contracts(vendor_id="V-1042").
If #E1 < #E2 baseline I will flag under-discounting in the memo.

Conventions that improve parse reliability:

Unique IDs — #E1 through #En, each assigned once per plan; never carry evidence values over from one request to the next.
One tool call per evidence slot when possible — simplifies worker scheduling and failure attribution.
Inline tool syntax — either JSON blocks ({"tool":"catalog_search","args":{...},"evidence":"#E1"}) or a rigid micro-format your parser enforces.
Conditional branches as comments — the planner may write “if #E3 contains 'SOC2 Type II' then ...” but should not assume values; the solver evaluates branches after substitution.
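One way to enforce these conventions is a strict parser that runs before any worker starts. The sketch below assumes the JSON-block variant of the inline tool syntax, with one object per line; the function name `parse_plan` and the `known_tools` parameter are hypothetical.

```python
import json
import re
from typing import Dict, List, Set

EVIDENCE_ID = re.compile(r"#E[1-9]\d*")


def parse_plan(plan_text: str, known_tools: Set[str]) -> List[Dict]:
    """Extract tool-call blocks from a plan, one JSON object per line."""
    calls: List[Dict] = []
    seen: Set[str] = set()
    for line in plan_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # reasoning prose; the worker layer ignores it
        block = json.loads(line)  # raises ValueError on malformed JSON
        tool, slot = block.get("tool"), block.get("evidence")
        if tool not in known_tools:
            raise ValueError(f"unknown tool: {tool!r}")
        if not isinstance(slot, str) or not EVIDENCE_ID.fullmatch(slot):
            raise ValueError(f"bad evidence id: {slot!r}")
        if slot in seen:
            raise ValueError(f"evidence id reused: {slot}")
        seen.add(slot)
        calls.append({"tool": tool, "args": block.get("args", {}), "evidence": slot})
    return calls
```

Raising on the first problem means a misspelled tool name or a duplicated slot stops the request before any API is called.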

Keep planner output under a token budget (often 300–800 tokens). Long plans recreate the verbosity ReWOO is meant to avoid. If decomposition needs more than 6–8 evidence slots, consider splitting into sub-questions or switching to rolling plan-and-execute.
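A budget gate along these lines can be a few lines of code. This is a rough sketch: counting whitespace-separated words is only a crude stand-in for a real tokenizer, and the default limits simply reuse the upper bounds quoted above.

```python
import re


def plan_within_budget(plan_text: str, max_tokens: int = 800, max_slots: int = 8) -> bool:
    """Return True if the plan stays under the token and evidence-slot caps."""
    # Whitespace word count approximates token count (assumption).
    approx_tokens = len(plan_text.split())
    # Count distinct evidence variables such as #E1, #E2, ...
    slots = set(re.findall(r"#E[1-9]\d*", plan_text))
    return approx_tokens <= max_tokens and len(slots) <= max_slots
```

A plan that fails the check is a candidate for splitting into sub-questions rather than for raising the limits.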

Worker execution and parallelism

Workers are intentionally dumb: validate tool names, check auth scopes, call APIs, truncate or compress oversized payloads, write into the evidence map. They do not re-reason about the user goal.

Dependency graph

Independent tool calls, such as catalog searches for vendor A and vendor B, should run concurrently via a worker pool. Dependent calls, such as fetching a contract ID into #E1 and then loading the matching PDF into #E4, require a second worker wave after the first completes. Build a DAG from the plan parser; a topological sort determines the waves.
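Wave scheduling can be sketched with the standard library's `graphlib` module (Python 3.9+). The `deps` mapping is an assumed output of the plan parser: each evidence slot maps to the set of slots its tool arguments reference.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set


def schedule_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group evidence slots into waves; slots in one wave can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular references
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are filled
        waves.append(ready)
        sorter.done(*ready)  # unlocks slots that depended on this wave
    return waves
```

For example, `schedule_waves({"#E1": set(), "#E2": set(), "#E3": set(), "#E4": {"#E1"}})` yields two waves: the three independent slots first, then `#E4`.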

Failure modes

Empty evidence — store explicit null or ERROR: timeout strings; the solver prompt must handle missing slots (“state unknown” beats hallucination).
Invalid plans — if the planner emits a misspelled tool name, fail fast before workers run; do not let the solver invent data.
Rate limits — parallel workers need per-API semaphores; ReWOO's parallelism is a feature only when infra supports it.
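The failure handling and rate limiting above might look like the following `asyncio` sketch. `call_tool` is a hypothetical async adapter around the real APIs, and `limits` is an assumed mapping from tool name to a semaphore; create the semaphores inside the running event loop to stay compatible with older Python versions.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_wave(
    calls: List[dict],
    call_tool: Callable[[str, dict], Awaitable[str]],
    limits: Dict[str, asyncio.Semaphore],
    timeout_s: float = 10.0,
) -> Dict[str, str]:
    """Run one wave of independent tool calls; failures become ERROR strings."""

    async def one(call: dict) -> str:
        async with limits[call["tool"]]:  # per-API concurrency cap
            try:
                return await asyncio.wait_for(
                    call_tool(call["tool"], call["args"]), timeout_s
                )
            except asyncio.TimeoutError:
                return "ERROR: timeout"
            except Exception as exc:  # keep the slot, record the failure
                return f"ERROR: {type(exc).__name__}"

    # gather preserves input order, so results line up with calls.
    results = await asyncio.gather(*(one(c) for c in calls))
    return {c["evidence"]: r for c, r in zip(calls, results)}
```

Every slot is always present in the returned map, so the solver can distinguish a failed lookup from a slot the planner never requested.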

Harbor Procurement runs two worker waves: wave 1 hits three catalog APIs in parallel; wave 2 loads compliance PDFs only for vendors whose wave 1 quote beat the incumbent by more than 8%.
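The wave 2 gate is simple arithmetic. The sketch below is illustrative only: the function name and data shapes are hypothetical, and where Harbor actually evaluates the threshold (scheduler code or a planner-declared condition) is not stated in this guide.

```python
from typing import Dict, List


def shortlist(quotes: Dict[str, float], incumbent: float, min_saving: float = 0.08) -> List[str]:
    """Vendors whose quote undercuts the incumbent by more than min_saving."""
    return sorted(
        vendor
        for vendor, price in quotes.items()
        if (incumbent - price) / incumbent > min_saving
    )
```

With an incumbent price of 100.0, a quote of 90.0 (10% lower) passes the gate and a quote of 95.0 (5% lower) does not.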

Solver design

The solver is usually one LLM call (sometimes two if you separate formatting). Inputs:

Original user question.
Planner script with evidence variables still visible (shows intended logic).
Evidence dictionary with substituted values.
Output schema: memo sections, JSON, or bullet constraints.

Instruct the solver to follow the plan's logic but ground every claim in evidence. If a branch condition cannot be evaluated because #E2 is empty, it must say so. Optional: require citation tags like [#E3] after each factual sentence for audit.

Use a capable model for the solver even if the planner is smaller, because synthesis across conflicting quotes is harder than scheduling tool calls. Keep temperature low (0–0.3) for procurement and analytics; use moderate values only for creative tasks.
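The solver inputs and the citation requirement can be sketched as a prompt builder plus a post-hoc check. The prompt wording, the `UNKNOWN` marker, and both function names are assumptions for illustration; the check confirms only that citations exist and point at real slots, not that the cited evidence supports the sentence.

```python
import re
from typing import Dict, List, Optional

CITATION = re.compile(r"\[(#E[1-9]\d*)\]")


def build_solver_prompt(question: str, script: str, evidence: Dict[str, Optional[str]]) -> str:
    """Assemble question, plan script, and evidence map into one solver prompt."""
    lines = [
        f"{slot}: {value if value is not None else 'UNKNOWN'}"
        for slot, value in evidence.items()
    ]
    return (
        "Follow the plan's logic, but ground every claim in the evidence.\n"
        "Cite the slot after each factual sentence, e.g. [#E1].\n"
        "If a slot is UNKNOWN or starts with ERROR, say so; do not guess.\n\n"
        f"Question: {question}\n\nPlan:\n{script}\n\nEvidence:\n" + "\n".join(lines)
    )


def citation_problems(answer: str, evidence: Dict[str, Optional[str]]) -> List[str]:
    """List reasons to reject a solver answer; an empty list means it passes."""
    cited = set(CITATION.findall(answer))
    problems: List[str] = []
    if not cited:
        problems.append("no citations")
    problems += [f"unknown slot {s}" for s in sorted(cited - set(evidence))]
    return problems
```

An answer with a non-empty problem list can be rejected or retried before it reaches a reviewer.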

Harbor Procurement refactor

Baseline: ReAct agent with five tools (catalog_search, netsuite_contracts, doc_store_query, fx_rate, send_slack_summary). Typical request: compare two backup vendors for a 500-seat renewal.

Pain points: serial tool latency, an agent context that exceeded 32k tokens when large SOC reports were pasted inline, and one memo that recommended Vendor B using Vendor A's price from two observations earlier.

ReWOO refactor:

Planner outputs 5–7 evidence slots and conditional memo outline in one call (~600 tokens).
Workers run catalog and NetSuite lookups in parallel; doc_store fetches run in wave 2 for shortlisted vendors only.
Worker compression keeps each evidence value under 1,200 tokens (tables summarized to top-line numbers).
Solver generates the recommendation memo with mandatory [#En] citations; the compliance review pass rate rose from 71% to 93%.
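The per-slot cap in the worker compression step can be approximated with a plain truncation helper. This is a simplified sketch: it cuts text rather than summarizing tables to top-line numbers, and it treats whitespace-separated words as tokens, which is an assumption.

```python
def fit_evidence(payload: str, max_tokens: int = 1200) -> str:
    """Truncate a tool result to roughly max_tokens, marking the cut."""
    words = payload.split()  # whitespace words approximate tokens (assumption)
    if len(words) <= max_tokens:
        return payload
    # The marker tells the solver the slot is incomplete.
    return " ".join(words[:max_tokens]) + " [TRUNCATED]"
```

A production worker would more likely summarize structured payloads before falling back to truncation.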

Metrics on 80 held-out comparison requests (same tool set, same base model):

ReAct: p50 latency 34 s, p95 41 s, mean input tokens 28.4k.
ReWOO: p50 22 s, p95 25 s, mean input tokens 15.9k.
Factual error rate (human audit): 12% ReAct vs 5% ReWOO.

Gains came from parallel I/O and keeping bulky observations out of the planner — not from a smarter checkpoint.
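The headline percentages follow directly from the figures above; a short check of the arithmetic:

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


print(f"mean input tokens: {pct_reduction(28.4, 15.9):.0f}%")  # 44%
print(f"p95 latency:       {pct_reduction(41, 25):.0f}%")      # 39%
print(f"p50 latency:       {pct_reduction(34, 22):.0f}%")      # 35%
```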

Technique decision table

Approach	Strengths	Weaknesses	Best when
ReWOO	Parallel tools; planner stays lean; predictable evidence map	Fixed plan before data; weak on surprises	Known tool set; mostly independent lookups; cost/latency sensitive
ReAct	Each observation informs next action; handles surprises	Serial tools; context bloat; reasoning drift	Short chains; highly dynamic environments
Plan-and-execute	Replanning after observations; step summaries	More LLM rounds than ReWOO; planner may still see observations on replan	Long workflows with dependencies and mid-run corrections
Agentic RAG loop	Iterative retrieval refinement	Not optimized for multi-system orchestration	Answer quality hinges on search, not CRM + ERP joins
Fixed workflow DAG	Deterministic; cheap	Inflexible for novel questions	Stable SOPs with rare exceptions

Hybrid pattern: use ReWOO for the first evidence-gathering wave, then a small ReAct replan if a critical slot is empty — captures most token savings without abandoning recovery.
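The routing decision in the hybrid pattern can be sketched as a check on the evidence map. Which slots count as critical is an assumption here: the caller supplies them, for example from a flag in the planner output, and the step names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def missing_critical(evidence: Dict[str, Optional[str]], critical: Iterable[str]) -> List[str]:
    """Critical slots that are absent, null, blank, or hold an ERROR string."""

    def empty(value: Optional[str]) -> bool:
        return value is None or not value.strip() or value.startswith("ERROR")

    return [slot for slot in critical if empty(evidence.get(slot))]


def next_step(evidence: Dict[str, Optional[str]], critical: Iterable[str]) -> str:
    """Route to a ReAct-style replan only when critical evidence is missing."""
    return "react_replan" if missing_critical(evidence, critical) else "solve"
```

Non-critical gaps still go straight to the solver, which reports them as unknown.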

Common pitfalls

Planner assumes evidence values — writing “since Acme costs $12/seat” before #E1 exists; enforce placeholder-only reasoning in the planner prompt.
Unparseable plans — free-form prose without tool blocks; workers cannot run. Use schema validation or few-shot plan exemplars.
Oversized evidence — dumping full PDF text into #E3 negates token savings; compress at the worker boundary.
Solver ignores plan — model answers from parametric knowledge; require citations and reject outputs missing [#En] tags.
No empty-slot policy — solver hallucinates when tools fail; document explicit “unknown” handling.
Parallel blast radius — firing 20 concurrent API calls trips rate limits; cap concurrency per integration.
Wrong fit — exploratory web research where each page should change the next query; use ReAct or agentic RAG instead.

Production checklist

Publish tool catalog with names, args, and side-effect flags to the planner.
Define evidence variable syntax and validate planner output with a schema.
Implement worker DAG scheduler with per-API concurrency limits.
Compress or truncate tool results before writing evidence slots.
Pass planner script + evidence map to solver; require grounded citations.
Handle null evidence explicitly in solver system prompt.
Log plan, evidence map, and solver output for debugging and compliance.
Cap evidence slot count and planner tokens; escalate to plan-and-execute if exceeded.
Add hybrid ReAct fallback when critical evidence is missing after wave 2.
A/B test latency and token spend against ReAct on representative traffic before cutover.

Key takeaways

ReWOO plans tool use before any observation reaches the planning LLM — workers fill evidence slots, then a solver synthesizes the answer.
Evidence variables (#E1, #E2, …) make plans parseable and failures attributable.
Parallel independent workers cut Harbor Procurement p95 latency from 41 s to 25 s and input tokens by 44%.
Prefer ReWOO for multi-lookup comparisons; prefer ReAct or replanning executors when each observation should change the next action.
Compress evidence at workers and require solver citations — otherwise token savings return as hallucination risk.