LLM prompt chaining explained

Harbor Legal's first contract-review assistant used a single 4,200-token prompt: read the NDA, list risks, suggest redlines, and output a client memo, all in one completion. It worked on clean templates. On scanned vendor agreements with nested indemnity clauses, the model invented nonexistent liability caps in 14% of reviews and buried genuine non-compete issues under generic boilerplate warnings. Paralegals stopped trusting the tool after two escalation incidents.

Prompt chaining splits a complex task into a sequence of smaller LLM calls, each with a narrow objective, a typed output, and a validation gate before the next step runs. Harbor replaced the monolith with a four-link chain (clause extraction, obligation classification, risk scoring, and memo synthesis) with JSON schema checks between links. False-positive risk flags fell 41%, and attorneys could see which clause triggered each flag. This guide covers what prompt chaining is, chain topologies, state passing, validation gates, the Harbor refactor, a decision table comparing chains with monolithic prompts and agent loops, pitfalls, and a checklist. It sits alongside our guides on context engineering, output parsing and validation, and self-refine loops.

What prompt chaining is

A prompt chain is a directed workflow where the output of one LLM call becomes structured input to the next. Each step has a single responsibility: extract entities, classify intent, summarize a section, translate format, or synthesize a final answer. Chains differ from one-shot prompting because you engineer interfaces between steps — not just a longer instruction block.

Prompt chaining sits between two extremes:

  • Monolithic prompts — one call does everything; simple to ship but context dilution and error compounding rise with task complexity.
  • Agent loops — the model chooses tools and next actions dynamically; flexible but harder to test, audit, and cost-predict.

Chains trade some flexibility for deterministic structure. You know exactly how many model calls run, what each step receives, and where validation fires. That predictability matters in regulated workflows like contract review, medical chart summarization, and financial report generation.

Chain topologies

Linear chains

The simplest pattern: A → B → C → D. Each step consumes the previous step's output plus any shared context from the original request. Harbor Legal's NDA pipeline is linear: extract clauses, classify obligations, score risks, write memo. Linear chains are easy to debug because you can replay any step in isolation with saved intermediate state.
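A linear chain and its replay property can be sketched in a few lines. This is a minimal illustration, not Harbor Legal's code: `run_chain`, `extract`, and `classify` are hypothetical names, and the two step functions are plain-Python stand-ins for what would be LLM calls.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_chain(steps: list[tuple[str, Step]], request: dict) -> dict:
    """Run steps in order and keep every intermediate output by step name."""
    state: dict = {"request": request}
    for name, step in steps:
        state[name] = step(state)
    return state


# Hypothetical stand-ins for LLM calls; a real step would call a model.
def extract(state: dict) -> dict:
    sentences = [s.strip() for s in state["request"]["text"].split(".") if s.strip()]
    return {"clauses": [{"id": f"C-{i + 1}", "text": s} for i, s in enumerate(sentences)]}


def classify(state: dict) -> dict:
    labels = {}
    for clause in state["extract"]["clauses"]:
        is_indemnity = "indemnif" in clause["text"].lower()
        labels[clause["id"]] = "indemnity" if is_indemnity else "other"
    return {"labels": labels}


state = run_chain(
    [("extract", extract), ("classify", classify)],
    {"text": "Vendor shall indemnify Client. Fees are due in 30 days."},
)

# Replay one step in isolation from saved intermediate state.
replayed = classify({"extract": state["extract"]})
```

Because each step reads only named entries in `state`, a saved `state["extract"]` is enough to rerun classification without repeating extraction.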

Map-reduce chains

When input exceeds a single context window or benefits from parallel focus, split the document into chunks (map), run the same prompt on each chunk, then aggregate results (reduce). A 120-page MSA might map 20 sections through an obligation extractor, then reduce into a consolidated risk matrix. Watch for duplicate entities across chunks — the reduce step should deduplicate and reconcile conflicts.
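The map and reduce phases, including deduplication across chunks, might look like the sketch below. `extract_obligations` is a hypothetical stand-in for the per-chunk LLM call (here it just looks for the word "shall"), and matching duplicates on a lowercased party and duty is an assumed, deliberately simple reconciliation rule.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_obligations(chunk: str) -> list[dict]:
    """Hypothetical stand-in for an LLM extractor: one obligation per 'shall' line."""
    found = []
    for line in chunk.splitlines():
        if " shall " in line:
            party, duty = line.split(" shall ", 1)
            found.append({"party": party.strip(), "duty": duty.strip().rstrip(".")})
    return found


def reduce_obligations(per_chunk: list[list[dict]]) -> list[dict]:
    """Merge per-chunk results, deduplicating and recording source chunks."""
    merged: dict[tuple[str, str], dict] = {}
    for chunk_index, items in enumerate(per_chunk):
        for item in items:
            key = (item["party"].lower(), item["duty"].lower())
            entry = merged.setdefault(key, {**item, "chunks": []})
            entry["chunks"].append(chunk_index)
    return list(merged.values())


def map_reduce(chunks: list[str]) -> list[dict]:
    # pool.map preserves input order, so chunk indices stay meaningful.
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_chunk = list(pool.map(extract_obligations, chunks))
    return reduce_obligations(per_chunk)


chunks = [
    "Vendor shall indemnify Client.",
    "Client shall pay fees.\nVendor shall indemnify client.",
]
obligations = map_reduce(chunks)
```

Keeping the chunk indices on each merged entry preserves provenance even after duplicates are collapsed.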

Router chains

A lightweight classifier step routes input to specialized sub-chains. Harbor routes NDAs, MSAs, and employment agreements to different classification prompts because obligation taxonomies differ. A router can be a rule set (filename patterns), an embedding-similarity lookup, or a small LLM call. Keep router prompts tiny and eval-heavy; a misroute poisons every downstream step.
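As one illustration of the rule-based option, the sketch below routes on filename and first-page text. The patterns, the route names, and the `needs_review` fallback are all assumptions for the example, not Harbor's actual rules.

```python
import re

# Assumed patterns; a real deployment would tune these against labeled samples.
ROUTES = [
    (re.compile(r"\bnon-?disclosure\b|\bNDA\b", re.IGNORECASE), "nda"),
    (re.compile(r"\bmaster services?\b|\bMSA\b", re.IGNORECASE), "msa"),
    (re.compile(r"\bemployment\b|\boffer letter\b", re.IGNORECASE), "employment"),
]


def route(filename: str, first_page: str) -> str:
    """Return the sub-chain name, or 'needs_review' when no rule matches."""
    haystack = f"{filename}\n{first_page}"
    for pattern, name in ROUTES:
        if pattern.search(haystack):
            return name
    return "needs_review"


decision = route("scan_0042.pdf", "MUTUAL NON-DISCLOSURE AGREEMENT")
```

An explicit fallback route keeps unrecognized documents out of every specialized sub-chain instead of guessing.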

Branching and conditional chains

Steps can skip or fork based on prior output: if risk score exceeds a threshold, invoke a deeper legal-reasoning chain; otherwise emit a standard summary. Conditional branches add latency only when needed but require explicit branch coverage in regression tests.
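A threshold branch can be expressed as an ordinary function, which also makes branch coverage easy to test. In this sketch the 0.7 threshold, the `risk_score` field, and both handler callables are hypothetical.

```python
from typing import Callable


def branch_on_risk(
    scored: dict,
    deep_review: Callable[[dict], str],
    standard_summary: Callable[[dict], str],
    threshold: float = 0.7,  # assumed cutoff, not a value from the guide
) -> dict:
    """Invoke the deeper chain only when the risk score exceeds the threshold."""
    if scored["risk_score"] > threshold:
        return {"branch": "deep_review", "output": deep_review(scored)}
    return {"branch": "standard_summary", "output": standard_summary(scored)}


# Hypothetical handlers standing in for the two sub-chains.
def deep(scored: dict) -> str:
    return f"deep analysis of {scored['clause_id']}"


def standard(scored: dict) -> str:
    return f"standard summary of {scored['clause_id']}"


high = branch_on_risk({"clause_id": "C-3", "risk_score": 0.9}, deep, standard)
low = branch_on_risk({"clause_id": "C-4", "risk_score": 0.2}, deep, standard)
edge = branch_on_risk({"clause_id": "C-5", "risk_score": 0.7}, deep, standard)
```

Recording the branch name in the result lets regression tests assert which path ran, including the boundary case.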

State passing and context design

The quality of a chain depends less on individual prompt prose and more on what you pass between steps. Treat each link as an API contract:

  • Typed outputs. Prefer JSON or Pydantic schemas over free text. Downstream prompts should receive structured fields, not prose dumps to re-parse.
  • Minimal carry-forward. Pass only fields the next step needs. Carrying full prior completions bloats tokens and reintroduces lost-in-the-middle noise.
  • Immutable originals. Keep the source document or user query in a read-only slot every step can reference; never let intermediate steps overwrite source text.
  • Provenance tags. Attach clause IDs, page numbers, or chunk indices so synthesis steps can cite sources and auditors can trace flags.
  • Version stamps. Record prompt version and model ID per step for reproducibility when you roll back a bad prompt edit.

Good chain design mirrors context engineering principles: assemble the smallest sufficient context per call rather than forwarding the entire conversation history.
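The contract ideas above can be sketched with frozen dataclasses. The class names, field names, version label, and model ID below are hypothetical; the sketch only shows typed outputs, provenance fields, version stamps, and minimal carry-forward working together.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Clause:
    clause_id: str  # provenance: stable ID from extraction
    page: int       # provenance: page anchor
    text: str       # verbatim span from the source


@dataclass(frozen=True)
class StepRecord:
    step: str
    prompt_version: str  # version stamp for reproducibility
    model_id: str
    clauses: tuple[Clause, ...]


def classifier_input(record: StepRecord) -> list[dict]:
    """Minimal carry-forward: the classifier gets IDs and text, nothing else."""
    return [{"clause_id": c.clause_id, "text": c.text} for c in record.clauses]


record = StepRecord(
    step="extract",
    prompt_version="extract-v3",   # hypothetical version label
    model_id="small-model-2024",   # hypothetical model ID
    clauses=(Clause("C-1", 4, "Vendor shall indemnify Client."),),
)
payload = classifier_input(record)
```

Freezing the records means a later step cannot silently overwrite extracted text; assigning to `record.step` raises `FrozenInstanceError`.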

Validation gates between steps

A chain without gates is just a longer way to fail. Insert deterministic checks between LLM calls:

  • Schema validation — reject malformed JSON before the next prompt sees it; trigger repair loops or human review.
  • Semantic validators — rule engines that flag impossible outputs (indemnity cap cited but no cap clause in extraction).
  • Confidence thresholds — if classifier confidence is below 0.7, route to a fallback model or escalate.
  • Token and cost budgets — abort chains that exceed per-request spend caps.
  • Idempotency keys — cache intermediate results so retries do not double-bill on transient failures.

Harbor Legal added a cross-check gate after extraction: every cited clause ID must exist in the parsed document tree. Hallucinated clause references dropped from 14% to under 2%. See our output parsing guide for repair-loop patterns when validation fails.
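Three of these gates (schema, clause-ID cross-check, and confidence threshold) can be sketched with the standard library alone. The field names and the `GateError` type are assumptions; a production system would more likely validate against a JSON Schema or a Pydantic model.

```python
import json


class GateError(ValueError):
    """Raised when a step's output must not reach the next prompt."""


REQUIRED_FIELDS = {"clause_id": str, "label": str, "confidence": (int, float)}


def schema_gate(raw: str) -> list[dict]:
    """Reject malformed JSON or objects with missing or mistyped fields."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GateError(f"malformed JSON: {exc}") from exc
    if not isinstance(items, list):
        raise GateError("expected a JSON array of objects")
    for item in items:
        if not isinstance(item, dict):
            raise GateError("expected each array element to be an object")
        for field, kind in REQUIRED_FIELDS.items():
            if not isinstance(item.get(field), kind):
                raise GateError(f"missing or mistyped field: {field}")
    return items


def provenance_gate(items: list[dict], known_ids: set[str]) -> list[dict]:
    """Every cited clause ID must exist in the parsed document."""
    unknown = sorted({i["clause_id"] for i in items} - known_ids)
    if unknown:
        raise GateError(f"clause IDs not in document tree: {unknown}")
    return items


def confidence_gate(items: list[dict], minimum: float = 0.7):
    """Split items into accepted and escalated by classifier confidence."""
    accepted = [i for i in items if i["confidence"] >= minimum]
    escalated = [i for i in items if i["confidence"] < minimum]
    return accepted, escalated


raw = (
    '[{"clause_id": "C-1", "label": "indemnity", "confidence": 0.93},'
    ' {"clause_id": "C-2", "label": "termination", "confidence": 0.55}]'
)
items = provenance_gate(schema_gate(raw), known_ids={"C-1", "C-2"})
accepted, escalated = confidence_gate(items)
```

Raising a single error type from every gate gives the orchestrator one place to decide between a repair loop and human review.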

Harbor Legal contract review refactor

Before: One prompt, one completion, unstructured memo output. Attorneys could not see which clause drove a flag. Re-runs produced different risk rankings because the model re-interpreted the whole document each time.

After: Four-link chain with persisted intermediate JSON:

  1. Extract — segment document into clauses with stable IDs, page anchors, and verbatim text spans.
  2. Classify — label each clause (indemnity, termination, IP assignment, non-compete, etc.) with confidence scores.
  3. Score — apply playbook rules per clause type; output structured risk objects with severity and playbook section references.
  4. Synthesize — generate attorney memo from scored objects only; synthesis prompt forbidden from introducing new legal claims.

Each link uses a smaller, cheaper model except synthesis, which runs on a larger model with only ~800 tokens of structured input. Total cost per review fell 22% despite four calls, because extraction and classification no longer needed the flagship model. Mean time-to-review dropped from 47 seconds to 31 seconds with parallel map on long MSAs.
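The restriction on the synthesis step can be backed by a deterministic check on the memo itself. The sketch below assumes the memo cites clauses in a bracketed form such as `[C-12]`, which is an invented convention for this example; it catches citations of unknown clauses but not new claims that cite nothing.

```python
import re

CITATION = re.compile(r"\[(C-\d+)\]")  # assumed citation format, e.g. [C-12]


def check_memo_citations(memo: str, scored: list[dict]) -> dict:
    """Compare clause IDs cited in the memo with the scored objects it was given."""
    known = {obj["clause_id"] for obj in scored}
    cited = set(CITATION.findall(memo))
    return {
        "unknown": sorted(cited - known),  # cited but never scored
        "uncited": sorted(known - cited),  # scored but never mentioned
    }


scored = [
    {"clause_id": "C-3", "severity": "high"},
    {"clause_id": "C-7", "severity": "low"},
]
memo = "The indemnity clause [C-3] is uncapped. The cap in [C-9] limits exposure."
report = check_memo_citations(memo, scored)
```

A non-empty `unknown` list is a reasonable trigger for rejecting the memo, while `uncited` may only warrant a warning.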

Prompt versions are pinned in a registry (see our prompt versioning guide); eval gates block promotion when extraction F1 drops below 0.92 on the golden NDA set.
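An eval gate of this kind reduces to computing F1 on a golden set and comparing it with the floor. The sketch below assumes set-based F1 over extracted clause IDs, macro-averaged across documents; the guide does not say how Harbor computes the score, so treat both choices as assumptions.

```python
from typing import Callable, Iterable


def f1_score(predicted: set, expected: set) -> float:
    """Set-based F1; two empty sets count as a perfect match."""
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def may_promote(
    golden_cases: list[tuple[str, set]],
    extractor: Callable[[str], Iterable[str]],
    minimum_f1: float = 0.92,
) -> bool:
    """Allow promotion only if mean F1 over the golden set meets the floor."""
    scores = [f1_score(set(extractor(doc)), expected) for doc, expected in golden_cases]
    return sum(scores) / len(scores) >= minimum_f1


golden = [("doc a", {"C-1", "C-2"}), ("doc b", {"C-1"})]


# Hypothetical extractors standing in for two prompt versions.
def good_extractor(doc: str) -> set:
    return {"C-1", "C-2"} if doc == "doc a" else {"C-1"}


def regressed_extractor(doc: str) -> set:
    return {"C-1"}


promote_good = may_promote(golden, good_extractor)
promote_regressed = may_promote(golden, regressed_extractor)
```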

Technique decision table

Approach | Best when | Strength | Weakness
Monolithic one-shot prompt | Simple Q&A, <2k tokens, low stakes | Fast to build; one latency hop | Error compounding; hard to debug; context dilution
Linear prompt chain | Multi-stage workflows with clear dependencies | Testable steps; typed interfaces; cost routing per step | Fixed topology; cannot adapt mid-flight
Map-reduce chain | Long documents, parallel section analysis | Scales past context limits; parallelizable | Reduce conflicts; duplicate entity handling
Router + sub-chains | Heterogeneous input types (NDA vs MSA vs offer letter) | Specialized prompts per domain | Router errors cascade; needs per-route evals
ReAct agent loop | Exploratory tasks, dynamic tool selection | Adapts to surprises; uses external tools | Unpredictable cost; harder compliance audit
Self-refine loop | Quality-critical drafts (copy, code, summaries) | Iterative improvement on one artifact | Multiplies latency; critique may drift
Plan-and-execute | Multi-step projects with replanning needs | Separates planning from execution | Planner hallucinations; complex orchestration

Use chains when the workflow decomposes cleanly and auditability matters. Reach for ReAct loops when the path cannot be known upfront. Combine chains with self-refine on the final synthesis step when polish matters more than speed.

Common pitfalls

  • Chains that are just prompt concatenation. If step two re-parses prose from step one, you have a longer monolith with extra latency — not a chain.
  • No validation between links. Garbage propagates; fix at the cheapest step, not at the end.
  • Over-sharing context. Forwarding full prior completions negates token savings and reintroduces attention dilution.
  • Same model everywhere. Extraction often works on small models; burning GPT-4-class tokens on segmentation wastes budget.
  • Missing provenance. Synthesis without clause IDs produces un-auditable legal or medical advice.
  • Untested routers. A 5% misroute rate sends one document in twenty through the wrong sub-chain, and every downstream step inherits that error.
  • No partial-failure handling. If map step 7 of 12 fails, the chain should degrade gracefully, not abort the entire job.
  • Skipping eval per step. End-to-end accuracy hides which link regressed after a prompt edit.
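The partial-failure pitfall is one of the easier ones to address in code. In the sketch below, `map_with_degrade` and the failing worker are hypothetical; the idea is simply to collect per-shard failures instead of letting one exception abort the whole map phase.

```python
from typing import Callable


def map_with_degrade(shards: list[str], worker: Callable[[str], dict]) -> dict:
    """Run the worker on every shard, recording failures instead of aborting."""
    results, failures = [], []
    for index, shard in enumerate(shards):
        try:
            results.append({"shard": index, "output": worker(shard)})
        except Exception as exc:  # broad on purpose: any shard error degrades
            failures.append({"shard": index, "error": repr(exc)})
    return {"results": results, "failures": failures, "complete": not failures}


# Hypothetical worker standing in for an LLM call that fails on one shard.
def flaky_worker(shard: str) -> dict:
    if shard == "corrupt scan":
        raise ValueError("unreadable shard")
    return {"length": len(shard)}


outcome = map_with_degrade(["section one", "corrupt scan", "section three"], flaky_worker)
```

The `complete` flag lets the reduce or synthesis step state plainly that its output covers only part of the document.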

Production checklist

  • Decompose the task into single-responsibility steps with named inputs and outputs.
  • Define JSON schemas (or equivalent) for every inter-step interface.
  • Insert schema and semantic validation gates between each LLM call.
  • Pass minimal structured state; keep source documents in an immutable slot.
  • Attach provenance IDs (clause, chunk, page) to every extracted object.
  • Route cheap steps to smaller models; reserve large models for synthesis.
  • Version and pin prompts per chain link; block deploy on per-step eval regression.
  • Log intermediate JSON for debugging; redact PII per retention policy.
  • Implement retry with idempotency keys on transient failures per step.
  • Set per-chain and per-step token/cost budgets with hard abort thresholds.
  • Build golden datasets per link, not only end-to-end integration tests.
  • Document degrade paths when map shards or router branches fail.
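Two checklist items, idempotent retries and hard budget aborts, fit naturally into one small wrapper around each step call. `StepRunner`, `BudgetExceeded`, and the convention that a call returns `(output, tokens_used)` are assumptions made for this sketch, and the in-memory cache stands in for whatever store a real deployment would use.

```python
import hashlib
import json
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when the chain has already used its token budget."""


class StepRunner:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.spent = 0
        self.cache: dict[str, dict] = {}

    def run(self, step: str, prompt_version: str, payload: dict,
            call: Callable[[dict], tuple[dict, int]]) -> dict:
        # Idempotency key: same step, prompt version, and payload means same work.
        key_source = json.dumps([step, prompt_version, payload], sort_keys=True)
        key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
        if key in self.cache:  # retry of identical work: no new spend
            return self.cache[key]
        if self.spent >= self.max_tokens:
            raise BudgetExceeded(f"{self.spent} tokens used, cap is {self.max_tokens}")
        output, tokens = call(payload)  # assumed to return (output, tokens used)
        self.spent += tokens
        self.cache[key] = output
        return output


calls = []


def fake_llm(payload: dict) -> tuple[dict, int]:
    """Hypothetical stand-in for a model call costing 60 tokens."""
    calls.append(payload)
    return {"echo": payload["text"]}, 60


runner = StepRunner(max_tokens=100)
first = runner.run("classify", "v1", {"text": "a"}, fake_llm)
retry = runner.run("classify", "v1", {"text": "a"}, fake_llm)
```

Note that this runner checks the budget before each call, so a single call can still overshoot the cap; estimating cost before the call would tighten that.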

Key takeaways

  • Prompt chaining splits complex tasks into typed, testable LLM steps with explicit interfaces between calls.
  • Linear, map-reduce, and router topologies cover most production workflows; pick topology from document shape and input heterogeneity.
  • Validation gates between links stop hallucinations from propagating — Harbor Legal cut false risk flags 41% with schema and provenance checks.
  • Chains beat monoliths on auditability and cost routing; agents beat chains when the path cannot be predetermined.
  • Eval per step, not only end-to-end — regressions hide inside individual links after prompt edits.
