LLM least-to-most prompting explained

Harbor Logistics runs a freight-routing copilot: given truck capacity, dock windows, and a list of pickup-dropoff pairs, it must produce a feasible multi-stop plan. A single chain-of-thought (CoT) pass produced fully feasible routes on only 48% of their 120-problem eval; the model tried to optimize everything at once and skipped capacity checks mid-route. Engineers switched to least-to-most (L2M) prompting: first ask the model to decompose the request into ordered subproblems (for example: sort stops by deadline, assign pallets to legs, verify cumulative weight), then solve each subproblem sequentially while injecting answers from earlier steps into the prompt. Feasibility rose to 79%, and median latency grew from 2.4 s to 6.8 s across the chained calls. The win came from explicit dependency ordering, not from a larger checkpoint.

Least-to-most prompting is a two-phase inference pattern for problems where later reasoning depends on earlier intermediate results. Phase 1 produces a decomposition plan; phase 2 walks the plan in order, appending each sub-answer to context before the next call. Unlike flat CoT, L2M prevents the model from jumping to a final route before constraints are satisfied. Unlike breadth-first tree-of-thought (ToT), it follows one decomposition line rather than exploring parallel branches. This guide covers the decompose-then-solve loop, few-shot exemplar design, static versus dynamic decomposition, context management across chained calls, the Harbor Logistics refactor, a decision table comparing L2M with CoT, ToT, MCTS, and program-aided prompting, pitfalls, and a production checklist.

The decompose-then-solve loop

L2M treats complex reasoning as a pipeline of dependent subquestions. The name “least-to-most” reflects solving simpler subproblems first and feeding their outputs into harder ones — analogous to dynamic programming, but orchestrated through prompts rather than code.

  1. Decompose — given the user problem, the model outputs an ordered list of subproblems. Each item should be answerable in isolation once prior items are resolved.
  2. Solve subproblem 1 — prompt with the original problem plus subproblem 1 only; capture answer A1.
  3. Solve subproblem k — prompt with the original problem, subproblems 1…k, and answers A1…Ak−1; capture Ak.
  4. Synthesize — optional final call that combines sub-answers into the user-facing response (or subproblem n is already the final answer).

Each solve step is a standard LLM completion. You can use different temperatures per phase: low temperature for decomposition (stable plan) and moderate temperature for creative sub-steps if needed.
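The four steps above can be sketched as a short driver. This is a minimal illustration, not a reference implementation: `complete` is a hypothetical callable standing in for whatever LLM client you use, the prompt wording is invented for the example, and the two temperature defaults are placeholders rather than recommended values.

```python
from typing import Callable, List, Tuple

# Hypothetical signature (assumption): (prompt, temperature) -> completion text.
Complete = Callable[[str, float], str]


def parse_subproblems(plan_text: str) -> List[str]:
    """Turn lines such as '1. filter stops' into ['filter stops', ...]."""
    items = []
    for line in plan_text.splitlines():
        head, sep, rest = line.strip().partition(".")
        if sep and head.isdigit() and rest.strip():
            items.append(rest.strip())
    return items


def least_to_most(problem: str, complete: Complete,
                  plan_temp: float = 0.0,
                  solve_temp: float = 0.3) -> Tuple[List[str], List[str]]:
    # Phase 1: one low-temperature call that produces the ordered plan.
    plan = complete(
        f"Problem: {problem}\n"
        "List the ordered subproblems, one per numbered line.",
        plan_temp,
    )
    subproblems = parse_subproblems(plan)

    # Phase 2: solve in order; call k sees subproblems 1..k and answers 1..k-1.
    answers: List[str] = []
    for k, sub in enumerate(subproblems, start=1):
        prior = "\n".join(
            f"Q{i}: {q}\nA{i}: {a}"
            for i, (q, a) in enumerate(zip(subproblems, answers), start=1)
        )
        prompt = f"Problem: {problem}\n{prior}\nAnswer ONLY subproblem {k}: {sub}"
        answers.append(complete(prompt, solve_temp))
    return subproblems, answers
```

The optional synthesis call is left out here; in this sketch the last entry of `answers` plays the role of the final answer.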

Few-shot exemplars and ordering constraints

L2M quality depends heavily on decomposition exemplars in the few-shot prefix. Each exemplar should show:

  • The original complex question.
  • A numbered subproblem list where item i only uses information available after items 1…i−1 are solved.
  • Short answers for each subproblem in separate blocks.
  • The final composed answer.

Good decompositions are monotonic: no step requires facts that only become available in a later step. For routing, decompose as “filter infeasible stops” before “optimize visit order” before “compute arrival timestamps.” For math word problems, decompose as “identify quantities” before “write equations” before “solve.”
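One way to make the ordering rule checkable is to annotate each step with the steps it depends on and lint the plan before running it. The sketch below assumes a hypothetical plan format in which every step carries an `id` and a `needs` list; neither field comes from the original L2M recipe, and you would have to ask the model (or your exemplar authors) to supply them.

```python
from typing import Dict, List


def monotonic_violations(plan: List[Dict]) -> List[str]:
    """Report steps that depend on an unknown step or on one not yet solved."""
    position = {step["id"]: i for i, step in enumerate(plan)}
    problems = []
    for i, step in enumerate(plan):
        for dep in step.get("needs", []):
            if dep not in position:
                problems.append(f"{step['id']} needs unknown step {dep}")
            elif position[dep] >= i:
                # The dependency sits at the same or a later position.
                problems.append(f"{step['id']} needs {dep}, which is not solved yet")
    return problems
```

An empty result means every step only reads outputs from earlier steps; any message is a reason to fix the exemplar or reject the plan.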

Include 2–4 in-domain exemplars in the decomposition prompt. The solve prompts can share a lighter system message (“Answer only the current subproblem using prior answers”) without repeating full exemplars, which saves tokens on long chains.
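As a rough illustration of the split between a heavy decomposition prompt and a light solve prompt, the following sketch assembles the few-shot prefix from exemplar records. The dictionary keys (`question`, `subproblems`, `answers`, `final`) and the exact layout are assumptions made for the example; only the quoted solve-step system message comes from the text above.

```python
from typing import Dict, List

# Light system message reused by every solve call (quoted from the guide).
SOLVE_SYSTEM = "Answer only the current subproblem using prior answers"


def render_exemplar(ex: Dict) -> str:
    """Show question, ordered subproblems, per-step answers, and final answer."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(ex["subproblems"], start=1))
    answers = "\n".join(f"A{i}: {a}" for i, a in enumerate(ex["answers"], start=1))
    return (f"Question: {ex['question']}\nSubproblems:\n{steps}\n"
            f"Answers:\n{answers}\nFinal answer: {ex['final']}")


def build_decomposition_prompt(exemplars: List[Dict], question: str) -> str:
    if not 2 <= len(exemplars) <= 4:
        raise ValueError("expected 2 to 4 in-domain exemplars")
    blocks = [render_exemplar(ex) for ex in exemplars]
    # Leave the new question open so the model continues with its own plan.
    blocks.append(f"Question: {question}\nSubproblems:")
    return "\n\n".join(blocks)
```

Solve calls would send `SOLVE_SYSTEM` plus the prior answers instead of this full prefix.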

Static versus dynamic decomposition

Three deployment variants:

  • Static L2M — decomposition runs once upfront; the subproblem list is fixed for the whole request. Cheapest; fails if subproblem 2 reveals subproblem 1 was wrong.
  • Dynamic L2M — after each sub-answer, optionally re-run decomposition on the remaining work (“given A1…Ak, what is the next subproblem?”). Handles surprises at 1.5–2× decomposition cost.
  • Hybrid — static plan with a verifier gate: if sub-answer fails a checker, trigger re-decomposition from that point.

Harbor Logistics uses static L2M for 80% of requests and falls back to dynamic re-decomposition when cumulative weight exceeds truck capacity — a deterministic trigger, not model self-judgment.
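The verifier-gated fallback can be sketched as a small control loop. This is an assumption-laden outline rather than Harbor Logistics' actual code: `solve`, `check`, and `replan` are hypothetical callables you would supply, and `max_replans` is an invented safety limit.

```python
from typing import Callable, List, Tuple


def run_with_gate(plan: List[str],
                  solve: Callable[[str, List[Tuple[str, str]]], str],
                  check: Callable[[str, str], bool],
                  replan: Callable[[List[Tuple[str, str]], List[str]], List[str]],
                  max_replans: int = 1) -> List[Tuple[str, str]]:
    """Walk a static plan; re-decompose from the failing step when a check fails."""
    answers: List[Tuple[str, str]] = []
    queue = list(plan)
    replans = 0
    while queue:
        sub = queue.pop(0)
        ans = solve(sub, answers)
        if not check(sub, ans):  # deterministic gate, not model self-judgment
            if replans >= max_replans:
                raise RuntimeError(f"verifier still failing at: {sub}")
            replans += 1
            # Ask for a new plan covering the failed step and everything after it.
            queue = replan(answers, [sub] + queue)
            continue
        answers.append((sub, ans))
    return answers
```

Accepted answers are kept across a re-plan, so only the work from the failing step onward is redone.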

Context management across chained calls

Each solve step grows the prompt. Practical patterns:

  • Structured prior-answer block — JSON {subproblem_id, question, answer} list instead of prose summaries; reduces reinterpretation drift.
  • Scratchpad truncation — keep only final sub-answers in context, not full CoT from earlier steps (saves tokens; may lose audit trail).
  • Prompt caching — prefix-share the original problem and exemplars across sub-calls when your provider supports it.
  • Parallel independent subproblems — if the decomposition identifies disjoint branches (two trucks on separate routes), solve branches in parallel then merge; still L2M within each branch.

Cap maximum subproblem count (typically 4–8). Beyond that, accuracy gains flatten while latency and error-compounding rise.
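A structured prior-answer block with scratchpad truncation might look like the sketch below. The `reasoning` key and the cap of 8 are assumptions for illustration (8 is simply the upper end of the range mentioned above); the three JSON fields follow the pattern described in the list.

```python
import json
from typing import Dict, List

MAX_SUBPROBLEMS = 8  # assumed cap; tune per task


def prior_answer_block(records: List[Dict], keep_reasoning: bool = False) -> str:
    """Serialize earlier sub-answers as JSON, dropping scratchpad text by default."""
    if len(records) > MAX_SUBPROBLEMS:
        raise ValueError("too many subproblems; merge related steps")
    slim = []
    for r in records:
        item = {
            "subproblem_id": r["subproblem_id"],
            "question": r["question"],
            "answer": r["answer"],
        }
        # 'reasoning' is a hypothetical field holding the step's full CoT.
        if keep_reasoning and "reasoning" in r:
            item["reasoning"] = r["reasoning"]
        slim.append(item)
    return json.dumps(slim, indent=2)
```

Setting `keep_reasoning=True` trades tokens for an audit trail inside the prompt; logging the reasoning outside the prompt is another way to keep both.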

Harbor Logistics refactor (worked example)

Problem template: 1 truck, capacity C, n stops with time windows and pallet weights. Baseline single-shot CoT often produced routes that looked optimal but violated capacity on leg 3.

L2M decomposition prompt (abridged):

Subproblems:
1. List stops that cannot fit alone given min pallet size; mark infeasible.
2. Greedy bin-pack pallets into trip segments without reordering yet.
3. Order stops within each segment by earliest deadline.
4. Simulate drive times; flag window violations.
5. If violations exist, swap adjacent stops and re-check; else output route.

Each sub-call receives prior JSON answers. Step 2’s output is a segment list consumed by step 3; step 4 runs a deterministic drive-time simulator the LLM calls via structured output, not mental math. Results on 120 held-out instances:

  • Single CoT: 48% fully feasible routes.
  • L2M (5 subproblems, static): 79% feasible.
  • Tree-of-thought (beam 3, same token budget): 74% feasible, 2.1× latency.

L2M won on structured constraint satisfaction; ToT helped more on ambiguous priority tradeoffs where multiple valid orderings exist.
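The capacity failure described in this example is the kind of constraint a deterministic checker can catch between sub-calls. The following is a hypothetical checker, not the team's simulator: it assumes a route is a list of `(stop, weight_change)` pairs, with pickups positive and dropoffs negative, and that all weights share one unit.

```python
from typing import List, Optional, Tuple


def first_overloaded_leg(route: List[Tuple[str, float]],
                         capacity: float) -> Optional[int]:
    """Return the 1-based leg where cumulative load first exceeds capacity."""
    load = 0.0
    for leg, (_stop, delta) in enumerate(route, start=1):
        load += delta  # pickups add weight, dropoffs remove it
        if load > capacity:
            return leg
    return None  # the whole route stays within capacity


# Illustrative numbers only: the load is 400, then 900, then 1200.
route = [("A", 400), ("B", 500), ("C", 300), ("A-drop", -400)]
print(first_overloaded_leg(route, capacity=1000))  # prints 3
```

A non-`None` result is a natural trigger for re-decomposition, since it does not rely on the model judging its own output.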

Technique decision table

| Technique | Strengths | Weaknesses | Best when |
| --- | --- | --- | --- |
| Least-to-most | Clear dependency chain; easy to debug per step | Early sub-errors propagate; multi-call latency | Ordered subtasks; verifiable intermediate results |
| Chain-of-thought | One call; low orchestration | Skips steps on hard problems | Medium difficulty; strong base model |
| Tree-of-thought | Explores alternative decompositions | High token cost; complex controller | Multiple valid paths; need search |
| LLM MCTS | Adaptive depth; concentrates budget | Heavy infra; rollout design | Uneven branching; simulator at leaves |
| Program-aided (PAL) | Exact arithmetic via code execution | Needs sandbox; code-gen failures | Numeric-heavy steps with clear APIs |

See test-time compute for how L2M fits alongside self-consistency, best-of-N, and search-based methods in the inference-budget toolbox.

Common pitfalls

  • Bad decomposition order — optimizing before filtering infeasible options; fix exemplars and add ordering rules.
  • Over-long subproblem lists — 12 micro-steps compound latency and transcription errors; merge related steps.
  • No verification between steps — step 3 trusts a wrong step 2; insert deterministic checkers or lightweight validators.
  • Prose prior-answers — the model paraphrases earlier numbers incorrectly; use structured JSON or key-value blocks.
  • Identical prompts per step — the model re-solves the whole problem each call; constrain with “answer ONLY subproblem k.”
  • Ignoring tool hooks — L2M pairs well with calculators and simulators at specific sub-steps rather than pure LLM math.
  • No fallback — when decomposition returns one item (problem already atomic), skip straight to CoT to avoid empty overhead.
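Two of the pitfalls above, over-long lists and atomic problems, can be screened with a small routing function run right after decomposition. The strategy labels and the default limit are made up for this sketch.

```python
from typing import List


def choose_strategy(subproblems: List[str], max_steps: int = 8) -> str:
    """Pick a path based on how many non-empty subproblems the plan contains."""
    steps = [s.strip() for s in subproblems if s.strip()]
    if len(steps) <= 1:
        return "single_cot"     # problem is already atomic; skip the chain
    if len(steps) > max_steps:
        return "merge_steps"    # too many micro-steps; merge before solving
    return "least_to_most"
```

What "merge_steps" does is left open here; it could mean re-prompting for a shorter plan or merging adjacent steps by rule.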

Production checklist

  • Curate 2–4 decomposition few-shot exemplars per task type.
  • Document subproblem ordering rules and maximum chain length.
  • Use structured formats for sub-answers passed between calls.
  • Add per-step verifiers where deterministic checks exist.
  • Define re-decomposition triggers (verifier failure, confidence threshold).
  • Log each subproblem prompt, answer, and latency for debugging.
  • Enable prompt caching on shared prefixes if available.
  • Cap wall-clock time; fall back to single CoT on timeout.
  • A/B against flat CoT and ToT at equal token budget before defaulting.
  • Combine with PAL for numeric sub-steps that need exact computation.
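The logging and timeout items can be combined in one wrapper, sketched below under stated assumptions: `solve_step` and `fallback` are hypothetical callables, the log record fields are invented, and the budget is only checked between steps, so an in-flight call is never interrupted.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_with_budget(subproblems: List[str],
                    solve_step: Callable[[str, List[str]], str],
                    fallback: Callable[[], str],
                    budget_s: float,
                    clock: Callable[[], float] = time.monotonic
                    ) -> Tuple[str, List[Dict]]:
    """Solve steps in order, log each one, and fall back once the budget is spent."""
    start = clock()
    answers: List[str] = []
    log: List[Dict] = []
    for k, sub in enumerate(subproblems, start=1):
        if clock() - start > budget_s:
            return fallback(), log  # e.g. a single flat-CoT call
        t0 = clock()
        ans = solve_step(sub, answers)
        log.append({"step": k, "subproblem": sub, "answer": ans,
                    "latency_s": clock() - t0})
        answers.append(ans)
    final = answers[-1] if answers else fallback()
    return final, log
```

Passing `clock` as a parameter keeps the wrapper testable with a fake clock; in production the default `time.monotonic` is used.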

Key takeaways

  • L2M decomposes a hard problem into ordered subproblems, then solves each with prior answers in context.
  • Decomposition quality and monotonic ordering matter more than solve-step temperature tuning.
  • Harbor Logistics lifted feasible routing from 48% to 79% with five chained sub-calls.
  • Structured prior-answers and per-step verifiers reduce error propagation.
  • Use ToT or MCTS when multiple decomposition paths need search; use L2M when dependencies are linear.
