
LLM agent planning and task decomposition explained

Harbor DevOps’ release agent ran as a pure ReAct loop: each step chose the next tool call from scratch. On a routine blue-green deploy it applied a database migration, started canary traffic, noticed a health-check failure, and rolled back the app tier, but left the migration applied. The next deploy attempt hit schema drift, and on-call engineers spent forty minutes reconciling state. Over ninety days, the partial-failure rate on agent-driven releases reached 34%. The model wasn’t dumb; it lacked a plan that encoded dependencies, rollback scope, and verification gates before irreversible steps ran.

Task decomposition breaks a user goal into ordered, checkable subtasks. Planning materializes that decomposition as a durable artifact the runtime can execute, audit, and replan against. Replacing ad-hoc ReAct with a typed plan DAG, explicit preconditions, and post-step validators cut Harbor’s partial-failure rate to 6% and shaved 22 minutes off mean time to safe deploy. This guide covers upfront vs reactive planning, hierarchical task networks, plan schema design, plan-act-verify loops, replanning triggers, integration with subagent delegation and checkpointing, the Harbor DevOps refactor, a technique decision table, pitfalls, and a production checklist.

Planning vs reactive tool loops

A reactive agent asks “what should I do next?” after every observation. That works for short, reversible tasks. It fails when steps have hidden dependencies (migrate before scale-up), irreversible side effects (send customer email, delete S3 prefix), or parallelizable sub-work that one thread cannot hold in working memory. Planning front-loads structure: the agent (or a dedicated planner model) emits a graph of steps with inputs, outputs, success criteria, and rollback hints before expensive execution begins.

Plan-act-verify (PAV)

Production agents rarely run a plan blindly. Plan-act-verify alternates three phases: (1) produce or refresh a plan slice, (2) execute one or more steps, (3) run validators — automated checks, smaller judge models, or human approval — before advancing. Failed verification triggers replanning on the remaining subgraph rather than retrying the same tool call indefinitely. This pairs naturally with loop termination rules: stagnation detectors watch for plans that oscillate between the same two failed steps.
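
The three phases can be sketched as a small loop. This is a minimal illustration rather than a reference implementation: Step, RunLog, plan_act_verify, and the replan callback are hypothetical names, and a real planner would return a revised subgraph instead of a flat list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    id: str
    act: Callable[[], str]          # executes the step and returns an observation
    verify: Callable[[str], bool]   # validator applied to that observation


@dataclass
class RunLog:
    executed: list[str] = field(default_factory=list)
    replans: int = 0


def plan_act_verify(plan, replan, max_replans=3):
    """Run steps in order; when verification fails, replan only the remainder."""
    log = RunLog()
    remaining = list(plan)
    while remaining:
        step = remaining.pop(0)
        observation = step.act()
        if step.verify(observation):
            log.executed.append(step.id)
            continue
        if log.replans >= max_replans:
            # Budget exhausted: stop instead of retrying indefinitely.
            raise RuntimeError(f"replan budget exhausted at {step.id}")
        log.replans += 1
        # The planner sees the failed step, its observation, and what is left.
        remaining = replan(step, observation, remaining)
    return log
```

The replan budget is what keeps a failed validator from turning into an endless retry; once it is exhausted the loop raises so the caller can escalate.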

When planning pays off

Five or more steps with ordering constraints
Mix of read-only research and mutating actions
Human-visible milestones (“here is what I will do”)
Compliance or change-management audit trails
Multi-agent workflows where children need scoped briefs

Skip heavy planning for single-tool lookups, one-shot summarization, or tightly bounded chat turns where latency dominates and rollback is trivial.

Decomposition strategies

Decomposition is the art of splitting a goal without losing the intent. Common patterns:

Sequential pipeline

Linear steps: gather requirements → draft → review → publish. Simple to execute; poor fit when middle steps can run in parallel.

Hierarchical task network (HTN)

Compound tasks expand into subtasks via methods until only primitive actions remain. “Deploy service” expands to build, test, migrate, canary, promote, each with its own sub-plan. HTN shines when you have a library of reusable methods per domain (finance close, incident response, data pipeline).
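
A rough sketch of HTN expansion, assuming a method library stored as a plain dictionary. The deploy_service subtasks follow the example above; the migrate sub-plan and the function name expand are made up for illustration.

```python
# Hypothetical method library: compound task -> ordered subtasks.
METHODS = {
    "deploy_service": ["build", "test", "migrate", "canary", "promote"],
    "migrate": ["backup_schema", "apply_migration", "verify_schema"],
}


def expand(task, methods, depth=0, max_depth=3):
    """Recursively expand compound tasks until only primitive actions remain."""
    if task not in methods:
        return [task]  # no method registered, so treat it as primitive
    if depth >= max_depth:
        raise ValueError(f"expansion depth exceeded at {task!r}")
    primitives = []
    for subtask in methods[task]:
        primitives.extend(expand(subtask, methods, depth + 1, max_depth))
    return primitives
```

Calling expand("deploy_service", METHODS) yields a flat, ordered list of primitives in which migrate has been replaced by its three substeps. The depth cap guards against runaway or accidentally recursive methods.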

Goal-oriented decomposition

The planner states sub-goals (“confirm schema version matches main”) without prescribing tools. An executor model maps goals to tools at runtime. More flexible; harder to audit unless goals are typed and testable.

Map-reduce over documents

Split corpora by section or file, process the pieces in parallel with subagents, then merge the summaries in the parent agent. Planning here is mostly about partition boundaries and the merge schema, not a deep HTN.
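
As a sketch of the partition-and-merge shape, the snippet below fans sections out to a thread pool and merges the partial results. summarize_section is a stand-in for a subagent call (here it only counts words), and the merge schema is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_section(section: str) -> dict:
    # Stand-in for a subagent call: the first line is the title, the rest is counted.
    return {"title": section.splitlines()[0], "words": len(section.split())}


def map_reduce(sections, worker=summarize_section, max_workers=4):
    """Process sections in parallel, then merge into one summary document."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(worker, sections))  # map preserves input order
    return {
        "sections": partials,
        "total_words": sum(p["words"] for p in partials),
    }
```

Because every worker returns the same small schema, the merge step stays trivial. That is usually the part worth planning explicitly.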

Plan representation and schema

Free-text bullet plans are fine for demos; production needs machine-readable structure stored alongside the run ID.

Minimal plan node fields

id — stable step identifier
description — human-readable intent
depends_on — list of prerequisite step IDs (DAG edges)
action_type — tool, subagent, human_gate, verify_only
success_criteria — predicate or validator ref
rollback_hint — optional compensating action
status — pending, running, succeeded, failed, skipped

Serialize as JSON validated against a versioned schema. Bump the version when you add fields (e.g. estimated_tokens, approval_tier). Store the plan in your checkpoint store so restarts resume mid-DAG and replan from scratch only when inputs have changed.
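
One way to express these fields is a dataclass with a JSON round-trip. This is a minimal sketch using only the standard library: the field names come from the list above, while SCHEMA_VERSION, dump_plan, and load_plan are illustrative names, and a production system might prefer a dedicated JSON Schema validator.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

SCHEMA_VERSION = 1  # assumed starting version
ACTION_TYPES = {"tool", "subagent", "human_gate", "verify_only"}
STATUSES = {"pending", "running", "succeeded", "failed", "skipped"}


@dataclass
class PlanNode:
    id: str
    description: str
    action_type: str
    success_criteria: str
    depends_on: list[str] = field(default_factory=list)
    rollback_hint: Optional[str] = None
    status: str = "pending"

    def __post_init__(self):
        # Reject values outside the enumerations at construction time.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


def dump_plan(run_id, nodes):
    """Serialize a plan together with its run ID and schema version."""
    return json.dumps({
        "schema_version": SCHEMA_VERSION,
        "run_id": run_id,
        "nodes": [asdict(n) for n in nodes],
    })


def load_plan(text):
    """Parse a stored plan, refusing schema versions this code does not know."""
    doc = json.loads(text)
    if doc["schema_version"] != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {doc['schema_version']}")
    return doc["run_id"], [PlanNode(**n) for n in doc["nodes"]]
```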

DAG vs checklist

A checklist is a degenerate DAG (total order). Use checklists when steps never parallelize. Use a DAG when independent research threads can fan out and join at a merge step. Cycles are forbidden; if the model emits a cycle, reject at validation and ask for a revised plan.
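
Cycle rejection at validation time can be sketched with Python's standard graphlib module (3.9+). The function name validate_dag and the dictionary shape (step ID mapped to its depends_on list) are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(depends_on: dict[str, list[str]]) -> list[str]:
    """Return one valid execution order, or raise ValueError for a bad plan."""
    known = set(depends_on)
    for node, deps in depends_on.items():
        missing = [d for d in deps if d not in known]
        if missing:
            raise ValueError(f"{node} depends on unknown steps: {missing}")
    try:
        # TopologicalSorter takes a mapping of node -> predecessors.
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes forming the cycle.
        raise ValueError(f"plan contains a cycle: {exc.args[1]}") from exc
```

The error message can be fed back to the planner model verbatim when asking for a revised plan.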

Token budget for plans

Large plans blow the planner’s context. Cap depth (max three expansion levels), collapse completed subtrees to one-line summaries in the active prompt, and delegate leaf exploration to subagents with narrow briefs tied to plan node IDs. Track plan size in your context budget allocator separately from chat history.
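
Collapsing completed subtrees might look like the following sketch, which assumes a nested dictionary plan with id, status, and optional children keys. The representation and function names are illustrative only.

```python
def all_succeeded(node):
    """True when the node and every descendant have status 'succeeded'."""
    return node["status"] == "succeeded" and all(
        all_succeeded(child) for child in node.get("children", [])
    )


def count_steps(node):
    """Number of nodes in the subtree, including the root."""
    return 1 + sum(count_steps(child) for child in node.get("children", []))


def render_for_planner(node, indent=0):
    """Render the plan as lines, collapsing finished subtrees to one line."""
    pad = "  " * indent
    children = node.get("children", [])
    if children and all_succeeded(node):
        hidden = count_steps(node) - 1
        return [f"{pad}{node['id']}: succeeded ({hidden} substeps collapsed)"]
    lines = [f"{pad}{node['id']}: {node['status']}"]
    for child in children:
        lines.extend(render_for_planner(child, indent + 1))
    return lines
```

Only the rendered view is collapsed. The full tree stays in the plan store, so audits and replans can still see every step.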

Execution: binding plans to tools and agents

The executor reads the next runnable node (all dependencies succeeded), resolves action_type, and dispatches. Tool nodes map to registered functions with argument schemas; subagent nodes spawn children with a brief derived from the node description plus upstream outputs; human_gate nodes enqueue approval tasks and pause the run.
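
The selection and dispatch step could be sketched as below. Nodes are plain dictionaries here, and the handlers are hypothetical stubs that return a description of what they would do instead of calling real tools.

```python
def runnable_nodes(nodes):
    """Pending nodes whose dependencies have all succeeded."""
    status = {n["id"]: n["status"] for n in nodes}
    return [
        n for n in nodes
        if n["status"] == "pending"
        and all(status[dep] == "succeeded" for dep in n["depends_on"])
    ]


# Hypothetical handlers, one per action_type.
HANDLERS = {
    "tool": lambda n: f"invoke tool for {n['id']}",
    "subagent": lambda n: f"spawn subagent with brief: {n['description']}",
    "human_gate": lambda n: f"enqueue approval for {n['id']} and pause",
    "verify_only": lambda n: f"run validator for {n['id']}",
}


def dispatch(node, handlers=HANDLERS):
    """Route a node to the handler registered for its action_type."""
    try:
        handler = handlers[node["action_type"]]
    except KeyError:
        raise ValueError(f"no dispatcher for {node['action_type']!r}") from None
    return handler(node)
```

Independent nodes can appear in the runnable list together, which is where a DAG starts to pay off over a checklist.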

Idempotency and side effects

Tag mutating nodes with idempotency keys derived from run_id and step_id. Replays after crash must not double-charge or double-send. Read-only nodes can retry freely. Document which steps are compensatable in rollback_hint — Harbor’s fix bundled migration + app deploy into one transactional segment with a shared rollback script referenced from both nodes.
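
A minimal sketch of key derivation and replay protection follows. The in-memory ledger is an assumption made to keep the example self-contained; a real deployment would need a durable store, or the downstream API's own idempotency support.

```python
import hashlib


def idempotency_key(run_id: str, step_id: str) -> str:
    """Deterministic key: the same run and step always yield the same value."""
    return hashlib.sha256(f"{run_id}:{step_id}".encode()).hexdigest()


class SideEffectLedger:
    """Remembers results of mutating steps so a replay does not repeat them."""

    def __init__(self):
        self._results = {}  # in-memory only; illustrative

    def run_once(self, run_id, step_id, effect):
        key = idempotency_key(run_id, step_id)
        if key not in self._results:
            self._results[key] = effect()  # first execution performs the effect
        return self._results[key]          # replays return the recorded result
```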

Observability

Emit spans per plan node: planned start, tool latency, validator result, replan events. Dashboards show critical-path duration and which step types fail most. Link to agent tracing so on-call can diff planned vs executed paths.

Replanning triggers

Static plans rot when the world changes. Replan when:

A validator fails after bounded retries
New user input contradicts assumptions baked into the plan
A tool returns NOT_FOUND or policy denial on a critical path
Cost or step budget exceeds threshold mid-run
External event (pager, webhook) marks a dependency stale

Replanning should pass executed history and failure diagnosis to the planner, not restart from zero. Constrain replans: “adjust only nodes downstream of step migrate_db” prevents thrashing. Cap replans per run (e.g. three) before escalating to human-in-the-loop or handoff.
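
Scoping a replan to downstream nodes can be sketched as a graph traversal plus a budget check. The function names and the shape of the returned dictionary are illustrative assumptions.

```python
from collections import deque


def downstream_of(step_id, depends_on):
    """Steps that transitively depend on step_id (excluding step_id itself)."""
    children = {}
    for node, deps in depends_on.items():
        for dep in deps:
            children.setdefault(dep, []).append(node)
    seen, queue = set(), deque([step_id])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def replan_scope(failed_step, depends_on, replans_so_far, max_replans=3):
    """Decide whether to replan and which nodes the planner may touch."""
    if replans_so_far >= max_replans:
        return {"action": "escalate", "editable": set()}
    return {"action": "replan", "editable": downstream_of(failed_step, depends_on)}
```

The editable set can be passed to the planner as a hard constraint, and any revised plan that touches nodes outside it can be rejected at validation.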

Harbor DevOps refactor (case study)

Harbor’s agent previously interleaved kubectl, Terraform, and Slack tools without an explicit dependency graph. Rollbacks were best-effort prose (“undo last change”) that the model interpreted inconsistently. The refactor made five changes:

Method library — HTN methods for blue_green_deploy, schema_migration, feature_flag_toggle with fixed subtask order.
Plan validator — rejected plans missing health-check steps before traffic shift or missing rollback_hint on mutating nodes.
Segment locks — migration + deploy marked one segment; rollback script ran as atomic pair on failure.
Human gate — production promote required one-click approval tied to plan node promote_canary.
Replan scope — failed canary triggered replan from run_integration_tests onward, not full redeploy from build.
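
The two plan-validator rules described above could be checked roughly as follows. Harbor's actual validator is not shown in this guide, so this is a sketch under assumptions: the mutating flag and the kind values health_check and traffic_shift are hypothetical node annotations.

```python
def ancestors(node_id, by_id):
    """All transitive prerequisites of a node."""
    seen, stack = set(), list(by_id[node_id]["depends_on"])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(by_id[current]["depends_on"])
    return seen


def validate_release_plan(nodes):
    """Return a list of rule violations; an empty list means the plan passes."""
    by_id = {n["id"]: n for n in nodes}
    errors = []
    for node in nodes:
        # Rule 1: every mutating node needs a rollback_hint.
        if node.get("mutating") and not node.get("rollback_hint"):
            errors.append(f"{node['id']}: mutating node lacks rollback_hint")
        # Rule 2: a traffic shift must have a health check among its prerequisites.
        if node.get("kind") == "traffic_shift":
            upstream = ancestors(node["id"], by_id)
            if not any(by_id[a].get("kind") == "health_check" for a in upstream):
                errors.append(f"{node['id']}: no health check before traffic shift")
    return errors
```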

Partial-failure incidents fell from 34% to 6%. Mean time to safe deploy dropped by 22 minutes because operators saw the plan upfront and validators caught drift before full promotion.

Technique decision table

Scenario	Prefer	Avoid
Single lookup or summarize	Direct tool / one-shot	Full HTN planner
3–10 ordered steps, audit needed	Structured plan DAG + PAV	Unbounded ReAct
Parallel research on many files	Map-reduce + subagents	One thread sequential read
Irreversible mutations	Plan with rollback_hint + human gate	Retry same tool blindly
Domain with known playbooks	HTN method library	LLM reinvents steps each run
Fast-changing user chat	Lightweight next-step plan	Freeze 20-step upfront plan
Long-running workflow (hours)	Checkpointed DAG + durable execution	In-memory plan only
Validator failure	Scoped replan downstream	Restart entire run

Common pitfalls

Plans as prose only — cannot resume, diff, or validate automatically.
Over-planning latency — thirty-second planner on a two-step task.
Missing success criteria — executor cannot tell done from stuck.
No rollback on mutating steps — Harbor-style partial failures.
Cyclic dependencies — validate DAG at ingest.
Plan drift from execution — executed path not written back to store.
Replan loops — cap replans; escalate to human.
Giant monolithic plans — delegate leaves to subagents.

Production checklist

Define versioned plan JSON schema with required node fields.
Validate DAG acyclicity and dependency refs at plan ingest.
Require success_criteria on mutating and gate nodes.
Attach rollback_hint or compensating workflow to irreversible steps.
Implement PAV loop with automated validators per critical node.
Persist plan + status in checkpoint store for durable runs.
Map each plan node action_type (tool, subagent, human_gate, verify_only) to an executor dispatcher.
Enforce idempotency keys on side-effecting tools.
Cap replans; route to HITL or handoff on exhaustion.
Emit trace spans per node with planned vs actual timing.
Collapse completed subtrees in planner context to save tokens.
Measure partial-failure rate and critical-path duration pre/post.

Key takeaways

Planning is a durable contract for multi-step agent work, not a chat bullet list.
Harbor cut partial-failure deploys from 34% to 6% with DAG plans and segment rollbacks.
Plan-act-verify separates execution from validation and enables scoped replans.
HTN method libraries encode domain playbooks the LLM should not reinvent.
Bind plans to checkpointing and subagents for long, parallel workflows.