LLM agent subagent delegation explained

Harbor Analytics’ private-market due-diligence agent was a single ReAct loop with forty-seven tools: SEC EDGAR search, news APIs, spreadsheet export, CRM updates, and a final memo writer. A typical engagement pulled twelve 10-K filings, six earnings transcripts, and two hundred news articles into one growing message thread. By step nineteen the parent context exceeded 180K tokens, the context budget allocator started evicting early tool results, and the model cited a revenue figure from the wrong fiscal year. Human reviewers flagged 39% of memos for at least one factual error tied to context confusion or tool misuse.

Subagent delegation splits work: a supervisor agent plans and synthesizes; specialized child agents run in isolated contexts with narrower tool allowlists, fixed output schemas, and independent budgets. The parent sees only structured summaries — not every raw PDF chunk. This guide covers when to delegate versus call a tool inline, spawn-and-wait versus parallel fan-out, permission scoping, parent-child trace linking, durable child runs, failure propagation, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

Delegation vs inline tools

Not every multi-step task needs a subagent. A subagent is justified when the subtask has its own exploration loop (many tool calls whose intermediate reasoning should not pollute the parent), a different skill profile (code execution vs legal summarization), or a security boundary (child may read customer PII; parent may not). Use a direct tool call when the operation is atomic: lookup_ticker, send_slack_message, run_sql_query with a bounded result.

Rule of thumb: if completing the subtask would add more than three tool rounds or more than ~8K tokens of raw observations to the parent, delegate. If the subtask shares the parent’s goal and tools, inline is cheaper and easier to debug.
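The rule of thumb above can be sketched as a small predicate. This is a minimal illustration, not a prescribed implementation: the SubtaskEstimate type, its field names, and should_delegate are hypothetical, and the thresholds simply restate the three-round and ~8K-token figures from the text.

```python
from dataclasses import dataclass

# Thresholds restate the rule of thumb in the text; tune them per deployment.
MAX_INLINE_TOOL_ROUNDS = 3
MAX_INLINE_OBSERVATION_TOKENS = 8_000


@dataclass(frozen=True)
class SubtaskEstimate:
    """Hypothetical planner estimate for one subtask."""
    tool_rounds: int                         # expected tool-call rounds
    observation_tokens: int                  # expected raw tool output added to context
    crosses_security_boundary: bool = False  # e.g. child may read PII the parent may not
    needs_different_skill: bool = False      # e.g. code execution vs legal summarization


def should_delegate(est: SubtaskEstimate) -> bool:
    """Return True when the subtask justifies a subagent rather than an inline tool."""
    if est.crosses_security_boundary or est.needs_different_skill:
        return True
    return (
        est.tool_rounds > MAX_INLINE_TOOL_ROUNDS
        or est.observation_tokens > MAX_INLINE_OBSERVATION_TOKENS
    )
```

An atomic lookup such as lookup_ticker would fall well under both thresholds and stay inline, while a multi-filing research pass would exceed them and be delegated.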

Anti-patterns

  • Subagent per tool call — overhead dominates; use tools.
  • Subagent as a fancy prompt wrapper — if there is no loop, use structured output instead.
  • Deep nesting without budgets — supervisor → researcher → fetcher → parser stacks explode cost.

Supervisor spawn patterns

Production systems converge on a small set of orchestration shapes:

Spawn-and-wait (synchronous delegation)

Parent calls delegate_research(task_spec); runtime starts a child run, blocks until the child returns a result envelope, then continues. Simple to reason about; parent wall-clock includes child runtime. Best for one-off subtasks under a few minutes.

Parallel fan-out (map-reduce)

Parent spawns N children with disjoint scopes (one per competitor, one per document batch), waits on all handles, then merges. Requires explicit merge logic, often a second model call with only child summaries in context. Pair with concurrency caps so an overenthusiastic supervisor does not launch fifty simultaneous research agents.
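A capped fan-out can be sketched with asyncio and a semaphore. This is an illustrative sketch under assumptions: fan_out and the run_child callable are hypothetical names, and a real runtime would start child runs through its own scheduler rather than in-process coroutines.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fan_out(
    scopes: List[str],
    run_child: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 4,
) -> List[dict]:
    """Run one child per scope, with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(scope: str) -> dict:
        async with semaphore:
            try:
                return await run_child(scope)
            except Exception as exc:
                # Convert the failure into an envelope so one bad child
                # does not cancel its siblings or vanish from the merge.
                return {
                    "scope": scope,
                    "status": "failed",
                    "error_code": type(exc).__name__,
                }

    # gather returns results in the same order as the input scopes.
    return await asyncio.gather(*(guarded(scope) for scope in scopes))
```

The parent would then hand only the returned envelopes to its merge step. Failed scopes stay in the result list, so the merge can report them as gaps.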

Pipeline handoff

Child A produces a typed artifact (extracted table JSON); Child B consumes it. The parent orchestrates stage gates but does not re-ingest raw text. Works well for ETL-style agent workflows.

Long-running child with callback

Child runs as a durable workflow for hours; parent checkpoints in waiting_child and resumes on webhook. Essential when subtasks include human review inside the child.
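The wait-and-resume cycle might look like the following sketch. The waiting_child status comes from the text; everything else (the function names, the awaiting_child and child_results keys, the webhook payload shape) is assumed for illustration, and a production system would persist the checkpoint in a durable store rather than a string.

```python
import json


def checkpoint_waiting(parent_state: dict, child_run_id: str) -> str:
    """Serialize the parent state while it waits on a durable child."""
    state = {**parent_state, "status": "waiting_child", "awaiting_child": child_run_id}
    return json.dumps(state)


def resume_on_webhook(serialized: str, webhook: dict) -> dict:
    """Restore the parent and attach the child's envelope when its webhook arrives."""
    state = json.loads(serialized)
    if state.get("status") != "waiting_child":
        raise ValueError("parent is not waiting on a child")
    if state.get("awaiting_child") != webhook.get("child_run_id"):
        raise ValueError("webhook is for a different child run")
    state["child_results"] = state.get("child_results", []) + [webhook["envelope"]]
    state["status"] = "running"
    del state["awaiting_child"]
    return state
```

Checking the child run ID before resuming keeps a stale or duplicate webhook from waking the wrong parent.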

Context isolation and result envelopes

The main benefit of delegation is context firewalling. The child retains full message history for its own loop; the parent receives only a contract-shaped payload:

{
  "child_run_id": "run_child_7c2…",
  "status": "completed",
  "task_spec_hash": "b91e…",
  "summary": "Acme Corp FY2025 revenue $4.2B (+12% YoY) per 10-K Item 8.",
  "structured_facts": [
    { "field": "revenue_fy2025", "value": 4200000000, "unit": "USD",
      "source": "sec://0001234567-25-000042#item8", "confidence": 0.97 }
  ],
  "citations": ["sec://0001234567-25-000042"],
  "tokens_used": { "in": 84000, "out": 3200 },
  "warnings": ["Q3 transcript unavailable; used 10-K only"]
}

Enforce a maximum summary length (tokens or characters) at the runtime boundary. When a child output omits required fields or exceeds bounds, reject it and trigger one repair pass inside the child; surface a failure to the parent only if the repaired output is still invalid. Optionally store the child’s full trace in object storage and link it by child_run_id, so auditors can reach it without loading it into the parent prompt.
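Boundary validation with a single repair pass could be sketched as follows. The field names mirror the example envelope above; the 600-character limit, the choice of required fields, and the repair callable are assumptions made for illustration.

```python
REQUIRED_FIELDS = {"child_run_id", "status", "task_spec_hash",
                   "summary", "structured_facts", "citations"}
REQUIRED_FACT_FIELDS = {"field", "value", "source"}
MAX_SUMMARY_CHARS = 600  # assumed bound; the text leaves the exact limit open


class EnvelopeRejected(Exception):
    """Raised when an envelope is still invalid after one repair pass."""


def validate_envelope(envelope: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - set(envelope))]
    if len(envelope.get("summary", "")) > MAX_SUMMARY_CHARS:
        problems.append(f"summary exceeds {MAX_SUMMARY_CHARS} characters")
    for i, fact in enumerate(envelope.get("structured_facts", [])):
        missing = REQUIRED_FACT_FIELDS - set(fact)
        if missing:
            problems.append(f"fact {i} missing: {sorted(missing)}")
    return problems


def accept_or_repair(envelope: dict, repair) -> dict:
    """Accept a valid envelope, or allow the child exactly one repair attempt."""
    problems = validate_envelope(envelope)
    if not problems:
        return envelope
    repaired = repair(envelope, problems)  # hypothetical hook back into the child
    remaining = validate_envelope(repaired)
    if remaining:
        raise EnvelopeRejected("; ".join(remaining))
    return repaired
```

Passing the problem list to the repair hook gives the child a concrete target for its second attempt, instead of a generic "try again".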

Summarization inside the child should be loss-aware: separate “facts with citations” from “narrative prose.” Parents synthesize narrative; they should not re-derive numbers from prose alone.

Permission scoping and tool allowlists

Each subagent profile declares a capability manifest:

  • Tool allowlist — e.g. edgar_search, fetch_filing_section only; no crm_write.
  • Data scope — tenant ID, folder prefix, row-level filters injected by the runtime, not chosen by the model.
  • Model tier — cheap model for extraction children; capable model for supervisor synthesis.
  • Mutability class — read-only children by default; write tools require elevated profile and often HITL.

The parent’s delegate_* tool should accept only a profile_id and a structured task_spec, not raw tool lists from the model. This prevents a compromised or confused supervisor from spawning a child with admin credentials. Align with sandbox execution when children run code.
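One way to keep capability choices out of the model’s hands is a runtime-owned profile registry. The sketch below is illustrative: the SubagentProfile type, the allowed task_spec keys, and the function names are assumptions, while the tool names edgar_search, fetch_filing_section, and crm_write come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentProfile:
    profile_id: str
    tool_allowlist: frozenset
    model_tier: str
    read_only: bool = True


# Registry owned by the runtime; the model can only name a profile_id.
PROFILES = {
    "filing_researcher": SubagentProfile(
        profile_id="filing_researcher",
        tool_allowlist=frozenset({"edgar_search", "fetch_filing_section"}),
        model_tier="cheap",
    ),
}

# Hypothetical task_spec schema; anything else (e.g. "tools") is refused.
ALLOWED_TASK_SPEC_KEYS = {"objective", "entity", "fiscal_year"}


def resolve_spawn(profile_id: str, task_spec: dict, tenant_id: str) -> dict:
    """Build a spawn request from a profile_id; reject model-supplied capabilities."""
    unexpected = set(task_spec) - ALLOWED_TASK_SPEC_KEYS
    if unexpected:
        raise PermissionError(f"task_spec keys not allowed: {sorted(unexpected)}")
    if profile_id not in PROFILES:
        raise KeyError(f"unknown profile: {profile_id}")
    return {
        "profile": PROFILES[profile_id],
        "task_spec": task_spec,
        "data_scope": {"tenant_id": tenant_id},  # injected by the runtime, not the model
    }


def authorize_tool_call(profile: SubagentProfile, tool_name: str) -> None:
    """Raise if the child tries a tool outside its allowlist."""
    if tool_name not in profile.tool_allowlist:
        raise PermissionError(f"{tool_name} not allowed for {profile.profile_id}")
```

Because the tenant ID arrives as a separate runtime argument, a supervisor prompt cannot widen the child’s data scope even if it is manipulated.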

Budget allocation and termination

Subagents need independent ceilings enforced by the runtime, not the prompt:

  • Token budget — hard stop when child exceeds max_tokens_in + max_tokens_out; return partial envelope with status: budget_exceeded.
  • Step budget — max ReAct iterations per child; surfaces status: step_limit so parent can narrow the task.
  • Wall-clock timeout — kill runaway children; parent decides retry vs escalate.
  • Cost attribution — tag child spend with parent_run_id for chargeback and trace dashboards.

The parent’s own termination policy should count delegated work: either pre-reserve budget for expected children or fail closed when remaining budget cannot cover a spawn request.
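The ceilings and the fail-closed reservation could be tracked with two small classes. This is a sketch under assumptions: ChildBudget, ParentBudget, and their method names are hypothetical, the token ceiling is modelled as one combined in-plus-out limit, and the "timeout" status string is invented here (the text names only budget_exceeded and step_limit).

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChildBudget:
    max_tokens: int      # combined input + output ceiling
    max_steps: int       # maximum ReAct iterations
    max_seconds: float   # wall-clock ceiling
    tokens_used: int = 0
    steps_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens_in: int, tokens_out: int) -> None:
        """Record one completed iteration."""
        self.tokens_used += tokens_in + tokens_out
        self.steps_used += 1

    def stop_reason(self, now: Optional[float] = None) -> Optional[str]:
        """Return a status code if the child must stop, else None."""
        now = time.monotonic() if now is None else now
        if self.tokens_used > self.max_tokens:
            return "budget_exceeded"
        if self.steps_used >= self.max_steps:
            return "step_limit"      # no further iteration may start
        if now - self.started_at > self.max_seconds:
            return "timeout"         # assumed status name
        return None


class ParentBudget:
    def __init__(self, total_tokens: int) -> None:
        self.remaining = total_tokens

    def reserve(self, tokens: int) -> bool:
        """Pre-reserve budget for a child; fail closed if it cannot be covered."""
        if tokens > self.remaining:
            return False
        self.remaining -= tokens
        return True
```

The runtime would call stop_reason before each child iteration and refuse a spawn whenever reserve returns False, so neither limit depends on the model obeying its prompt.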

Parent-child tracing and debugging

Nested agents fail opaquely without hierarchy in your observability backend:

  • Propagate trace_id from parent; assign distinct run_id per child.
  • Emit span events: child_spawned, child_completed, child_failed with profile, budgets, and envelope digest.
  • Surface a tree UI: operators expand the supervisor span to see which child hallucinated a citation.
  • Log task_spec_hash so replaying a child does not require reconstructing the parent’s full chat.

When a child fails, return structured errors to the parent (error_code, retryable, user_message) rather than a stack trace string the model might misinterpret as ground truth.
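A structured error and a span event might be shaped as in the sketch below. The error_code, retryable, and user_message fields and the event names come from the text; the exception-to-code mapping, the run ID format, and the helper names are assumptions.

```python
import uuid
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChildError:
    error_code: str
    retryable: bool
    user_message: str


def to_child_error(exc: Exception) -> ChildError:
    """Map an exception to a structured error; never forward the raw stack trace."""
    if isinstance(exc, TimeoutError):
        return ChildError("timeout", True, "The subtask timed out before finishing.")
    if isinstance(exc, PermissionError):
        return ChildError("permission_denied", False,
                          "The subtask tried a tool outside its profile.")
    return ChildError("internal_error", False,
                      "The subtask failed for an internal reason.")


def new_child_run_id() -> str:
    """Distinct run_id per child; the format here is illustrative."""
    return f"run_child_{uuid.uuid4().hex[:12]}"


def span_event(name: str, trace_id: str, child_run_id: str, **attrs) -> dict:
    """Build a span event that carries the parent's trace_id and the child's run_id."""
    return {"event": name, "trace_id": trace_id, "run_id": child_run_id, **attrs}


# Usage: a failed child emits child_failed with the structured error attached.
error = to_child_error(TimeoutError("upstream read timed out"))
event = span_event("child_failed", "trace-123", new_child_run_id(), **asdict(error))
```

The parent prompt would receive only the ChildError fields, while the full exception stays in the child’s out-of-band trace.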

Failure propagation and partial results

Parallel fan-out rarely achieves 100% child success. Parent merge logic should handle:

  • Partial success — memo proceeds with explicit gaps (“Competitor B data unavailable”).
  • Retry policy — retry transient child failures once with narrowed scope; do not blindly re-spawn identical tasks.
  • Supervisor escalation — after two child failures on the same subtask, pause for human routing.
  • Idempotent child spawns — key parent_run_id + subtask_id so resume does not duplicate expensive EDGAR pulls.

Avoid “best effort” silent drops: if a child fails, the parent prompt must include the failure reason in the merge context so the final output does not imply completeness.
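Idempotent spawn keys and a gap-aware merge context can be sketched as follows. The function names and the shape of the merge context are hypothetical; the key is derived from parent_run_id and subtask_id as the text suggests.

```python
import hashlib
import json


def spawn_key(parent_run_id: str, subtask_id: str) -> str:
    """Stable key so a resumed parent reuses an existing child instead of re-spawning."""
    # JSON-encode the pair so ("a:b", "c") and ("a", "b:c") cannot collide.
    payload = json.dumps([parent_run_id, subtask_id])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_merge_context(results: dict) -> dict:
    """Collect facts from successful children and list every failure explicitly."""
    facts, gaps = [], []
    for subtask_id, envelope in results.items():
        if envelope.get("status") == "completed":
            facts.extend(envelope.get("structured_facts", []))
        else:
            gaps.append({
                "subtask_id": subtask_id,
                "status": envelope.get("status", "unknown"),
                "reason": envelope.get("error_code", "unknown"),
            })
    return {"facts": facts, "gaps": gaps, "complete": not gaps}
```

Feeding the gaps list into the merge prompt lets the supervisor write “Competitor B data unavailable” rather than silently producing a memo that looks complete.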

Harbor Analytics refactor

Harbor replaced the monolithic due-diligence loop with a three-tier design:

  • Supervisor — plans sections, spawns children, writes final memo; tools: delegate_*, merge_facts, export_memo only.
  • Filing researcher — read-only SEC tools, returns structured_facts envelope per company per fiscal year.
  • News sentiment researcher — news APIs with 90-day window cap; returns theme tags and cited headlines, not full article bodies.

Parallel fan-out across four target companies cut the wall-clock research phase from 38 to 14 minutes. Parent context at the memo-writing step fell from 182K to 41K tokens (78% reduction). The share of memos flagged by human reviewers for factual errors dropped from 39% to 11%; on spot audits, verified claim accuracy rose from 61% to 89%. Total token spend per engagement rose ~12% (extra supervisor merge calls), but reviewer rework hours fell enough to yield a net positive margin on fixed-fee engagements.

Technique decision table

Need                     | Prefer                             | Avoid
Single API lookup        | Inline tool                        | Subagent spawn
Multi-document research  | Read-only child + fact envelope    | Parent ingests all PDFs
Compare N entities       | Parallel children + merge          | Sequential monolithic loop
Untrusted code execution | Sandboxed child profile            | Parent with shell tool
Hour-long sub-workflow   | Durable child + parent wait state  | Blocking HTTP to child
Debug wrong citation     | Child trace tree by run_id         | Re-run entire supervisor
Strict output schema     | Child with grammar/JSON validation | Free-form prose to parent

Common pitfalls

  • Returning raw child chat to parent — defeats isolation; use envelopes.
  • Unbounded parallel spawns — rate limits and cost spikes; cap concurrency.
  • Same tool allowlist on parent and child — no security or focus benefit.
  • No task_spec versioning — cannot reproduce or audit child inputs.
  • Parent rewrites child numbers — merge from structured_facts, not prose.
  • Missing failure in merge — silent incomplete memos.
  • Orphan child runs — parent crashes; children keep burning budget; tie to parent lifecycle.
  • Depth > 2 without strong need — prefer flat supervisor + specialists.

Production checklist

  • Define delegate-vs-inline heuristics (tool rounds, token volume, security).
  • Ship typed result envelopes with citations, budgets, and status codes.
  • Enforce per-child token, step, and wall-clock limits in the runtime.
  • Scope tool allowlists and data filters by profile_id, not model choice.
  • Propagate trace_id; assign unique run_id per child.
  • Cap parallel spawns; queue excess work with backpressure.
  • Idempotency-key child spawns on parent resume.
  • Handle partial fan-out failures explicitly in parent merge prompts.
  • Store full child traces out-of-band; link from parent span.
  • Attribute child cost to parent_run_id for dashboards.
  • Validate child output schema before parent continues.
  • Cancel children, or guard against orphans, when the parent enters the cancelled state.

Key takeaways

  • Delegate when subtasks need their own loop, not for every tool call.
  • Context isolation via envelopes keeps parents small and reduces cross-contamination.
  • Harbor cut parent context 78% and raised verified accuracy 61% → 89%.
  • Permission scoping and budgets belong in the runtime, not the system prompt.
  • Trace hierarchies make nested failures debuggable without rerunning everything.
