LLM agent dynamic tool selection and routing explained

Harbor Platform ships an internal ops agent wired to 127 tools across billing, CRM, incident paging, and deployment APIs. The first production build passed the entire schema bundle on every model turn. When a support ticket asked the agent to “refund invoice 8842,” the model called stripe_create_subscription instead of stripe_issue_refund because both descriptions mentioned “customer payments.” In an audit, 38% of tool calls hit the wrong namespace or a deprecated endpoint, wasting latency, burning quota, and occasionally mutating the wrong record. After engineering introduced a dynamic tool router that combines permission filtering, embedding-based retrieval, and per-turn top-k exposure, coordinated with context budgets and approval gates, the wrong-tool rate fell to 4.2% and mean prefill tokens per turn dropped 61%.

Dynamic tool selection is the layer that decides which tools the model may see and call on a given turn — distinct from how function calling works or how tasks decompose. Static “register everything” patterns collapse once you cross roughly 15–20 well-described tools: schemas crowd the context window, similar names compete, and the model’s pick accuracy degrades faster than raw parameter-fill accuracy. This guide covers capability registries, routing pipelines, permission envelopes, Harbor Platform’s refactor, a technique decision table, pitfalls, and a production checklist.

Why static full tool lists fail at scale

Three independent pressures break the “expose all tools always” pattern as agent integrations grow:

Context tax — each JSON Schema tool definition costs hundreds to thousands of prefill tokens. Fifty tools can consume more context than the user message and conversation history combined, leaving little room for observations and retrieved knowledge.
Attention dilution — models pick among near-duplicate names (search_contacts vs lookup_contact) and cross-vendor synonyms (jira_create_issue vs linear_create_ticket) with error rates that rise superlinearly with list length.
Security and tenancy — not every principal may invoke every tool. Passing forbidden schemas “just in case” leaks capability hints and invites jailbreak attempts to call them anyway.

Dynamic selection treats the tool surface as a query-dependent view over a canonical registry, recomputed each turn (or each planning phase) rather than a fixed attachment.

The capability registry

Every routable tool should exist once in a registry with metadata the router — not the model — reads:

tool_id — stable internal key; versioned separately from display name.
schema — JSON Schema passed to the provider when the tool is selected; keep descriptions imperative and disambiguating per schema design best practices.
capability_tags — coarse labels (billing.refund, crm.read) for rule filters.
embedding_doc — a short paragraph optimized for retrieval: when to use, when not to use, example utterances, known confusions.
risk_tier — read / write / destructive; drives approval gates and audit policies.
tenant_scope — which orgs, roles, or feature flags may bind this tool at runtime.
dependencies — optional tools that should be co-exposed when this one is selected (e.g. refund + invoice lookup).

The registry is the source of truth; MCP servers, OpenAPI importers, and hand-written adapters all normalize into it. Never maintain parallel tool lists in prompt templates and code — they drift within days.
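
The fields above can be sketched as a small in-memory registry. This is a minimal illustration, not Harbor Platform's implementation: the class names ToolEntry and CapabilityRegistry, the missing_dependencies helper, and the invoice_lookup tool_id are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass
class ToolEntry:
    tool_id: str                # stable internal key
    schema: dict                # JSON Schema sent to the provider when selected
    capability_tags: frozenset  # coarse labels such as "billing.refund"
    embedding_doc: str          # retrieval paragraph read by the router
    risk_tier: RiskTier
    tenant_scope: frozenset     # roles or orgs allowed to bind the tool
    dependencies: tuple = ()    # tool_ids to co-expose with this one


class CapabilityRegistry:
    """Hypothetical single source of truth that every adapter registers into."""

    def __init__(self):
        self._tools = {}

    def register(self, entry):
        # Refuse duplicates so two adapters cannot claim the same key.
        if entry.tool_id in self._tools:
            raise ValueError(f"duplicate tool_id: {entry.tool_id}")
        self._tools[entry.tool_id] = entry

    def get(self, tool_id):
        return self._tools[tool_id]

    def missing_dependencies(self):
        """Map each tool_id to declared dependencies that are not registered."""
        problems = {}
        for tool_id, entry in self._tools.items():
            missing = [d for d in entry.dependencies if d not in self._tools]
            if missing:
                problems[tool_id] = missing
        return problems
```

Running a check like missing_dependencies in CI is one way to catch drift between adapters and registry metadata before it reaches the router.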

Routing pipeline: filter, retrieve, rank, expose

Production routers typically run four stages before the model sees a tool list:

1. Hard permission filter

Intersect the registry with the run’s principal, tenant, and session entitlements. Tools that fail this gate are invisible — not listed with a “denied” flag. This stage is deterministic and logged for compliance.
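
A minimal sketch of this gate, assuming a simplified entitlement model of one required role, an optional feature flag, and a tenant allow-list. The Principal and ToolGrant types and the deploy_rollback tool are hypothetical. Denied tools are dropped from the visible list entirely; the decision survives only in the server-side audit records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    tenant: str
    roles: frozenset
    feature_flags: frozenset


@dataclass(frozen=True)
class ToolGrant:
    tool_id: str
    allowed_tenants: frozenset  # empty set means "any tenant" (assumption)
    required_role: str
    required_flag: Optional[str] = None


def permitted_tools(grants, principal):
    """Return (visible tool_ids, audit records) for one principal."""
    visible, audit = [], []
    for grant in grants:
        tenant_ok = not grant.allowed_tenants or principal.tenant in grant.allowed_tenants
        role_ok = grant.required_role in principal.roles
        flag_ok = grant.required_flag is None or grant.required_flag in principal.feature_flags
        allowed = tenant_ok and role_ok and flag_ok
        # The audit trail records every decision; the model never sees it.
        audit.append({"tool_id": grant.tool_id, "allowed": allowed})
        if allowed:
            visible.append(grant.tool_id)
    return visible, audit
```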

2. Intent and phase gate

A lightweight intent classifier or planner output narrows capability tags: a “deploy status” query should not surface CRM writes even if embeddings are noisy. Multi-phase plans may expose different subsets per step — read tools during research, write tools only after confirmation.
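
One way to express the gate is as two lookup tables applied after the permission filter. In the sketch below the intent labels, phase names, tag prefixes, and tool names are all illustrative assumptions; the intent label is taken as given from whatever classifier or planner produces it.

```python
# Risk tiers each plan phase may expose (assumed phase names).
PHASE_RISK = {
    "research": {"read"},
    "execute": {"read", "write"},
}

# Capability-tag prefixes each intent may touch (assumed intent labels).
INTENT_TAG_PREFIXES = {
    "deploy_status": ("infra.",),
    "refund_request": ("billing.", "crm.read"),
}


def gate_tools(tools, intent, phase):
    """Keep tool_ids whose tags match the intent and whose risk fits the phase."""
    prefixes = INTENT_TAG_PREFIXES.get(intent)  # None means no tag narrowing
    allowed_risk = PHASE_RISK[phase]
    selected = []
    for tool in tools:
        if tool["risk_tier"] not in allowed_risk:
            continue
        tags = tool["capability_tags"]
        if prefixes is not None and not any(tag.startswith(prefixes) for tag in tags):
            continue
        selected.append(tool["tool_id"])
    return selected
```

An unknown intent falls through to the risk gate alone, so a misfiring classifier widens the candidate set for retrieval rather than emptying it.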

3. Embedding retrieval over `embedding_doc`

Embed the user message (plus recent tool results summary) and retrieve top-k candidates from the filtered set. Hybrid sparse+dense helps on SKU codes and ticket IDs. Typical k is 5–12 for a single-turn agent, higher when the planner expects multi-domain work.
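
The following sketch shows the shape of hybrid top-k scoring over embedding_doc text. It is deliberately a toy: character-trigram cosine similarity stands in for a real embedding model, and plain token overlap stands in for BM25. The alpha blend weight and the function names are assumptions for illustration.

```python
import heapq
import math
import re
from collections import Counter


def trigram_vector(text):
    """Toy 'dense' representation: character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def token_overlap(query, doc):
    """Toy 'sparse' score: share of query tokens that appear in the doc."""
    q = set(re.findall(r"[a-z0-9_]+", query.lower()))
    d = set(re.findall(r"[a-z0-9_]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, docs, k=8, alpha=0.5):
    """Return the k best (tool_id, score) pairs from a {tool_id: embedding_doc} map."""
    query_vec = trigram_vector(query)
    scored = [
        (alpha * cosine(query_vec, trigram_vector(doc)) + (1 - alpha) * token_overlap(query, doc), tool_id)
        for tool_id, doc in docs.items()
    ]
    return [(tool_id, round(score, 3)) for score, tool_id in heapq.nlargest(k, scored)]
```

In a real router the docs map would contain only tools that survived the permission filter and the intent gate, so retrieval never scores a tool the principal cannot call.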

4. Re-rank and dependency expand

A cross-encoder or small reranker model scores (query, tool) pairs. Pull in declared dependencies and always include a small core toolkit (e.g. search_runbook, ask_user_clarification) that never gets filtered out. Serialize the final list into the provider’s tools parameter.
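
Dependency expansion and core-toolkit pinning can be sketched as a post-processing step over reranked scores. The reranker itself is out of scope here, so the sketch takes its (tool_id, score) output as input. The function name, the top_n default, and the invoice_lookup and admin_ledger_edit tool_ids are hypothetical; the sketch also assumes core tools are permitted for every principal.

```python
CORE_TOOLKIT = ("search_runbook", "ask_user_clarification")


def build_exposure(ranked, dependencies, allowed, top_n=5, core=CORE_TOOLKIT):
    """Turn reranked scores into the ordered tool_id list sent to the model.

    ranked:       iterable of (tool_id, score) pairs from the reranker
    dependencies: {tool_id: [tool_ids to co-expose]}
    allowed:      set of tool_ids that passed the permission filter
    """
    exposure = []

    def add(tool_id):
        if tool_id not in exposure:
            exposure.append(tool_id)

    best = sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_n]
    for tool_id, _score in best:
        if tool_id not in allowed:
            continue
        add(tool_id)
        # Dependencies ride along only if the principal may call them.
        for dep in dependencies.get(tool_id, ()):
            if dep in allowed:
                add(dep)

    # Core tools are pinned regardless of retrieval score.
    for tool_id in core:
        add(tool_id)
    return exposure
```

Checking dependencies against the allowed set keeps expansion from reintroducing a tool that the permission filter already removed.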

Log the router decision: candidate pool size, retrieved IDs, final exposure set, and scores. When the model calls a tool not in the exposed set, that is a runtime bug — block and alert.
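
A minimal sketch of the decision log and the exposure guard, using only the standard library. The record field names, the logger name, and ToolNotExposedError are assumptions; a production system would route the error log line to its alerting pipeline.

```python
import json
import logging

logger = logging.getLogger("tool_router")


class ToolNotExposedError(RuntimeError):
    """Raised when the model calls a tool outside the current exposure set."""


def log_router_decision(turn_id, pool_size, retrieved, exposed):
    """Emit one structured record per turn; IDs and scores only, no schemas."""
    record = {
        "turn_id": turn_id,
        "pool_size": pool_size,
        "retrieved": [{"tool_id": t, "score": s} for t, s in retrieved],
        "exposed": list(exposed),
    }
    logger.info(json.dumps(record))
    return record


def guard_tool_call(turn_id, tool_id, exposed):
    """Block any call to a tool that was not exposed on this turn."""
    if tool_id not in exposed:
        logger.error(json.dumps({"turn_id": turn_id, "blocked_tool": tool_id}))
        raise ToolNotExposedError(tool_id)
    return tool_id
```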

Hierarchical routing and two-stage select-execute

When tool count reaches hundreds, flat top-k retrieval still confuses sibling APIs. Three patterns scale further:

Taxonomy router — first stage picks a domain (billing, crm, infra); second stage retrieves only within that subtree. Reduces cross-domain false positives at the cost of one extra classifier call.
Meta-tool delegation — expose a single invoke_capability meta-tool plus domain summaries; the runtime resolves the inner tool. Higher latency but caps schema tokens aggressively; pairs well with subagent delegation.
Two-stage select-execute — a small fast model or classifier outputs explicit selected_tool_ids; the main model receives only those schemas on the execution turn. Decouples routing cost from reasoning cost.

Harbor Platform adopted taxonomy routing for ops domains and flat retrieval within each domain, with a shared core toolkit of four utilities on every turn.
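
The taxonomy-then-retrieve arrangement can be sketched in a few lines. This is an illustration under stated assumptions, not Harbor Platform's router: a keyword lookup stands in for the domain classifier, the scoring function is injected by the caller, and the keyword lists and core tool names are examples.

```python
import re

# Assumed keyword lists; a production router would use a trained classifier.
DOMAIN_KEYWORDS = {
    "billing": {"refund", "invoice", "subscription"},
    "crm": {"contact", "lead", "account"},
    "infra": {"deploy", "rollback", "incident"},
}
CORE_TOOLKIT = ("search_runbook", "ask_user_clarification")


def classify_domain(query):
    """Stage 1: pick the domain with the most keyword hits, or None."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    hits = {domain: len(tokens & words) for domain, words in DOMAIN_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def route(query, registry, score, k=8, core=CORE_TOOLKIT):
    """Stage 2: rank tools inside the chosen domain only, then pin the core toolkit.

    registry: {domain: {tool_id: embedding_doc}}
    score:    callable(query, embedding_doc) -> float
    """
    domain = classify_domain(query)
    if domain is None:
        return list(core)  # no confident domain: expose the core toolkit only
    candidates = registry.get(domain, {})
    ranked = sorted(candidates, key=lambda tool_id: score(query, candidates[tool_id]), reverse=True)[:k]
    return ranked + [tool_id for tool_id in core if tool_id not in ranked]
```

Because the second stage never sees tools outside the chosen subtree, a billing query cannot surface a CRM write even when its embedding_doc happens to score well.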

Context budget and schema pruning

Dynamic selection and context management are one problem. After routing:

Strip internal fields from schemas sent to the model — keep full validation schema server-side.
Collapse enum-heavy parameters into free-text plus server-side validation when enums exceed ~20 values.
Defer rarely used optional properties to a second “expand schema” tool the model calls only when needed.
Measure prefill share per route in cost attribution dashboards; regressions often mean a bloated schema slipped back into the default exposure set.
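
The first two pruning rules can be sketched as a recursive copy of the schema. The sketch assumes internal fields are marked with an "x-" key prefix, a convention borrowed from OpenAPI vendor extensions and not something the registry above mandates. The 20-value cutoff mirrors the rule of thumb in the list.

```python
MAX_ENUM_VALUES = 20


def prune_schema(schema):
    """Return a model-facing copy of a JSON Schema.

    Drops keys that start with "x-" and removes enum lists longer than
    MAX_ENUM_VALUES, leaving the declared type so the parameter becomes
    free text. The full schema stays server-side for validation.
    """
    pruned = {}
    for key, value in schema.items():
        if key.startswith("x-"):
            continue
        if key == "enum" and isinstance(value, list) and len(value) > MAX_ENUM_VALUES:
            continue
        if key == "properties" and isinstance(value, dict):
            # Keys here are parameter names, not schema keywords: keep them all.
            pruned[key] = {
                name: prune_schema(sub) if isinstance(sub, dict) else sub
                for name, sub in value.items()
            }
        elif isinstance(value, dict):
            pruned[key] = prune_schema(value)
        elif isinstance(value, list):
            pruned[key] = [prune_schema(v) if isinstance(v, dict) else v for v in value]
        else:
            pruned[key] = value
    return pruned
```

The function builds a new dictionary and leaves its input untouched, so the same registry entry can serve both the model-facing view and server-side validation.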

Harbor Platform refactor walkthrough

The team shipped five changes over three sprints:

Unified registry — 127 adapters normalized; overlapping Stripe and internal billing tools deduplicated, each survivor given a distinct tool_id and explicit “do not use for” lines in its embedding_doc.
Permission envelope middleware — runs before routing; integrates with SSO role claims and per-tenant feature flags.
Hybrid retriever — dense embeddings plus BM25 on tool names and tags; k=8 within the classified domain.
Core toolkit pin — four read-only utilities always exposed; write tools require rerank score above threshold or explicit planner mention.
Wrong-tool telemetry — human reviewers label mispicks; embedding docs and disambiguation strings are revised weekly from those labels.

Wrong-tool rate dropped from 38% to 4.2% on a 600-ticket audit set. Mean agent latency fell 18% because of fewer invalid calls and smaller prefills. The remaining errors were mostly ambiguous user phrasing, such as “cancel the account” (subscription or user login?), which is routed to clarification rather than guessed.

Technique decision table

Approach	Strengths	Weaknesses	Best for
Static full tool list	Simple, no router to maintain	Context bloat, mispicks, tenancy leaks	≤15 tools, single domain
Rule-based tag filter	Deterministic, auditable	Brittle on novel queries	Hard compliance boundaries
Embedding top-k retrieval	Scales to hundreds of tools	Needs good docs, periodic eval	Multi-domain internal agents
Taxonomy + retrieval	Lower cross-domain confusion	Extra classifier latency	100+ tools, clear domains
Meta-tool / subagent	Minimal schema tokens	Two-hop latency, harder debugging	Extreme tool count, MCP federation

Common pitfalls

Router-model skew — router exposes tool A but execution handler only implements B; always validate exposure set against live adapters.
Stale embedding docs — API renamed but retrieval paragraph still says old name; schedule doc updates with adapter releases.
Over-filtering write tools — agent loops forever because the refund tool never surfaces; use planner hints to boost write-tier tools when intent is explicit.
Duplicate semantics — two tools do the same thing with different vendors; models flip a coin. Deprecate one in registry metadata.
No fallback exposure — retrieval returns empty on OOD queries; always widen to core toolkit or ask clarification.
Logging selected schemas in traces — leaks internal API surface; log IDs and scores only.
Ignoring parallel call graphs — router picks tools independently each turn but parallel batches need co-exposed siblings; use dependency expansion.
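
The empty-retrieval pitfall in particular is cheap to guard against. A minimal sketch, assuming the router yields (tool_id, score) pairs: the 0.35 threshold is an arbitrary placeholder to be tuned against a labeled mispick set, and the returned flag is meant to feed whatever alerting the team already runs.

```python
CORE_TOOLKIT = ("search_runbook", "ask_user_clarification")


def select_with_fallback(scored, threshold=0.35, core=CORE_TOOLKIT):
    """Return (exposure list, alert flag).

    Candidates below the threshold are dropped. If none survive, only the
    core toolkit is exposed and the flag is set so the miss can be alerted on.
    """
    kept = [tool_id for tool_id, score in scored if score >= threshold]
    alert = not kept
    exposure = kept + [tool_id for tool_id in core if tool_id not in kept]
    return exposure, alert
```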

Production checklist

Centralize all tools in a versioned capability registry with tags and risk tiers.
Run permission filter before any retrieval or model exposure.
Maintain embedding_doc per tool with use / do-not-use examples.
Choose k and taxonomy depth based on measured mispick rate, not guesswork.
Pin a small core read-only toolkit on every turn.
Log router inputs, candidate sets, final exposure, and model picks for audit.
Block model calls to tools not in the current exposure set.
Integrate high-risk tools with approval gates regardless of router score.
Prune schemas for prefill; validate full args server-side.
Evaluate weekly on labeled mispick set; update docs when APIs change.
Alert when retrieval returns zero candidates above threshold.
Coordinate router output with planner phases and subagent handoffs.

Key takeaways

Tool selection is a systems problem, not a prompt tweak — registries, routers, and permissions belong in middleware.
Static full lists fail on context, accuracy, and security once integrations grow past a single domain.
Filter → retrieve → rank → expose is the standard pipeline; taxonomy and meta-tools extend it further.
Harbor Platform cut wrong-tool picks 38% → 4.2% with unified registry, hybrid retrieval, and core toolkit pinning.
Measure mispicks and prefill tokens together — routing wins that hide needed tools are not wins.