LLM agent dynamic tool selection and routing explained
Harbor Platform ships an internal ops agent wired to 127 tools across
billing, CRM, incident paging, and deployment APIs. The first production
build passed the entire schema bundle on every model turn. Support tickets
asked the agent to “refund invoice 8842” and the model called
stripe_create_subscription instead of stripe_issue_refund because both
descriptions mentioned “customer payments.”
38% of audited tool calls hit the wrong namespace or
deprecated endpoint — wasting latency, burning quota, and occasionally
mutating the wrong record. After engineering introduced a dynamic tool
router with permission filtering, embedding-based retrieval, and per-turn
top-k exposure coordinated with context budgets and approval gates, the
wrong-tool rate fell to 4.2% while mean prefill tokens per turn dropped 61%.
Dynamic tool selection is the layer that decides which tools the model may see and call on a given turn — distinct from how function calling works or how tasks decompose. Static “register everything” patterns collapse once you cross roughly 15–20 well-described tools: schemas crowd the context window, similar names compete, and the model’s pick accuracy degrades faster than raw parameter-fill accuracy. This guide covers capability registries, routing pipelines, permission envelopes, Harbor Platform’s refactor, a technique decision table, pitfalls, and a production checklist.
Why static full tool lists fail at scale
Three independent pressures break the “expose all tools always” pattern as agent integrations grow:
- Context tax: each JSON Schema tool definition costs hundreds to thousands of prefill tokens. Fifty tools can consume more context than the user message and conversation history combined, leaving little room for observations and retrieved knowledge.
- Attention dilution: models pick among near-duplicate names (search_contacts vs lookup_contact) and cross-vendor synonyms (jira_create_issue vs linear_create_ticket) with error rates that rise superlinearly with list length.
- Security and tenancy: not every principal may invoke every tool. Passing forbidden schemas “just in case” leaks capability hints and invites jailbreak attempts to call them anyway.
Dynamic selection treats the tool surface as a query-dependent view over a canonical registry, recomputed each turn (or each planning phase) rather than a fixed attachment.
The capability registry
Every routable tool should exist once in a registry with metadata the router — not the model — reads:
- tool_id: stable internal key; versioned separately from display name.
- schema: JSON Schema passed to the provider when the tool is selected; keep descriptions imperative and disambiguating per schema design best practices.
- capability_tags: coarse labels (billing.refund, crm.read) for rule filters.
- embedding_doc: a short paragraph optimized for retrieval: when to use, when not to use, example utterances, known confusions.
- risk_tier: read / write / destructive; drives approval gates and audit policies.
- tenant_scope: which orgs, roles, or feature flags may bind this tool at runtime.
- dependencies: optional tools that should be co-exposed when this one is selected (e.g. refund + invoice lookup).
The registry is the source of truth; MCP servers, OpenAPI importers, and hand-written adapters all normalize into it. Never maintain parallel tool lists in prompt templates and code — they drift within days.
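A registry entry with the fields listed above could be modeled roughly as follows. This is a minimal sketch, not Harbor Platform's implementation: the ToolEntry, RiskTier, and Registry names, the "role:billing_agent" scope string, and the stripe_get_invoice dependency are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


@dataclass(frozen=True)
class ToolEntry:
    tool_id: str                        # stable internal key
    schema: dict                        # JSON Schema sent to the provider when selected
    capability_tags: frozenset[str]     # coarse labels for rule filters
    embedding_doc: str                  # retrieval paragraph: when to use, when not to
    risk_tier: RiskTier                 # drives approval gates and audit policy
    tenant_scope: frozenset[str]        # roles or flags allowed to bind the tool
    dependencies: tuple[str, ...] = ()  # tools to co-expose with this one


class Registry:
    """Single source of truth: each tool_id may be registered only once."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolEntry] = {}

    def register(self, entry: ToolEntry) -> None:
        if entry.tool_id in self._tools:
            raise ValueError(f"duplicate tool_id: {entry.tool_id}")
        self._tools[entry.tool_id] = entry

    def get(self, tool_id: str) -> ToolEntry:
        return self._tools[tool_id]

    def all(self) -> list[ToolEntry]:
        return list(self._tools.values())


# Hypothetical entry; the scope string and dependency name are illustrative.
refund = ToolEntry(
    tool_id="stripe_issue_refund",
    schema={"type": "object", "properties": {"invoice_id": {"type": "string"}}},
    capability_tags=frozenset({"billing.refund"}),
    embedding_doc="Refund a paid invoice. Do not use to create subscriptions.",
    risk_tier=RiskTier.WRITE,
    tenant_scope=frozenset({"role:billing_agent"}),
    dependencies=("stripe_get_invoice",),
)
registry = Registry()
registry.register(refund)
```

Rejecting duplicate tool_ids at registration time is one way to enforce the "exists once" rule mechanically instead of by convention.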
Routing pipeline: filter, retrieve, rank, expose
Production routers typically run four stages before the model sees a tool list:
1. Hard permission filter
Intersect the registry with the run’s principal, tenant, and session entitlements. Tools that fail this gate are invisible — not listed with a “denied” flag. This stage is deterministic and logged for compliance.
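The permission gate can be sketched as a plain set intersection. The example below assumes a tool is visible when the principal holds at least one of the entitlements in its tenant_scope; the Tool and Principal types, the entitlement strings, and the deploy_rollback tool are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    tool_id: str
    tenant_scope: frozenset[str]  # roles or flags that may bind this tool


@dataclass(frozen=True)
class Principal:
    user_id: str
    entitlements: frozenset[str]  # e.g. derived from SSO role claims


def permission_filter(
    tools: list[Tool], principal: Principal
) -> tuple[list[Tool], list[str]]:
    """Return (visible tools, hidden tool IDs). Hidden IDs go to the audit
    log only; they are never shown to the model, not even as 'denied'."""
    visible: list[Tool] = []
    hidden: list[str] = []
    for tool in tools:
        if tool.tenant_scope & principal.entitlements:
            visible.append(tool)
        else:
            hidden.append(tool.tool_id)
    return visible, hidden


tools = [
    Tool("stripe_issue_refund", frozenset({"role:billing_agent"})),
    Tool("deploy_rollback", frozenset({"role:sre"})),  # hypothetical tool
]
agent = Principal("u-1", frozenset({"role:billing_agent"}))
visible, hidden = permission_filter(tools, agent)
```

Because the function is a pure, deterministic mapping from (registry, principal) to a visible set, its output can be logged and replayed for compliance review.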
2. Intent and phase gate
A lightweight intent classifier or planner output narrows capability tags: a “deploy status” query should not surface CRM writes even if embeddings are noisy. Multi-phase plans may expose different subsets per step — read tools during research, write tools only after confirmation.
3. Embedding retrieval over embedding_doc
Embed the user message (plus a summary of recent tool results) and retrieve top-k candidates from the filtered set. Hybrid sparse+dense retrieval helps on SKU codes and ticket IDs. Typical k is 5–12 for a single-turn agent, higher when the planner expects multi-domain work.
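The retrieval step reduces to scoring each permitted tool's embedding_doc against the query and keeping the k best. The sketch below swaps the embedding model for a toy bag-of-words vector so it runs with the standard library alone; a production router would call a real embedding model instead, and the docs shown are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, docs: dict[str, str], k: int) -> list[tuple[str, float]]:
    """Score every permitted tool's embedding_doc and keep the k best."""
    q = embed(query)
    scored = [(tool_id, cosine(q, embed(doc))) for tool_id, doc in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# docs maps tool_id -> embedding_doc for tools that passed the permission filter.
docs = {
    "stripe_issue_refund": "Refund a paid invoice or charge. Do not use to create subscriptions.",
    "stripe_create_subscription": "Start a new recurring subscription for a customer.",
    "search_contacts": "Search CRM contacts by name or email.",
}
top = retrieve_top_k("refund invoice 8842", docs, k=2)
```

With these docs, only stripe_issue_refund shares terms with the query, so it ranks first. Exact-match behavior like this is what the sparse half of a hybrid retriever contributes for identifiers such as ticket IDs.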
4. Re-rank and dependency expand
A cross-encoder or small reranker model scores (query, tool) pairs.
Pull in declared dependencies and always include a small
core toolkit (e.g. search_runbook,
ask_user_clarification) that never gets filtered out.
Serialize the final list into the provider’s tools parameter.
Log the router decision: candidate pool size, retrieved IDs, final exposure set, and scores. When the model calls a tool not in the exposed set, that is a runtime bug — block and alert.
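Dependency expansion, core toolkit pinning, and the runtime guard on out-of-set calls might look like the sketch below. It assumes dependencies are still subject to the permission filter and that core tools are permitted for every principal; stripe_get_invoice and the function names are hypothetical.

```python
CORE_TOOLKIT = ("search_runbook", "ask_user_clarification")


class ToolNotExposedError(RuntimeError):
    """Raised when the model calls a tool outside the current exposure set."""


def build_exposure_set(
    ranked_ids: list[str],
    dependencies: dict[str, tuple[str, ...]],
    permitted: set[str],
    core: tuple[str, ...] = CORE_TOOLKIT,
) -> list[str]:
    """Ranked tools first, each followed by its permitted dependencies,
    then the pinned core toolkit. Order is preserved, duplicates dropped."""
    exposed: list[str] = []

    def add(tool_id: str) -> None:
        if tool_id not in exposed:
            exposed.append(tool_id)

    for tool_id in ranked_ids:
        if tool_id not in permitted:
            continue
        add(tool_id)
        for dep in dependencies.get(tool_id, ()):
            if dep in permitted:  # dependencies never bypass the permission gate
                add(dep)
    for tool_id in core:  # assumption: core tools are permitted for everyone
        add(tool_id)
    return exposed


def guard_call(tool_id: str, exposed: list[str]) -> None:
    """Block (and let the caller alert on) any call outside the exposure set."""
    if tool_id not in exposed:
        raise ToolNotExposedError(f"{tool_id} is not in the exposure set")


exposed = build_exposure_set(
    ranked_ids=["stripe_issue_refund"],
    dependencies={"stripe_issue_refund": ("stripe_get_invoice",)},
    permitted={"stripe_issue_refund", "stripe_get_invoice"},
)
guard_call("stripe_issue_refund", exposed)  # passes silently
```

Running guard_call in the execution handler, rather than trusting the provider to honor the tools parameter, keeps the block-and-alert rule enforceable even if the model hallucinates a tool name.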
Hierarchical routing and two-stage select-execute
When tool count reaches hundreds, flat top-k retrieval still confuses sibling APIs. Three patterns scale further:
- Taxonomy router: first stage picks a domain (billing, crm, infra); second stage retrieves only within that subtree. Reduces cross-domain false positives at the cost of one extra classifier call.
- Meta-tool delegation: expose a single invoke_capability meta-tool plus domain summaries; the runtime resolves the inner tool. Higher latency but caps schema tokens aggressively; pairs well with subagent delegation.
- Two-stage select-execute: a small fast model or classifier outputs explicit selected_tool_ids; the main model receives only those schemas on the execution turn. Decouples routing cost from reasoning cost.
Harbor Platform adopted taxonomy routing for ops domains and flat retrieval within each domain, with a shared core toolkit of four utilities on every turn.
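A taxonomy router can be illustrated with a two-step lookup: classify the domain, then restrict the candidate pool to that subtree before flat retrieval runs. The keyword classifier below is a deliberately crude stand-in for a trained intent model, and the keyword sets and the deploy_status tool are hypothetical.

```python
from typing import Optional

# Hypothetical keyword sets standing in for a trained domain classifier.
DOMAIN_KEYWORDS = {
    "billing": {"refund", "invoice", "charge", "subscription"},
    "crm": {"contact", "lead", "account"},
    "infra": {"deploy", "rollback", "incident", "page"},
}

TOOLS_BY_DOMAIN = {
    "billing": ["stripe_issue_refund", "stripe_create_subscription"],
    "crm": ["search_contacts", "lookup_contact"],
    "infra": ["deploy_status"],  # hypothetical tool
}


def classify_domain(query: str) -> Optional[str]:
    """Stage 1: pick the domain with the most keyword hits, or None."""
    tokens = set(query.lower().split())
    scores = {d: len(tokens & kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def candidate_pool(query: str) -> tuple[Optional[str], list[str]]:
    """Stage 2 input: only tools inside the chosen subtree are retrievable."""
    domain = classify_domain(query)
    if domain is None:
        return None, []  # caller falls back to the core toolkit or clarification
    return domain, TOOLS_BY_DOMAIN[domain]


domain, pool = candidate_pool("refund invoice 8842")
```

Flat top-k retrieval then runs over pool only, so a CRM tool cannot outrank a billing tool on a billing query no matter how noisy the embeddings are.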
Context budget and schema pruning
Dynamic selection and context management are one problem. After routing:
- Strip internal fields from schemas sent to the model — keep full validation schema server-side.
- Collapse enum-heavy parameters into free-text plus server-side validation when enums exceed ~20 values.
- Defer rarely used optional properties to a second “expand schema” tool the model calls only when needed.
- Measure prefill share per route in cost attribution dashboards; regressions often mean a bloated schema slipped back into the default exposure set.
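The first two pruning rules can be sketched as a recursive copy of the schema. This example assumes internal fields are marked with an "x-" key prefix, which is a convention chosen for illustration; a property that happened to be named with that prefix would also be dropped. The 20-value cutoff follows the guideline above.

```python
INTERNAL_PREFIX = "x-"   # assumption: internal metadata keys use this prefix
MAX_ENUM_VALUES = 20     # larger enums are collapsed to free text


def prune_schema(schema: dict) -> dict:
    """Return a model-facing copy of a JSON Schema. The full schema stays
    server-side and remains the authority for argument validation."""
    pruned: dict = {}
    for key, value in schema.items():
        if key.startswith(INTERNAL_PREFIX):
            continue  # strip internal fields
        if key == "enum" and isinstance(value, list) and len(value) > MAX_ENUM_VALUES:
            continue  # collapse oversized enum; server validates the value
        if isinstance(value, dict):
            pruned[key] = prune_schema(value)
        elif isinstance(value, list):
            pruned[key] = [prune_schema(v) if isinstance(v, dict) else v for v in value]
        else:
            pruned[key] = value
    return pruned


full_schema = {
    "type": "object",
    "x-owner-team": "billing-platform",  # hypothetical internal field
    "properties": {
        "currency": {"type": "string", "enum": [f"C{i:02d}" for i in range(30)]},
        "reason": {"type": "string", "enum": ["duplicate", "fraud", "requested"]},
    },
    "required": ["currency"],
}
model_schema = prune_schema(full_schema)
```

The function builds a new dict instead of mutating its input, so the same registry entry can serve both the model-facing exposure and server-side validation.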
Harbor Platform refactor walkthrough
The team shipped five changes over three sprints:
- Unified registry: 127 adapters normalized; duplicate Stripe and internal billing tools merged under distinct tool_ids with explicit “do not use for” lines in embedding_doc.
- Permission envelope middleware: runs before routing; integrates with SSO role claims and per-tenant feature flags.
- Hybrid retriever: dense embeddings plus BM25 on tool names and tags; k=8 within the classified domain.
- Core toolkit pin: four read-only utilities always exposed; write tools require rerank score above threshold or explicit planner mention.
- Wrong-tool telemetry: human reviewers label mispicks; weekly retrain of embedding docs and disambiguation strings.
Wrong-tool rate dropped from 38% to 4.2% on a 600-ticket audit set. Mean agent latency fell 18% because of fewer invalid calls and smaller prefills. Remaining errors were mostly ambiguous user phrasing (for example, “cancel the account” could mean the subscription or the user login), which the agent routed to clarification rather than guessing.
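The write-tool rule from the core toolkit pin (expose a write tool only on a high rerank score or an explicit planner mention) can be expressed as a small predicate. The threshold value and the Candidate type below are hypothetical; the source does not state the cutoff Harbor Platform used.

```python
from dataclasses import dataclass

WRITE_SCORE_THRESHOLD = 0.6  # hypothetical; tune against a labeled mispick set


@dataclass(frozen=True)
class Candidate:
    tool_id: str
    risk_tier: str        # "read", "write", or "destructive"
    rerank_score: float


def should_expose(candidate: Candidate, planner_mentions: set[str]) -> bool:
    """Read tools always pass. Write and destructive tools need either a
    rerank score at or above the threshold or an explicit planner mention."""
    if candidate.risk_tier == "read":
        return True
    return (
        candidate.rerank_score >= WRITE_SCORE_THRESHOLD
        or candidate.tool_id in planner_mentions
    )


weak_refund = Candidate("stripe_issue_refund", "write", 0.41)
lookup = Candidate("search_contacts", "read", 0.10)
```

Here should_expose(weak_refund, set()) is False, while passing {"stripe_issue_refund"} as planner_mentions flips it to True. That planner override is what keeps an explicit refund request from being filtered out.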
Technique decision table
| Approach | Strengths | Weaknesses | Best for |
|---|---|---|---|
| Static full tool list | Simple, no router to maintain | Context bloat, mispicks, tenancy leaks | ≤15 tools, single domain |
| Rule-based tag filter | Deterministic, auditable | Brittle on novel queries | Hard compliance boundaries |
| Embedding top-k retrieval | Scales to hundreds of tools | Needs good docs, periodic eval | Multi-domain internal agents |
| Taxonomy + retrieval | Lower cross-domain confusion | Extra classifier latency | 100+ tools, clear domains |
| Meta-tool / subagent | Minimal schema tokens | Two-hop latency, harder debugging | Extreme tool count, MCP federation |
Common pitfalls
- Router-model skew — router exposes tool A but execution handler only implements B; always validate exposure set against live adapters.
- Stale embedding docs — API renamed but retrieval paragraph still says old name; schedule doc updates with adapter releases.
- Over-filtering write tools — agent loops forever because refund tool never surfaces; use planner hints to boost risk-tier tools when intent is explicit.
- Duplicate semantics — two tools do the same thing with different vendors; models flip a coin. Deprecate one in registry metadata.
- No fallback exposure — retrieval returns empty on OOD queries; always widen to core toolkit or ask clarification.
- Logging selected schemas in traces — leaks internal API surface; log IDs and scores only.
- Ignoring parallel call graphs — router picks tools independently each turn but parallel batches need co-exposed siblings; use dependency expansion.
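The "no fallback exposure" pitfall can be guarded against with a score threshold plus an unconditional core toolkit. This is a minimal sketch: the function name and threshold are hypothetical, and the alert flag is left for the caller to wire into monitoring.

```python
def select_with_fallback(
    scored: list[tuple[str, float]],
    threshold: float,
    core_toolkit: list[str],
) -> tuple[list[str], bool]:
    """Keep candidates scoring at or above the threshold and always append
    the core toolkit. The flag is True when nothing cleared the threshold,
    so the caller can alert and steer the agent toward clarification."""
    kept = [tool_id for tool_id, score in scored if score >= threshold]
    needs_alert = not kept
    exposure = kept + [t for t in core_toolkit if t not in kept]
    return exposure, needs_alert


core = ["search_runbook", "ask_user_clarification"]
exposure, needs_alert = select_with_fallback(
    [("stripe_issue_refund", 0.12)], threshold=0.3, core_toolkit=core
)
```

On this out-of-distribution example the refund tool falls below the threshold, so the model sees only the core toolkit and the alert flag is raised.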
Production checklist
- Centralize all tools in a versioned capability registry with tags and risk tiers.
- Run permission filter before any retrieval or model exposure.
- Maintain embedding_doc per tool with use / do-not-use examples.
- Choose k and taxonomy depth based on measured mispick rate, not guesswork.
- Pin a small core read-only toolkit on every turn.
- Log router inputs, candidate sets, final exposure, and model picks for audit.
- Block model calls to tools not in the current exposure set.
- Integrate high-risk tools with approval gates regardless of router score.
- Prune schemas for prefill; validate full args server-side.
- Evaluate weekly on labeled mispick set; update docs when APIs change.
- Alert when retrieval returns zero candidates above threshold.
- Coordinate router output with planner phases and subagent handoffs.
Key takeaways
- Tool selection is a systems problem, not a prompt tweak — registries, routers, and permissions belong in middleware.
- Static full lists fail on context, accuracy, and security once integrations grow past a single domain.
- Filter → retrieve → rank → expose is the standard pipeline; taxonomy and meta-tools extend it further.
- Harbor Platform cut wrong-tool picks 38% → 4.2% with unified registry, hybrid retrieval, and core toolkit pinning.
- Measure mispicks and prefill tokens together — routing wins that hide needed tools are not wins.
Related reading
- LLM function calling explained — schemas, call loops, and provider APIs
- LLM agent planning and task decomposition explained — when to expose tools per plan phase
- LLM agent permission scoping and tool approval gates explained — entitlements before routing
- LLM intent classification and query routing explained — domain gates upstream of retrieval