Guide
LLM dynamic tool selection explained
Harbor Support's enterprise rollout connected 140 tools across billing, logistics,
identity, and partner APIs through an MCP-style catalog. Schemas were well designed;
validation error rates were low. Yet first-try tool success on refund tickets sat at 54%.
The model repeatedly chose reprint_shipping_label when the customer needed
initiate_chargeback_review, or called list_all_products before
a simple get_order. Latency climbed because every turn stuffed 38K tokens of
tool definitions into the prompt, crowding out policy snippets and order JSON.
The fix was not another schema rewrite. It was dynamic tool selection: choosing which subset of the catalog to expose on each turn based on user intent, session state, and a hard token budget. After adding retrieval over tool embeddings, a lightweight intent router, and pinned “always available” core tools, exposed tools dropped to 8–12 per turn and first-try success rose to 89%.
This guide covers what dynamic selection is, the stages of a selection pipeline, retrieval and routing patterns, session allowlists, the Harbor Support refactor, a decision table comparing each technique with static full-catalog exposure, common pitfalls, and a production checklist.
What dynamic tool selection is
Dynamic tool selection is the layer that decides which tool definitions
the model sees on a given turn. It sits between your full tool registry (which may hold
hundreds of connectors) and the function calling runtime that parses tool_calls from the response.
Good tool schema design tells the model how to call a tool once it is visible. Selection answers a different question: which tools should be visible right now? Without selection, large catalogs create three failures:
- Choice overload — models pick plausible-sounding but wrong tools when dozens of similar names compete.
- Context starvation — tool JSON consumes tokens that should carry customer data, retrieved docs, and system policy.
- Latency and cost — bigger prompts slow prefill; more tools increase bad-call retries.
Selection is not the same as query routing to different models or pipelines, though intent classifiers often drive it. It is also distinct from parallel tool calling, which schedules multiple calls among tools already exposed.
Selection pipeline stages
A production selector runs as a deterministic pre-step before each model call:
- Collect candidates — start from the tenant's full registry (MCP servers, OpenAPI imports, hand-written tools).
- Filter by policy — role-based allowlists, feature flags, and kill switches remove forbidden tools before the model sees them.
- Score relevance — rank remaining tools against the latest user message, conversation summary, and structured session state (order ID known, refund stage, etc.).
- Apply budget — take top-k until a token ceiling for tool definitions is reached.
- Pin core tools — merge non-negotiable utilities (e.g. escalate_to_human, get_order) that must never be pruned.
- Log the slate — record which tools were offered vs called for offline eval and regression tests.
The output is a fresh tools array per request. Cached conversation state
should store selected tool IDs from the prior turn only for debugging, not as a hard
lock — user intent can shift abruptly mid-thread.
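The stages above can be sketched in a few lines of Python. This is a minimal illustration, not Harbor's implementation: the Tool class, the select_tools function, the token costs, and the relevance scores are all hypothetical, and scores are assumed to come from whatever ranker you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    token_cost: int                          # tokens used by the serialized definition
    roles: frozenset = frozenset({"agent"})  # roles allowed to see this tool


def select_tools(registry, role, scores, pinned, budget):
    """Return (offered tools, log record) for one turn."""
    # Stages 1-2: collect candidates, then filter by policy before scoring.
    allowed = [t for t in registry if role in t.roles]
    # Stage 3: rank non-pinned tools by relevance (higher score is better).
    ranked = sorted(
        (t for t in allowed if t.name not in pinned),
        key=lambda t: scores.get(t.name, 0.0),
        reverse=True,
    )
    # Stage 4: take tools in rank order until the token ceiling is reached.
    slate, used = [], 0
    for tool in ranked:
        if used + tool.token_cost > budget:
            break
        slate.append(tool)
        used += tool.token_cost
    # Stage 5: pinned tools sit outside the budget but still pass the policy filter.
    core = [t for t in allowed if t.name in pinned]
    offered = core + slate
    # Stage 6: record the slate for offline eval.
    log = {
        "offered": [t.name for t in offered],
        "tool_tokens": used + sum(t.token_cost for t in core),
    }
    return offered, log


registry = [
    Tool("get_order", 120),
    Tool("escalate_to_human", 90),
    Tool("initiate_chargeback_review", 300),
    Tool("reprint_shipping_label", 250),
    Tool("delete_customer_data", 200, frozenset({"admin"})),
]
scores = {
    "initiate_chargeback_review": 0.91,
    "reprint_shipping_label": 0.40,
    "delete_customer_data": 0.95,
}
offered, log = select_tools(
    registry, "agent", scores,
    pinned={"get_order", "escalate_to_human"}, budget=400,
)
print(log["offered"])
# ['get_order', 'escalate_to_human', 'initiate_chargeback_review']
```

Note that delete_customer_data has the highest score but never reaches the ranker, because the policy filter runs first.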
Retrieval-based tool ranking
Once a catalog grows past roughly 20 tools, keyword matching alone becomes unreliable,
because users rarely phrase requests in a tool's own vocabulary. Embed each tool's
name, description, and optional tags into a vector index.
At selection time, embed the query (latest user message plus a one-line session summary)
and retrieve top-k tools by cosine similarity.
Practical refinements:
- Hybrid retrieval — BM25 on tool names catches exact SKU and API identifiers embeddings miss; fuse with reciprocal rank fusion.
- Hard negatives in the index — store “do not use for refunds” phrases in descriptions of logistics-only tools to push them down on billing queries.
- Re-rank top 20 — a small cross-encoder or cheap LLM grader on shortlists beats bi-encoder alone for near-duplicate tool names.
- Refresh on session events — when order_id is extracted, boost order-scoped tools without re-embedding the whole catalog.
Harbor Support indexes 140 tools with 384-dim embeddings, retrieves 25, re-ranks to 10, then merges 3 pinned core tools. Total tool-definition tokens stay under 2,400.
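As a rough sketch of hybrid retrieval, the snippet below ranks tools by cosine similarity and fuses that ranking with a keyword ranking using reciprocal rank fusion. The three-dimensional vectors are toy values, the keyword ranking is assumed to come from a separate BM25 pass that is not shown, and the function names are illustrative. The constant 60 is the value conventionally used for reciprocal rank fusion.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_embedding(query_vec, tool_vecs, top_k):
    """Return the top_k tool names ordered by similarity to the query."""
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]


def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse ranked lists: each list adds 1 / (rrf_k + rank) to a tool's score."""
    fused = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            fused[name] = fused.get(name, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Toy embeddings; a real index would hold vectors from an embedding model.
tool_vecs = {
    "initiate_chargeback_review": [0.9, 0.1, 0.0],
    "issue_refund": [0.8, 0.2, 0.1],
    "reprint_shipping_label": [0.1, 0.9, 0.2],
}
query_vec = [1.0, 0.0, 0.0]

dense = rank_by_embedding(query_vec, tool_vecs, top_k=3)
keyword = ["issue_refund"]  # assumed output of a BM25 pass over tool names
fused = reciprocal_rank_fusion([dense, keyword])
print(fused)
# ['issue_refund', 'initiate_chargeback_review', 'reprint_shipping_label']
```

Here the exact keyword match lifts issue_refund above the tool that the embedding ranking alone placed first, which is the behavior hybrid retrieval is meant to provide.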
Intent routing and hierarchical selection
Embedding retrieval works well for open-ended chat. Structured support flows benefit from an explicit intent layer:
| Pattern | Mechanism | Best when |
|---|---|---|
| Flat top-k | Single retrieval pass, merge pins | <40 tools, general assistants |
| Intent → bundle | Classifier picks “refund” → expose refund bundle only | Repeatable workflows, compliance boundaries |
| Two-stage router | Category tool first (select_domain), then domain tools | Very large catalogs, strict separation of duties |
| Planner loop | ReAct agent requests list_tools with a filter | Research agents, exploratory coding |
| Static partition | Per-tenant hardcoded sets by product surface | Small known SKUs, highest reliability |
Harbor uses intent → bundle for the top eight ticket types (refund, shipping delay, account access, etc.) and falls back to embedding retrieval for “other.” Bundles are versioned JSON lists maintained by ops, not inferred at runtime.
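One way to express the intent → bundle pattern with a retrieval fallback is shown below. The bundle contents, version numbers, the track_shipment tool, and the confidence threshold are all hypothetical; the source only states that bundles are versioned lists maintained by ops and that unmatched tickets fall back to retrieval.

```python
# Hypothetical versioned bundles; real lists would be maintained by ops.
BUNDLES = {
    "refund": {
        "version": 3,
        "tools": ["get_order", "issue_refund", "initiate_chargeback_review"],
    },
    "shipping_delay": {
        "version": 1,
        "tools": ["get_order", "track_shipment", "reprint_shipping_label"],
    },
}


def tools_for_turn(intent, confidence, retrieve, threshold=0.8):
    """Return (tool names, source label) for this turn.

    Uses the bundle when the classifier is confident and the intent has one;
    otherwise calls retrieve(), a zero-argument fallback supplied by the caller.
    """
    bundle = BUNDLES.get(intent)
    if bundle is not None and confidence >= threshold:
        return list(bundle["tools"]), f"bundle:{intent}@v{bundle['version']}"
    return retrieve(), "retrieval"


tools, source = tools_for_turn("refund", 0.93, retrieve=lambda: [])
print(source, tools)
# bundle:refund@v3 ['get_order', 'issue_refund', 'initiate_chargeback_review']
```

Returning the source label alongside the tools makes it easy to log which path produced each slate and which bundle version was in force.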
Session state, allowlists, and safety
Selection must respect authorization, not just relevance:
- Principal-scoped registry — agents for read-only roles never receive write tools in the candidate set, regardless of retrieval score.
- Stage gates — issue_refund appears only after get_order succeeds and policy checks pass server-side.
- Cooldown on dangerous tools — after delete_customer_data is offered once and rejected by policy, remove it for the session.
- Cross-tenant isolation — tool metadata is tenant-keyed; embedding indexes must not leak descriptions across customers.
Pair dynamic selection with guardrails on execution: hiding a tool from the model is not a substitute for server-side permission checks on every handler.
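The sketch below shows one way to share a single permission check between the selector and the execution path, so that a hidden tool is also an unexecutable tool. The Session class, the role name billing_agent, and the policy tables are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    role: str
    completed: set = field(default_factory=set)  # tools that succeeded this session
    blocked: set = field(default_factory=set)    # tools removed after a policy rejection


# Hypothetical policy tables.
WRITE_ROLES = {"issue_refund": {"billing_agent"}}
PREREQUISITES = {"issue_refund": {"get_order"}}


def is_permitted(tool, session):
    """True if the tool may be offered to, and executed for, this session."""
    if tool in session.blocked:
        return False
    if tool in WRITE_ROLES and session.role not in WRITE_ROLES[tool]:
        return False
    # Stage gate: every prerequisite must already have succeeded.
    return PREREQUISITES.get(tool, set()) <= session.completed


def visible_tools(candidates, session):
    """Selector side: drop tools the session may not use right now."""
    return [t for t in candidates if is_permitted(t, session)]


def execute(tool, session, handler):
    """Server side: re-check before running, regardless of what was offered."""
    if not is_permitted(tool, session):
        raise PermissionError(f"{tool} is not permitted in this session state")
    result = handler()
    session.completed.add(tool)
    return result


session = Session(role="billing_agent")
before = visible_tools(["get_order", "issue_refund"], session)
execute("get_order", session, lambda: {"order_id": "A1"})
after = visible_tools(["get_order", "issue_refund"], session)
print(before, after)
# ['get_order'] ['get_order', 'issue_refund']
```

Because execute calls the same is_permitted check, a request that names a tool the model was never shown still fails with PermissionError.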
Harbor Support enterprise catalog refactor
Before selection, Harbor passed all 140 tool schemas on every turn. Median prompt size was 41K tokens; p95 time-to-first-token exceeded 4.2s. Wrong-tool rate on billing intents was 38%.
After refactor:
- Indexed tools with name, description, domain tag, and example user phrases.
- Added intent classifier (fine-tuned small model) for eight high-volume ticket types.
- Pinned get_order, lookup_customer, escalate_to_human on every turn.
- Capped tool-definition budget at 2,500 tokens; overflow drops lowest-scored tools.
- Logged offered_tools vs called_tool per turn for weekly eval.
Results on a 2,000-ticket holdout: first-try correct tool 89% (was 54%), median latency 1.8s (was 4.2s), validation errors unchanged at 4% (schema quality was already good). The lesson: selection and schema design are complementary investments.
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Expose full catalog | ≤15 tools, single domain | Enterprise MCP meshes, multi-tenant SaaS |
| Embedding retrieval + top-k | 40–500 tools, varied user phrasing | Strict compliance needs hard bundles only |
| Intent → tool bundles | Repeatable workflows, audit requirements | Exploratory coding agents, open research |
| Two-stage router tool | 1,000+ tools, clear domain boundaries | Latency-sensitive voice interfaces |
| Static per-surface sets | Known UI with fixed capabilities | Users expect open-ended “do anything” agents |
Common pitfalls
- Sticky selection — reusing the previous turn's tool set when the user changes topic; re-run selection every turn.
- Pruning core tools — retrieval drops escalate_to_human; maintain a pin list outside the budget cap.
- Embedding stale descriptions — schema changes without re-indexing send the model outdated triggers.
- Over-bundling — intent bundles with 40 tools recreate overload; keep bundles under 12.
- Selection without execution auth — hiding admin_delete from the prompt but leaving the HTTP endpoint open.
- No offline eval — cannot detect recall@k regressions when adding tools; maintain golden queries per intent.
- Ignoring tool-call history — after a failed call, boost sibling tools or surface error-specific alternatives on retry.
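To make the offline-eval pitfall concrete, here is a small sketch of the two metrics mentioned in this guide, recall@k over golden queries and wrong-tool rate over logged turns. The golden queries, the log field expected_tool, and the tools track_shipment and reset_password are invented for the example, and the selector is stubbed with a dictionary lookup.

```python
def recall_at_k(golden, select, k):
    """Fraction of golden queries whose expected tool is in the top-k slate.

    golden is a list of (query, expected_tool) pairs; select(query) returns
    a ranked list of tool names.
    """
    hits = sum(1 for query, expected in golden if expected in select(query)[:k])
    return hits / len(golden)


def wrong_tool_rate(turn_logs):
    """Share of turns with a tool call where the call missed the labeled tool."""
    called = [t for t in turn_logs if t["called_tool"] is not None]
    wrong = sum(1 for t in called if t["called_tool"] != t["expected_tool"])
    return wrong / len(called) if called else 0.0


golden = [
    ("refund my order", "issue_refund"),
    ("dispute this charge", "initiate_chargeback_review"),
    ("where is my package", "track_shipment"),
    ("reset my password", "reset_password"),
]
# Stub selector: ranked slates a real selector might return for each query.
slates = {
    "refund my order": ["issue_refund", "get_order"],
    "dispute this charge": ["issue_refund", "initiate_chargeback_review"],
    "where is my package": ["reprint_shipping_label", "get_order"],
    "reset my password": ["reset_password"],
}
turn_logs = [
    {"called_tool": "issue_refund", "expected_tool": "issue_refund"},
    {"called_tool": "reprint_shipping_label", "expected_tool": "initiate_chargeback_review"},
    {"called_tool": None, "expected_tool": "get_order"},
]

print(recall_at_k(golden, slates.get, k=2))  # 0.75
print(recall_at_k(golden, slates.get, k=1))  # 0.5
print(wrong_tool_rate(turn_logs))            # 0.5
```

Running recall@k at several values of k before and after each registry change shows whether a newly added tool is displacing the expected one from the slate.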
Production checklist
- Measure tool-definition token share of total prompt; set a hard ceiling.
- Index every tool with name, description, domain tags, and example phrases.
- Implement policy filter before relevance scoring (role, tenant, feature flags).
- Pin safety and escalation tools outside the retrieval budget.
- Re-run selection on each turn using latest user message and session summary.
- Log offered tool IDs, scores, and final called tool for every turn.
- Maintain golden-query eval sets per intent; track recall@k and wrong-tool rate.
- Re-index embeddings when schemas or descriptions change.
- Stage-gate write tools behind successful read prerequisites server-side.
- Fall back to a smaller static bundle if retrieval service is down.
- A/B test top-k and bundle sizes against latency and first-try success.
- Document bundle ownership and review cadence with ops and compliance.
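Two of the checklist items lend themselves to a short sketch: falling back to a static bundle when retrieval is unavailable, and measuring the tool-definition share of the prompt. The function names and the choice of exceptions are assumptions; the fallback list reuses the three tools Harbor pins on every turn, and the token figures are the pre-refactor numbers quoted earlier in this guide.

```python
# Static fallback slate, assumed here to match the pinned core tools.
STATIC_FALLBACK = ["get_order", "lookup_customer", "escalate_to_human"]


def select_with_fallback(retrieve, query):
    """Return (tool names, source); use the static slate if retrieval fails."""
    try:
        return retrieve(query), "retrieval"
    except (ConnectionError, TimeoutError):
        return list(STATIC_FALLBACK), "static_fallback"


def tool_token_share(tool_tokens, prompt_tokens):
    """Fraction of the prompt consumed by tool definitions."""
    return tool_tokens / prompt_tokens


def failing_retrieve(query):
    # Stand-in for a retrieval service that is down.
    raise TimeoutError("retrieval service unavailable")


tools, source = select_with_fallback(failing_retrieve, "refund my order")
print(source, tools)
# static_fallback ['get_order', 'lookup_customer', 'escalate_to_human']

# 38K tokens of tool definitions in a 41K-token prompt.
print(round(tool_token_share(38_000, 41_000), 2))
# 0.93
```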
Key takeaways
- Dynamic tool selection chooses which schemas the model sees each turn — essential once catalogs exceed a dozen tools.
- Combine embedding retrieval, intent bundles, and pinned core tools; no single pattern scales every catalog.
- Harbor Support raised first-try tool success from 54% to 89% without retraining the main model, by pruning 140 tools down to 8–12 per turn.
- Selection saves context for customer data and policy; it does not replace schema quality or server-side authorization.
- Log offered vs called tools and run recall@k evals whenever the registry grows.
Related reading
- LLM tool schema design explained — JSON Schema structure and parameter patterns once a tool is selected
- LLM function calling explained — runtime parsing and execution of tool_calls
- LLM intent classification and query routing explained — routing user messages to pipelines and bundles
- LLM parallel tool calling explained — scheduling multiple calls among exposed tools