Guide

LLM dynamic tool selection explained

Harbor Support's enterprise rollout connected 140 tools across billing, logistics, identity, and partner APIs through an MCP-style catalog. Schemas were well designed; validation error rates were low. Yet first-try tool success on refund tickets sat at 54%. The model repeatedly chose reprint_shipping_label when the customer needed initiate_chargeback_review, or called list_all_products before a simple get_order. Latency climbed because every turn stuffed 38K tokens of tool definitions into the prompt, crowding out policy snippets and order JSON.

The fix was not another schema rewrite. It was dynamic tool selection: choosing which subset of the catalog to expose on each turn based on user intent, session state, and a hard token budget. After adding retrieval over tool embeddings, a lightweight intent router, and pinned “always available” core tools, exposed tools dropped to 8–12 per turn and first-try success rose to 89%. This guide covers what dynamic selection is, selection pipeline stages, retrieval and routing patterns, session allowlists, the Harbor Support refactor, a technique decision table versus static full-catalog exposure, pitfalls, and a production checklist.

What dynamic tool selection is

Dynamic tool selection is the layer that decides which tool definitions the model sees on a given turn. It sits between your full tool registry (which may hold hundreds of connectors) and the function calling runtime that parses tool_calls from the response.

Good tool schema design tells the model how to call a tool once it is visible. Selection answers a different question: which tools should be visible right now? Without selection, large catalogs create three failures:

Choice overload — models pick plausible-sounding but wrong tools when dozens of similar names compete.
Context starvation — tool JSON consumes tokens that should carry customer data, retrieved docs, and system policy.
Latency and cost — bigger prompts slow prefill; more tools increase bad-call retries.

Selection is not the same as query routing to different models or pipelines, though intent classifiers often drive it. It is also distinct from parallel tool calling, which schedules multiple calls among tools already exposed.

Selection pipeline stages

A production selector runs as a deterministic pre-step before each model call:

Collect candidates — start from the tenant's full registry (MCP servers, OpenAPI imports, hand-written tools).
Filter by policy — role-based allowlists, feature flags, and kill switches remove forbidden tools before the model sees them.
Score relevance — rank remaining tools against the latest user message, conversation summary, and structured session state (order ID known, refund stage, etc.).
Apply budget — take top-k until a token ceiling for tool definitions is reached.
Pin core tools — merge non-negotiable utilities (e.g. escalate_to_human, get_order) that must never be pruned.
Log the slate — record which tools were offered vs called for offline eval and regression tests.

The output is a fresh tools array per request. Cached conversation state should store selected tool IDs from the prior turn only for debugging, not as a hard lock — user intent can shift abruptly mid-thread.
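
The stages above can be sketched as a single function. This is a minimal illustration, not Harbor's implementation: the `Tool` fields, the `is_allowed` and `score` callables, and the precomputed `token_cost` are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    token_cost: int       # assumed: precomputed size of the JSON definition
    write: bool = False   # hypothetical flag consumed by the policy filter


def select_tools(
    registry: list[Tool],
    is_allowed: Callable[[Tool], bool],
    score: Callable[[Tool], float],
    pinned: set[str],
    token_budget: int,
) -> tuple[list[Tool], dict]:
    # Collect candidates, then drop anything policy forbids.
    allowed = [t for t in registry if is_allowed(t)]

    # Pinned tools sit outside the budget, but they never bypass policy.
    slate = [t for t in allowed if t.name in pinned]

    # Rank the remaining tools by relevance, highest first.
    ranked = sorted(
        (t for t in allowed if t.name not in pinned),
        key=score,
        reverse=True,
    )

    # Take tools in rank order until the token ceiling is reached.
    spent = 0
    for tool in ranked:
        if spent + tool.token_cost > token_budget:
            break
        slate.append(tool)
        spent += tool.token_cost

    # Return a log record alongside the slate for offline eval.
    log = {"offered": [t.name for t in slate], "budget_tokens_used": spent}
    return slate, log


registry = [
    Tool("get_order", "Fetch an order by ID", 120),
    Tool("escalate_to_human", "Hand off to a human agent", 90),
    Tool("initiate_chargeback_review", "Open a chargeback review", 300),
    Tool("reprint_shipping_label", "Reprint a shipping label", 250),
    Tool("delete_customer_data", "Erase customer data", 200, write=True),
]
toy_scores = {"initiate_chargeback_review": 0.9, "reprint_shipping_label": 0.4}

slate, log = select_tools(
    registry,
    is_allowed=lambda t: not t.write,              # read-only principal
    score=lambda t: toy_scores.get(t.name, 0.0),   # stand-in relevance score
    pinned={"get_order", "escalate_to_human"},
    token_budget=400,
)
```

With a 400-token budget the slate holds the two pinned tools plus `initiate_chargeback_review`; `reprint_shipping_label` would overflow the ceiling, and `delete_customer_data` never reaches scoring because the policy filter removes it first.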

Retrieval-based tool ranking

Once a catalog grows past roughly 20 tools, keyword matching alone becomes unreliable, because users rarely phrase requests in the tools' own vocabulary. Embed each tool's name, description, and optional tags into a vector index. At selection time, embed the query (latest user message plus a one-line session summary) and retrieve top-k tools by cosine similarity.
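
A minimal sketch of the ranking step, using only the standard library. The three-dimensional vectors are toy stand-ins for real embeddings, and `top_k_tools` is a hypothetical helper name; a production index would use an embedding model and a vector store instead of a dict.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_tools(
    query_vec: list[float],
    tool_vecs: dict[str, list[float]],
    k: int,
) -> list[tuple[str, float]]:
    # Score every tool against the query and keep the k most similar.
    scored = [(name, cosine(query_vec, vec)) for name, vec in tool_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embeddings of name + description + tags.
tool_vecs = {
    "initiate_chargeback_review": [0.9, 0.1, 0.0],
    "reprint_shipping_label": [0.1, 0.9, 0.0],
    "get_order": [0.5, 0.5, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of a billing complaint

shortlist = top_k_tools(query_vec, tool_vecs, k=2)
```

Here the shortlist ranks `initiate_chargeback_review` first and `get_order` second, while `reprint_shipping_label` falls outside the top two.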

Practical refinements:

Hybrid retrieval: BM25 on tool names catches exact SKU and API identifiers that embeddings miss; fuse the two rankings with reciprocal rank fusion.
Negative guidance in descriptions: phrases such as “do not use for refunds” on logistics-only tools help a re-ranker or the model reject them on billing queries. Bi-encoder embeddings handle negation poorly, so such a phrase can raise similarity to refund queries; check the effect against golden queries, or keep the phrase out of the embedded text.
Re-rank top 20: a small cross-encoder or cheap LLM grader applied to the shortlist usually separates near-duplicate tool names better than a bi-encoder alone.
Refresh on session events: when order_id is extracted, boost order-scoped tools without re-embedding the whole catalog.
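
Reciprocal rank fusion, mentioned in the first refinement, can be sketched in a few lines. Each tool earns 1 / (k + rank) from every ranking it appears in; k = 60 is the conventional default for the constant. The two input rankings and the tool name `get_order_items` are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Sum 1 / (k + rank) over every ranking a tool appears in (rank starts at 1).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical rankings from a BM25 pass and an embedding pass.
bm25_ranking = ["get_order", "get_order_items", "initiate_chargeback_review"]
dense_ranking = ["initiate_chargeback_review", "get_order", "issue_refund"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Tools that both retrievers return rise to the top of `fused`, so `get_order` and `initiate_chargeback_review` outrank the two tools that only one retriever found.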

Harbor Support indexes 140 tools with 384-dim embeddings, retrieves 25, re-ranks to 10, then merges 3 pinned core tools. Total tool-definition tokens stay under 2,400.
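
The retrieve, re-rank, and merge funnel described above might look like the following sketch. The function name `build_slate`, the `rerank_score` callable, and the placeholder tool names are assumptions; only the 25, 10, and 3 sizes and the pinned tool names come from the Harbor description.

```python
from typing import Callable


def build_slate(
    retrieved: list[str],
    rerank_score: Callable[[str], float],
    pinned: list[str],
    retrieve_k: int = 25,
    rerank_k: int = 10,
) -> list[str]:
    # Stage 1: keep the top retrieve_k results from the first-pass retriever.
    shortlist = retrieved[:retrieve_k]
    # Stage 2: re-score the shortlist and keep the best rerank_k.
    reranked = sorted(shortlist, key=rerank_score, reverse=True)[:rerank_k]
    # Stage 3: pins go first; skip duplicates the re-ranker also returned.
    seen = set(pinned)
    return list(pinned) + [name for name in reranked if name not in seen]


# Placeholder retrieval output: one pinned tool plus 29 generic names.
retrieved = ["get_order"] + [f"tool_{i:02d}" for i in range(29)]
position = {name: i for i, name in enumerate(retrieved)}

slate = build_slate(
    retrieved,
    rerank_score=lambda name: -position[name],  # stand-in: keep retrieval order
    pinned=["get_order", "lookup_customer", "escalate_to_human"],
)
```

With these sizes the slate never exceeds 13 tools, and it shrinks whenever a pinned tool also appears in the re-ranked list, as `get_order` does here (12 tools).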

Intent routing and hierarchical selection

Embedding retrieval works well for open-ended chat. Structured support flows benefit from an explicit intent layer:

Pattern	Mechanism	Best when
Flat top-k	Single retrieval pass, merge pins	<40 tools, general assistants
Intent → bundle	Classifier picks “refund” → expose refund bundle only	Repeatable workflows, compliance boundaries
Two-stage router	Category tool first (`select_domain`), then domain tools	Very large catalogs, strict separation of duties
Planner loop	ReAct agent requests `list_tools` with a filter	Research agents, exploratory coding
Static partition	Per-tenant hardcoded sets by product surface	Small known SKUs, highest reliability

Harbor uses intent → bundle for the top eight ticket types (refund, shipping delay, account access, etc.) and falls back to embedding retrieval for “other.” Bundles are versioned JSON lists maintained by ops, not inferred at runtime.
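
A minimal sketch of intent → bundle selection with a retrieval fallback. The JSON layout, the version string, and the `tools_for_turn` helper are illustrative assumptions, not Harbor's actual bundle format.

```python
import json
from typing import Callable

# Hypothetical versioned bundle file, as ops might maintain it.
BUNDLES_JSON = """
{
  "version": "example-1",
  "bundles": {
    "refund": ["get_order", "initiate_chargeback_review", "issue_refund"],
    "shipping_delay": ["get_order", "reprint_shipping_label"]
  }
}
"""
BUNDLES = json.loads(BUNDLES_JSON)


def tools_for_turn(
    intent: str,
    message: str,
    retrieve_fallback: Callable[[str], list[str]],
) -> list[str]:
    # Known intents map to a fixed, reviewed bundle.
    bundle = BUNDLES["bundles"].get(intent)
    if bundle is not None:
        return list(bundle)
    # Anything else ("other") falls back to embedding retrieval.
    return retrieve_fallback(message)


def fake_retrieve(message: str) -> list[str]:
    # Stand-in for the embedding retriever used on unclassified tickets.
    return ["lookup_customer", "get_order"]


refund_tools = tools_for_turn("refund", "I want my money back", fake_retrieve)
other_tools = tools_for_turn("other", "can you update my email?", fake_retrieve)
```

Keeping bundles as data rather than code means ops can review and version them without a deploy, which matches the "maintained by ops, not inferred at runtime" constraint.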

Session state, allowlists, and safety

Selection must respect authorization, not just relevance:

Principal-scoped registry: agents in read-only roles never receive write tools in the candidate set, regardless of retrieval score.
Stage gates: issue_refund appears only after get_order succeeds and policy checks pass server-side.
Cooldown on dangerous tools: once a delete_customer_data call has been rejected by policy, remove the tool for the rest of the session.
Cross-tenant isolation: tool metadata is tenant-keyed; embedding indexes must not leak descriptions across customers.
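
The first three rules can be expressed as a visibility filter that runs before relevance scoring. The `Session` fields, the role names, and the `WRITE_TOOLS` set below are hypothetical; a real system would derive them from its own authorization model.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    role: str                              # assumed values: "read_only", "agent"
    order_loaded: bool = False             # set once get_order has succeeded
    policy_checks_passed: bool = False     # set by server-side policy checks
    blocked_tools: set[str] = field(default_factory=set)  # session cooldowns


# Hypothetical classification of which tools mutate state.
WRITE_TOOLS = {"issue_refund", "delete_customer_data"}


def visible_tools(candidates: list[str], session: Session) -> list[str]:
    visible = []
    for name in candidates:
        # Cooldown: tools blocked earlier in the session stay hidden.
        if name in session.blocked_tools:
            continue
        # Principal scope: read-only roles never see write tools.
        if name in WRITE_TOOLS and session.role == "read_only":
            continue
        # Stage gate: refunds need a loaded order and passed policy checks.
        if name == "issue_refund" and not (
            session.order_loaded and session.policy_checks_passed
        ):
            continue
        visible.append(name)
    return visible


candidates = ["get_order", "issue_refund", "delete_customer_data"]
fresh = Session(role="agent")
ready = Session(role="agent", order_loaded=True, policy_checks_passed=True)
viewer = Session(role="read_only", order_loaded=True, policy_checks_passed=True)
```

A fresh agent session sees `get_order` and `delete_customer_data` but not `issue_refund`; the refund tool appears only once both gate flags are set, and the read-only session sees `get_order` alone.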

Pair dynamic selection with guardrails on execution: hiding a tool from the model is not a substitute for server-side permission checks on every handler.
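
As a sketch of that execution-side check, the dispatcher below authorizes every call regardless of which tools were offered. The `ROLE_PERMISSIONS` map, the handler stubs, and the `PermissionDenied` exception are assumptions made for the example.

```python
class PermissionDenied(Exception):
    """Raised when a principal calls a tool it is not authorized to use."""


# Hypothetical role-to-tool permissions; a real system would load these
# from its authorization service rather than a module constant.
ROLE_PERMISSIONS = {
    "read_only": {"get_order"},
    "agent": {"get_order", "issue_refund"},
}

# Stub handlers standing in for real backend calls.
HANDLERS = {
    "get_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "issue_refund": lambda order_id: {"order_id": order_id, "refunded": True},
}


def execute_tool_call(role: str, tool_name: str, args: dict) -> dict:
    # Check permissions on every call, whether or not the tool was offered.
    if tool_name not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionDenied(f"{role} may not call {tool_name}")
    return HANDLERS[tool_name](**args)


order = execute_tool_call("read_only", "get_order", {"order_id": "A1"})
```

If a read-only principal somehow emits an `issue_refund` call, for example through a prompt injection or a stale slate, the dispatcher raises `PermissionDenied` instead of running the handler.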

Harbor Support enterprise catalog refactor

Before selection, Harbor passed all 140 tool schemas on every turn. Median prompt size was 41K tokens; p95 time-to-first-token exceeded 4.2s. Wrong-tool rate on billing intents was 38%.

After refactor:

Indexed tools with name, description, domain tag, and example user phrases.
Added intent classifier (fine-tuned small model) for eight high-volume ticket types.
Pinned get_order, lookup_customer, escalate_to_human on every turn.
Capped tool-definition budget at 2,500 tokens; overflow drops lowest-scored tools.
Logged offered_tools vs called_tool per turn for weekly eval.

Results on a 2,000-ticket holdout: first-try correct tool 89% (was 54%), median latency 1.8s (was 4.2s), validation errors unchanged at 4% (schema quality was already good). The lesson: selection and schema design are complementary investments.
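
Metrics like these can be computed directly from the per-turn logs. The sketch below assumes each record carries the logged `offered_tools` and `called_tool` plus an `expected_tool` label from the holdout set; the label field and the sample records are illustrative, not Harbor data.

```python
def summarize(turn_logs: list[dict]) -> dict[str, float]:
    if not turn_logs:
        raise ValueError("need at least one logged turn")
    total = len(turn_logs)
    # First-try success: the model called the labeled tool.
    correct = sum(1 for t in turn_logs if t["called_tool"] == t["expected_tool"])
    # Selection miss: the labeled tool was never offered, so the model
    # could not have picked it no matter how good the schema was.
    not_offered = sum(
        1 for t in turn_logs if t["expected_tool"] not in t["offered_tools"]
    )
    return {
        "first_try_success": correct / total,
        "expected_tool_not_offered": not_offered / total,
    }


# Four made-up log records for illustration.
sample_logs = [
    {"offered_tools": ["get_order", "issue_refund"],
     "called_tool": "get_order", "expected_tool": "get_order"},
    {"offered_tools": ["get_order", "initiate_chargeback_review"],
     "called_tool": "initiate_chargeback_review",
     "expected_tool": "initiate_chargeback_review"},
    {"offered_tools": ["get_order", "reprint_shipping_label"],
     "called_tool": "reprint_shipping_label",
     "expected_tool": "initiate_chargeback_review"},
    {"offered_tools": ["get_order", "lookup_customer"],
     "called_tool": "lookup_customer", "expected_tool": "lookup_customer"},
]

report = summarize(sample_logs)
```

Splitting wrong calls into "not offered" versus "offered but not chosen" shows whether a regression belongs to the selector or to the schemas and prompt.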

Technique decision table

Approach	Best when	Skip when
Expose full catalog	≤15 tools, single domain	Enterprise MCP meshes, multi-tenant SaaS
Embedding retrieval + top-k	40–500 tools, varied user phrasing	Strict compliance needs hard bundles only
Intent → tool bundles	Repeatable workflows, audit requirements	Exploratory coding agents, open research
Two-stage router tool	1,000+ tools, clear domain boundaries	Latency-sensitive voice interfaces
Static per-surface sets	Known UI with fixed capabilities	Users expect open-ended “do anything” agents

Common pitfalls

Sticky selection: reusing the previous turn's tool set after the user changes topic. Re-run selection every turn.
Pruning core tools: retrieval drops escalate_to_human. Maintain a pin list outside the budget cap.
Embedding stale descriptions: when schemas change without re-indexing, retrieval ranks tools against outdated text.
Over-bundling: intent bundles with 40 tools recreate overload. Keep bundles under 12.
Selection without execution auth: hiding admin_delete from the prompt while leaving the HTTP endpoint open.
No offline eval: without golden queries per intent, recall@k regressions go undetected as tools are added.
Ignoring tool-call history: after a failed call, boost sibling tools or surface error-specific alternatives on retry.
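
An offline recall@k check over golden queries can be very small. In this sketch, `select` is any callable that returns a ranked list of tool names for a query; the golden pairs and the canned rankings are invented for the example.

```python
from typing import Callable


def recall_at_k(
    golden: list[tuple[str, str]],
    select: Callable[[str], list[str]],
    k: int,
) -> float:
    # Fraction of golden queries whose expected tool is in the top k offered.
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(1 for query, expected in golden if expected in select(query)[:k])
    return hits / len(golden)


# Hypothetical golden set: (user query, tool that must be offered).
GOLDEN = [
    ("I was charged twice", "initiate_chargeback_review"),
    ("where is my package", "get_order"),
    ("reprint my label", "reprint_shipping_label"),
]

# Canned rankings standing in for the real selector's output.
FAKE_RANKINGS = {
    "I was charged twice": ["initiate_chargeback_review", "get_order"],
    "where is my package": ["reprint_shipping_label", "get_order"],
    "reprint my label": ["get_order", "lookup_customer", "reprint_shipping_label"],
}


def fake_select(query: str) -> list[str]:
    return FAKE_RANKINGS[query]


recall_2 = recall_at_k(GOLDEN, fake_select, k=2)
recall_3 = recall_at_k(GOLDEN, fake_select, k=3)
```

Running this in CI whenever a tool is added or a description changes turns a silent ranking regression into a failing check.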

Production checklist

Measure tool-definition token share of total prompt; set a hard ceiling.
Index every tool with name, description, domain tags, and example phrases.
Implement policy filter before relevance scoring (role, tenant, feature flags).
Pin safety and escalation tools outside the retrieval budget.
Re-run selection on each turn using latest user message and session summary.
Log offered tool IDs, scores, and final called tool for every turn.
Maintain golden-query eval sets per intent; track recall@k and wrong-tool rate.
Re-index embeddings when schemas or descriptions change.
Stage-gate write tools behind successful read prerequisites server-side.
Fall back to a smaller static bundle if retrieval service is down.
A/B test top-k and bundle sizes against latency and first-try success.
Document bundle ownership and review cadence with ops and compliance.
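
The fallback item in the checklist can be sketched as a thin wrapper around the retriever. The exception types caught and the contents of the static bundle are assumptions; use whatever errors your retrieval client actually raises.

```python
from typing import Callable

# Assumed static bundle: the pinned core tools from the Harbor refactor.
STATIC_FALLBACK = ["get_order", "lookup_customer", "escalate_to_human"]


def select_with_fallback(
    message: str,
    retrieve: Callable[[str], list[str]],
) -> tuple[list[str], str]:
    # Return the slate plus its source so the fallback rate can be logged.
    try:
        return retrieve(message), "retrieval"
    except (ConnectionError, TimeoutError):
        return list(STATIC_FALLBACK), "static_fallback"


def broken_retrieve(message: str) -> list[str]:
    # Simulates an unreachable retrieval service.
    raise ConnectionError("vector index unreachable")


def healthy_retrieve(message: str) -> list[str]:
    return ["initiate_chargeback_review", "get_order"]


degraded_slate, degraded_source = select_with_fallback("refund please", broken_retrieve)
normal_slate, normal_source = select_with_fallback("refund please", healthy_retrieve)
```

Returning the source label alongside the slate lets the logging step record how often the assistant ran in degraded mode.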

Key takeaways

Dynamic tool selection chooses which schemas the model sees each turn. It becomes essential once a catalog outgrows the roughly 15 tools that full-catalog exposure handles well.
Combine embedding retrieval, intent bundles, and pinned core tools; no single pattern scales every catalog.
Harbor Support raised first-try tool success from 54% to 89% without retraining the assistant model, by pruning 140 tools down to 8–12 per turn.
Selection saves context for customer data and policy; it does not replace schema quality or server-side authorization.
Log offered vs called tools and run recall@k evals whenever the registry grows.