LLM tool schema design explained
Harbor Support's refund triage bot had six tools registered and a capable model, yet
31% of tool calls failed validation: invented order IDs, reason fields stuffed
with full customer rants instead of enum values, and amount_cents passed as
strings like “full refund.” Swapping the base model from GPT-4o to a newer
release barely moved the needle. Rewriting the tool schemas — clearer names, tighter
types, explicit format hints in descriptions, and splitting one overloaded
update_order into three atomic tools — dropped bad calls to 4% in a week
without retraining anything.
Tool schema design is the craft of defining what an agent is allowed to invoke: function names, parameter shapes, and the natural-language descriptions models read when choosing arguments. It sits upstream of function-calling runtime plumbing and downstream of product intent: your schema is the contract between probabilistic language models and deterministic backends. This guide covers schema structure, parameter and description patterns, granularity tradeoffs, validation and coercion policy, versioning, the Harbor Support refactor, a decision table for tool schemas versus unstructured prompts, pitfalls, and a production checklist.
Schema structure basics
Most providers accept JSON Schema (draft 2020-12 or a subset) wrapped in a tool definition. Treat each tool as a small API endpoint the model can discover.
| Field | Purpose | Design note |
|---|---|---|
| `name` | Stable identifier in tool_call payloads | snake_case verb_noun: `get_order`, `create_refund` |
| `description` | When to call this tool (not how your code works) | Lead with trigger conditions; mention what it does not do |
| `parameters` | JSON Schema object for arguments | Keep `additionalProperties: false` in production |
| `required` | Minimum args before your handler runs | Prefer fewer required fields; use server defaults for the rest |
| `enum` | Closed set of allowed values | Pair every enum with a one-line meaning in the property description |
A minimal pattern:
    {
      "name": "get_order",
      "description": "Fetch order details by ID. Use when the customer asks about status, shipping, or refunds. Do not call if no order ID is known.",
      "parameters": {
        "type": "object",
        "properties": {
          "order_id": {
            "type": "string",
            "description": "Order ID exactly as shown in confirmation email, format ORD- followed by 8 digits.",
            "pattern": "^ORD-[0-9]{8}$"
          }
        },
        "required": ["order_id"],
        "additionalProperties": false
      }
    }
Enforce the pattern server-side, because the model can ignore it; the
description teaches the model what valid IDs look like before validation ever fires.
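A minimal sketch of that server-side check for the `get_order` schema above, using only the Python standard library. The function name `validate_get_order_args` and the wording of the messages are assumptions made for illustration; a production handler would more likely hand the schema to a full JSON Schema validator.

```python
import re

# Same pattern as the "order_id" property in the get_order schema.
ORDER_ID_PATTERN = re.compile(r"^ORD-[0-9]{8}$")


def validate_get_order_args(args: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments are valid."""
    errors = []
    # additionalProperties: false -> reject keys the schema does not declare.
    extra = sorted(set(args) - {"order_id"})
    if extra:
        errors.append(f"unexpected properties: {', '.join(extra)}")
    # required: ["order_id"]
    if "order_id" not in args:
        errors.append("order_id is required")
        return errors
    order_id = args["order_id"]
    # type: string, then pattern
    if not isinstance(order_id, str):
        errors.append("order_id must be a string")
    elif not ORDER_ID_PATTERN.fullmatch(order_id):
        errors.append("order_id must be ORD- followed by 8 digits, e.g. ORD-12345678")
    return errors
```

Returning every problem at once, rather than stopping at the first, gives the model more to work with on a retry.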
Parameter description patterns
Models read parameter descriptions at inference time. Vague schemas produce vague calls.
- Name for semantics. `refund_reason_code` beats `reason`; `amount_cents` beats `amount`.
- Description = example + constraint. “ISO 8601 date in UTC, e.g. 2026-06-11. Must not be in the future.”
- Enum entries documented inline. List each value and when to pick it: `damaged_item` (product arrived broken), `not_as_described` (listing mismatch).
- Negative space. “Do not pass customer email here; use `lookup_customer_by_email` instead.”
- Cross-parameter hints. “Required when `refund_type` is `partial`; omit for full refunds.”
Descriptions are prompt tokens. Keep them dense, not essay-length. One to three sentences per property is usually enough if names are good.
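The patterns above can be combined in one parameter object. The sketch below builds a hypothetical `create_refund` schema as a Python dict. The helper `describe_enum`, the `changed_mind` value, and the exact description wording are assumptions for illustration, not part of any provider API.

```python
import json

# The first two meanings come from the list above; "changed_mind" is invented.
REFUND_REASONS = {
    "damaged_item": "product arrived broken",
    "not_as_described": "listing mismatch",
    "changed_mind": "customer no longer wants the item",
}


def describe_enum(intro: str, meanings: dict[str, str]) -> str:
    """Fold per-value meanings into one dense property description."""
    parts = [f"{value} ({meaning})" for value, meaning in meanings.items()]
    return f"{intro} One of: " + ", ".join(parts) + "."


create_refund_parameters = {
    "type": "object",
    "properties": {
        "order_id": {
            "type": "string",
            "pattern": "^ORD-[0-9]{8}$",
            "description": "Order ID, format ORD- followed by 8 digits. "
            "Do not pass customer email here.",
        },
        "refund_type": {
            "type": "string",
            "enum": ["full", "partial"],
            "description": "full (refund the whole order) or partial (refund amount_cents only).",
        },
        "refund_reason_code": {
            "type": "string",
            "enum": list(REFUND_REASONS),
            "description": describe_enum("Why the refund is being issued.", REFUND_REASONS),
        },
        "amount_cents": {
            "type": "integer",
            "minimum": 1,
            "description": "Refund amount in cents, e.g. 1999 for $19.99. "
            "Required when refund_type is partial; omit for full refunds.",
        },
    },
    "required": ["order_id", "refund_type", "refund_reason_code"],
    "additionalProperties": False,
}

if __name__ == "__main__":
    print(json.dumps(create_refund_parameters, indent=2))
```

Generating the enum description from the same mapping that defines the enum keeps the documented values and the allowed values from drifting apart.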
Granularity taxonomy
How many tools and how wide each schema should be is the highest-leverage design choice.
| Style | Shape | Pros | Cons |
|---|---|---|---|
| Atomic read | One resource, one GET-like tool | Easy routing, clear errors, parallelizable | More tools in context window |
| Atomic write | One mutation per tool | Idempotency per action, audit trails | Multi-step flows need orchestration |
| Composite action | `process_refund` with many optional fields | Fewer round trips | Model fills wrong combo; hard to test |
| Meta / planner tool | `run_sql`, `execute_code` | Maximum flexibility | Safety risk; needs heavy guardrails |
| Router tool | `search_knowledge_base` with query only | Hides backend complexity | Opaque failures; harder to debug |
Default to atomic reads and writes. Introduce composite tools only when latency dominates and you can validate the full parameter bundle deterministically. Meta tools belong behind guardrails and narrow allowlists, not as the first design.
Validation and coercion pipeline
Never trust model output. Your handler should run a strict pipeline before side effects:
- Parse JSON. Reject malformed payloads with a structured error the model can read on retry (see tool error handling).
- JSON Schema validate. Types, enums, patterns, min/max.
- Business rules. Order exists, user owns it, refund within policy window.
- Coerce cautiously. Trim strings and parse integers; do not guess missing required fields.
- Authorize. Map tool args to principal; reject cross-tenant access.
- Execute idempotently. Dedupe keys on writes.
Return validation errors as observations, not HTTP 500s. The model often self-corrects on the next turn when errors cite the exact field and allowed values.
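The pipeline can be sketched as a single handler. Everything here is an assumption for illustration: the in-memory `ORDERS` and `COMPLETED` stores, the `tool_error` shape, and the hand-rolled checks standing in for a JSON Schema validator. The sketch also departs from the listed order in two ways: it applies the light coercion before the type checks so those checks see normalized values, and it folds ownership into the business-rule step.

```python
import json
import re

ORDER_ID = re.compile(r"ORD-[0-9]{8}")
# Stand-ins for a real database and idempotency store.
ORDERS = {"ORD-12345678": {"owner": "cust_1", "total_cents": 5000}}
COMPLETED: dict[str, dict] = {}


def tool_error(field: str, message: str) -> dict:
    """Build a structured observation the model can read on its next turn."""
    return {"ok": False, "error": {"field": field, "message": message}}


def handle_create_refund(raw_args: str, principal: str, idempotency_key: str) -> dict:
    # Parse JSON; malformed payloads become a readable error, not an exception.
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return tool_error("arguments", f"not valid JSON: {exc.msg}")
    if not isinstance(args, dict):
        return tool_error("arguments", "must be a JSON object")

    # Coerce cautiously: trim strings, parse digit-only strings, invent nothing.
    order_id = args.get("order_id")
    if isinstance(order_id, str):
        order_id = order_id.strip()
    amount = args.get("amount_cents")
    if isinstance(amount, str) and re.fullmatch(r"[0-9]{1,12}", amount.strip()):
        amount = int(amount.strip())

    # Schema-style checks: types, pattern, minimum.
    if not isinstance(order_id, str) or not ORDER_ID.fullmatch(order_id):
        return tool_error("order_id", "expected ORD- followed by 8 digits")
    if isinstance(amount, bool) or not isinstance(amount, int) or amount < 1:
        return tool_error("amount_cents", "expected a positive integer number of cents")

    # Business rules and authorization: the order must exist and belong to the caller.
    order = ORDERS.get(order_id)
    if order is None or order["owner"] != principal:
        return tool_error("order_id", "no order with this ID for this customer")
    if amount > order["total_cents"]:
        return tool_error("amount_cents", "exceeds the order total")

    # Execute idempotently: a repeated key returns the earlier result.
    if idempotency_key not in COMPLETED:
        COMPLETED[idempotency_key] = {"ok": True, "order_id": order_id, "refunded_cents": amount}
    return COMPLETED[idempotency_key]
```

Every failure path returns the same `{"ok": False, "error": {...}}` shape, so the caller can pass it straight back to the model as the tool result.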
Harbor Support refactor
Before: one `update_order` tool with twelve optional parameters covering
status changes, notes, refunds, and address edits. The model frequently mixed intents,
for example setting `status: refunded` without calling payment APIs.
After:
- Split into `get_order`, `get_shipment_status`, `check_refund_eligibility`, `create_refund`, `add_order_note`.
- Added `pattern` on `order_id` and enum on `refund_reason_code` with per-value descriptions.
- Moved free-text customer quotes to a separate `customer_message` field on a read-only logging tool, not on write schemas.
- Registered tool list filtered by intent routing so refund flows never saw address-edit tools.
- Versioned schemas in a registry tied to prompt versions for A/B tests.
Bad tool call rate fell from 31% to 4%. Median turns to resolution dropped from 3.8 to 2.4 because the model stopped recovering from self-inflicted validation loops.
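The intent-based filtering in the refactor could look roughly like the sketch below. The intent labels, the grouping of tools per intent, and the `update_shipping_address` tool name are hypothetical; the source only says that refund flows never saw address-edit tools.

```python
# Tool names come from the refactor list; the grouping is an assumption.
TOOLS_BY_INTENT = {
    "refund": ["get_order", "check_refund_eligibility", "create_refund", "add_order_note"],
    "shipping": ["get_order", "get_shipment_status", "add_order_note"],
    "address_change": ["get_order", "update_shipping_address", "add_order_note"],
}
FALLBACK_TOOLS = ["get_order"]  # read-only default when the intent is unknown


def tools_for_intent(intent: str, registry: dict[str, dict]) -> list[dict]:
    """Return only the tool definitions the given intent is allowed to see."""
    names = TOOLS_BY_INTENT.get(intent, FALLBACK_TOOLS)
    # Skip names missing from the registry instead of failing the whole request.
    return [registry[name] for name in names if name in registry]
```

The filtered list is what gets registered with the model for that conversation, so a refund flow cannot call a tool it was never shown.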
Technique decision table
| Problem | Prefer structured tool schemas | Prefer unstructured / prompt-only |
|---|---|---|
| Calling production APIs with side effects | Yes — validation is mandatory | No |
| Extracting one field from user text | Often overkill | Yes — use structured output on the reply, not a tool |
| 10+ backend operations per session | Yes — atomic tools + routing | No — model invents parameters |
| Exploratory analysis (SQL, notebooks) | Meta tools with guardrails | Raw prompt injection of queries is unsafe |
| FAQ with no backend | No tools needed | RAG or static prompt suffices |
| Multi-tool parallel reads | Yes — narrow schemas enable parallel calls | N/A |
Common pitfalls
- Overloaded tools. One function with many optional params hides required combos from the model.
- Generic parameter names. `id`, `type`, `data` collide across tools in the same context.
- Missing format hints. Models guess date formats; always specify.
- Enum without documentation. Raw `["a","b","c"]` forces hallucinated mappings.
- Too many tools in one prompt. Past ~15–20 tools, routing accuracy falls; use intent-based subsets.
- Schema drift without versioning. Changing enums breaks in-flight sessions; version and deprecate.
- Descriptions that document implementation. “Calls Postgres table orders_v2” does not help the model pick arguments.
- No server validation. Trusting model JSON is how refunds hit wrong accounts.
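Several of these pitfalls can be caught mechanically before a schema ships. The linter below is a rough sketch, assuming tool definitions shaped like the `get_order` example; `lint_tool` and its rules are illustrative. The enum check is a plain substring match, so it can miss an undocumented value that happens to appear inside another word.

```python
import re

GENERIC_NAMES = {"id", "type", "data"}  # the collision-prone names called out above


def lint_tool(tool: dict) -> list[str]:
    """Return warnings for one tool definition; an empty list means no findings."""
    warnings = []
    name = tool.get("name", "")
    # Expect at least two snake_case words, e.g. get_order.
    if not re.fullmatch(r"[a-z]+(_[a-z0-9]+)+", name):
        warnings.append(f"{name!r}: name is not snake_case verb_noun")
    params = tool.get("parameters", {})
    if params.get("additionalProperties") is not False:
        warnings.append(f"{name}: additionalProperties should be false")
    for prop, spec in params.get("properties", {}).items():
        description = spec.get("description", "")
        if prop in GENERIC_NAMES:
            warnings.append(f"{name}.{prop}: generic parameter name")
        if not description:
            warnings.append(f"{name}.{prop}: missing description")
        # Crude check: each enum value should at least be mentioned in the description.
        for value in spec.get("enum", []):
            if str(value) not in description:
                warnings.append(f"{name}.{prop}: enum value {value!r} is undocumented")
    return warnings
```

Running a check like this in CI against every registered tool turns some of the review items into a failing build instead of a production incident.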
Production checklist
- Every tool has a verb_noun name and a when-to-call description.
- `additionalProperties: false` on all parameter objects.
- Enums documented per value; patterns on IDs and codes.
- Atomic writes; composite tools justified and tested.
- JSON Schema validation before any side effect.
- Structured validation errors returned as tool observations.
- Tool list filtered by intent or route where catalog exceeds ~15 tools.
- Schemas versioned alongside prompts in a registry.
- Golden eval set mapping user utterances to the expected tool and args.
- Metrics: tool call rate, validation failure rate, retry count per tool.
- Idempotency keys on all mutating tools.
- Runbook for rolling back schema changes under incident.
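The golden eval set item can be made concrete with a small scoring loop. This is a sketch under assumptions: `score_golden_set`, the case format, and the metric names are hypothetical, and `predict` stands in for whatever function runs the model and returns the tool name and arguments it chose.

```python
from typing import Callable

# (utterance, expected tool, expected args)
GoldenCase = tuple[str, str, dict]


def score_golden_set(
    cases: list[GoldenCase],
    predict: Callable[[str], tuple[str, dict]],
) -> dict[str, float]:
    """Compare predicted tool calls with expectations and return match rates."""
    tool_hits = 0
    exact_hits = 0
    for utterance, expected_tool, expected_args in cases:
        tool, args = predict(utterance)
        if tool == expected_tool:
            tool_hits += 1
            # Arguments only count when the tool choice was right as well.
            if args == expected_args:
                exact_hits += 1
    total = len(cases) or 1  # avoid dividing by zero on an empty set
    return {"tool_accuracy": tool_hits / total, "exact_match": exact_hits / total}
```

Tracking tool choice separately from exact argument match shows whether a regression comes from routing or from parameter filling.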
Key takeaways
- Schemas are prompts. Parameter descriptions teach the model what to pass.
- Atomic tools beat Swiss Army functions. Split overloaded endpoints.
- Validate everything server-side. Models are probabilistic; APIs are not.
- Route tools by intent. Smaller catalogs route more accurately.
- Measure validation failures. They predict production pain before users do.
Related reading
- LLM function calling explained — provider APIs and runtime registration
- LLM structured outputs explained — JSON mode vs tool calls for extraction
- LLM tool error handling explained — validation errors as observations
- LLM output parsing and validation explained — post-model parsing pipelines