LLM tool schema design explained

Harbor Support's refund triage bot had six tools registered and a capable model, yet 31% of tool calls failed validation: invented order IDs, reason fields stuffed with full customer rants instead of enum values, and amount_cents passed as strings like “full refund.” Swapping the base model from GPT-4o to a newer release barely moved the needle. Rewriting the tool schemas (clearer names, tighter types, explicit format hints in descriptions, and splitting one overloaded update_order into atomic tools) dropped bad calls to 4% in a week without retraining anything.

Tool schema design is the craft of defining what an agent is allowed to invoke: function names, parameter shapes, and the natural-language descriptions models read when choosing arguments. It sits upstream of function-calling runtime plumbing and downstream of product intent: your schema is the contract between probabilistic language models and deterministic backends. This guide covers schema structure, parameter and description patterns, granularity tradeoffs, validation and coercion policy, versioning, the Harbor Support refactor, a decision table for structured schemas versus unstructured prompts, pitfalls, and a production checklist.

Schema structure basics

Most providers accept JSON Schema (draft 2020-12 or a subset) wrapped in a tool definition. Treat each tool as a small API endpoint the model can discover.

Field	Purpose	Design note
`name`	Stable identifier in tool_call payloads	snake_case verb_noun: `get_order`, `create_refund`
`description`	When to call this tool (not how your code works)	Lead with trigger conditions; mention what it does not do
`parameters`	JSON Schema object for arguments	Keep `additionalProperties: false` in production
`required`	Minimum args before your handler runs	Prefer fewer required fields; use server defaults for the rest
`enum`	Closed set of allowed values	Pair every enum with a one-line meaning in the property description

A minimal pattern:

{
  "name": "get_order",
  "description": "Fetch order details by ID. Use when the customer asks about status, shipping, or refunds. Do not call if no order ID is known.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "description": "Order ID exactly as shown in confirmation email, format ORD- followed by 8 digits.",
        "pattern": "^ORD-[0-9]{8}$"
      }
    },
    "required": ["order_id"],
    "additionalProperties": false
  }
}

Enforce the pattern server-side, because the model may ignore it; the description teaches the model what valid IDs look like before validation fires.
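As a rough illustration of that server-side check, the sketch below validates arguments for the get_order schema above using only the Python standard library. The function name `check_get_order_args` and the wording of its messages are assumptions made for this example; a production handler would more likely delegate to a full JSON Schema validator.

```python
import re

# Copy of the get_order parameter schema from the example above.
GET_ORDER_PARAMS = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string", "pattern": "^ORD-[0-9]{8}$"},
    },
    "required": ["order_id"],
    "additionalProperties": False,
}


def check_get_order_args(args: dict) -> list[str]:
    """Return human-readable problems; an empty list means the args pass.

    Hypothetical helper: it covers only the keywords this one schema uses.
    """
    problems = []
    properties = GET_ORDER_PARAMS["properties"]
    for name in GET_ORDER_PARAMS["required"]:
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in properties:
            # Mirrors additionalProperties: false.
            problems.append(f"unexpected field: {name}")
            continue
        pattern = properties[name]["pattern"]
        if not isinstance(value, str):
            problems.append(f"{name} must be a string")
        # fullmatch is safe here because the pattern is anchored at both ends.
        elif not re.fullmatch(pattern, value):
            problems.append(f"{name} must match {pattern}")
    return problems
```

Calling `check_get_order_args({"order_id": "12345678"})` reports the pattern mismatch, which is the kind of message that can be handed back to the model on retry.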

Parameter description patterns

Models read parameter descriptions at inference time. Vague schemas produce vague calls.

Name for semantics. refund_reason_code beats reason; amount_cents beats amount.
Description = example + constraint. “ISO 8601 date in UTC, e.g. 2026-06-11. Must not be in the future.”
Enum entries documented inline. List each value and when to pick it: damaged_item (product arrived broken), not_as_described (listing mismatch).
Negative space. “Do not pass customer email here; use lookup_customer_by_email instead.”
Cross-parameter hints. “Required when refund_type is partial; omit for full refunds.”

Descriptions are prompt tokens. Keep them dense, not essay-length. One to three sentences per property is usually enough if names are good.
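To make these patterns concrete, here is a minimal sketch of a create_refund parameter schema written as a Python dict. The helper `enum_property`, the exact description wording, and the choice of required fields are assumptions for illustration; only the parameter names and the two enum values come from the text above.

```python
import json

# The two enum values and their meanings come from the guide's examples.
REASON_CODES = {
    "damaged_item": "product arrived broken",
    "not_as_described": "listing mismatch",
}


def enum_property(intro: str, meanings: dict[str, str]) -> dict:
    """Build a string enum whose description documents every value inline."""
    documented = "; ".join(
        f"{value} ({meaning})" for value, meaning in meanings.items()
    )
    return {
        "type": "string",
        "enum": list(meanings),
        "description": f"{intro} One of: {documented}.",
    }


# Hypothetical schema: descriptions follow the example + constraint pattern.
CREATE_REFUND_PARAMS = {
    "type": "object",
    "properties": {
        "order_id": {
            "type": "string",
            "pattern": "^ORD-[0-9]{8}$",
            "description": "Order ID, format ORD- followed by 8 digits. "
            "Do not pass customer email here.",
        },
        "refund_type": {
            "type": "string",
            "enum": ["full", "partial"],
            "description": "full (refund the whole order) or "
            "partial (refund only amount_cents).",
        },
        "amount_cents": {
            "type": "integer",
            "minimum": 1,
            "description": "Amount in integer cents, e.g. 1250. Required "
            "when refund_type is partial; omit for full refunds.",
        },
        "refund_reason_code": enum_property(
            "Why the refund is being issued.", REASON_CODES
        ),
    },
    "required": ["order_id", "refund_type", "refund_reason_code"],
    "additionalProperties": False,
}

print(json.dumps(CREATE_REFUND_PARAMS, indent=2))
```

Generating enum descriptions from a single mapping keeps the allowed values and their documentation from drifting apart.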

Granularity taxonomy

How many tools and how wide each schema should be is the highest-leverage design choice.

Style	Shape	Pros	Cons
Atomic read	One resource, one GET-like tool	Easy routing, clear errors, parallelizable	More tools in context window
Atomic write	One mutation per tool	Idempotency per action, audit trails	Multi-step flows need orchestration
Composite action	`process_refund` with many optional fields	Fewer round trips	Model fills wrong combo; hard to test
Meta / planner tool	`run_sql`, `execute_code`	Maximum flexibility	Safety risk; needs heavy guardrails
Router tool	`search_knowledge_base` with query only	Hides backend complexity	Opaque failures; harder to debug

Default to atomic reads and writes. Introduce composite tools only when latency dominates and you can validate the full parameter bundle deterministically. Meta tools belong behind guardrails and narrow allowlists, not as the first design.
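Validating a composite bundle deterministically mostly means checking field combinations, not just individual fields. The sketch below assumes a hypothetical composite refund tool with `refund_type` and `amount_cents` arguments; the function name and messages are illustrative only.

```python
def check_refund_bundle(args: dict) -> list[str]:
    """Cross-field checks that are easy to miss in per-property schemas.

    Hypothetical rule set: partial refunds need an amount, full refunds
    must not carry one.
    """
    problems = []
    refund_type = args.get("refund_type")
    has_amount = "amount_cents" in args
    if refund_type == "partial" and not has_amount:
        problems.append("amount_cents is required when refund_type is partial")
    if refund_type == "full" and has_amount:
        problems.append("omit amount_cents when refund_type is full")
    return problems
```

Every optional field added to a composite tool adds combinations like these to enumerate and test, which is one reason atomic tools are the safer default.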

Validation and coercion pipeline

Never trust model output. Your handler should run a strict pipeline before side effects:

Parse JSON. Reject malformed payloads with a structured error the model can read on retry (see tool error handling).
JSON Schema validate. Types, enums, patterns, min/max.
Business rules. Order exists, user owns it, refund within policy window.
Coerce cautiously. Trim strings and parse integers; do not guess missing required fields.
Authorize. Map tool args to principal; reject cross-tenant access.
Execute idempotently. Dedupe keys on writes.

Return validation errors as observations, not HTTP 500s. The model often self-corrects on the next turn when errors cite the exact field and allowed values.
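The pipeline above can be sketched as a single handler. This is a simplified illustration, not a reference implementation: the in-memory `ORDERS` and `REFUNDS_BY_KEY` tables, the observation shape, and the function names are all assumptions. The sketch applies its one cautious coercion (trimming strings) just before the schema checks so that the trimmed values are what get validated, and it folds the ownership check into authorization so the error does not reveal other tenants' orders.

```python
import json
import re

ORDER_ID = re.compile(r"ORD-[0-9]{8}")
REASONS = ["damaged_item", "not_as_described"]
ALLOWED_FIELDS = {"order_id", "refund_reason_code"}

# Hypothetical in-memory stand-ins for the order store and dedupe table.
ORDERS = {"ORD-00000001": {"owner": "cust_1", "refundable": True}}
REFUNDS_BY_KEY: dict[str, dict] = {}


def error(field: str, message: str, allowed=None) -> dict:
    """Structured observation the model can read on its next turn."""
    observation = {"ok": False, "field": field, "message": message}
    if allowed is not None:
        observation["allowed_values"] = allowed
    return observation


def handle_create_refund(raw_args: str, principal: str, idempotency_key: str) -> dict:
    # Parse JSON.
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return error("(payload)", "arguments are not valid JSON")
    if not isinstance(args, dict):
        return error("(payload)", "arguments must be a JSON object")

    # Coerce cautiously: trim strings, never invent missing fields.
    args = {k: v.strip() if isinstance(v, str) else v for k, v in args.items()}

    # Schema-style checks: unknown fields, types, pattern, enum.
    unexpected = sorted(set(args) - ALLOWED_FIELDS)
    if unexpected:
        return error(unexpected[0], "unexpected field")
    order_id = args.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID.fullmatch(order_id):
        return error("order_id", "expected ORD- followed by 8 digits")
    if args.get("refund_reason_code") not in REASONS:
        return error("refund_reason_code", "not an allowed code", REASONS)

    # Authorize and apply business rules.
    order = ORDERS.get(order_id)
    if order is None or order["owner"] != principal:
        return error("order_id", "no such order for this customer")
    if not order["refundable"]:
        return error("order_id", "order is outside the refund policy window")

    # Execute idempotently: a repeated key returns the stored result.
    if idempotency_key in REFUNDS_BY_KEY:
        return REFUNDS_BY_KEY[idempotency_key]
    result = {"ok": True, "refunded_order": order_id}
    REFUNDS_BY_KEY[idempotency_key] = result
    return result
```

Every failure path returns a dict naming the offending field, so the caller can pass it back to the model as a tool observation instead of raising.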

Harbor Support refactor

Before: one update_order tool with twelve optional parameters covering status changes, notes, refunds, and address edits. The model frequently mixed intents: setting status: refunded without calling payment APIs.

After:

Split into get_order, get_shipment_status, check_refund_eligibility, create_refund, add_order_note.
Added pattern on order_id and enum on refund_reason_code with per-value descriptions.
Moved free-text customer quotes to a separate customer_message field on a read-only logging tool, not on write schemas.
Registered tool list filtered by intent routing so refund flows never saw address-edit tools.
Versioned schemas in a registry tied to prompt versions for A/B tests.

The bad tool call rate fell from 31% to 4%. Median turns to resolution dropped from 3.8 to 2.4 because the model no longer had to recover from self-inflicted validation loops.
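Intent-based filtering of the registered tool list can be sketched as a small registry lookup. The registry layout, the version labels, the intent names, and the `update_shipping_address` tool are hypothetical; the other tool names are the ones from the refactor above.

```python
# Hypothetical registry: tool name -> (schema version, intents that may see it).
TOOL_REGISTRY = {
    "get_order": ("v2", {"refund", "shipping", "address"}),
    "get_shipment_status": ("v2", {"shipping"}),
    "check_refund_eligibility": ("v2", {"refund"}),
    "create_refund": ("v2", {"refund"}),
    "add_order_note": ("v2", {"refund", "shipping", "address"}),
    "update_shipping_address": ("v1", {"address"}),  # illustrative address-edit tool
}


def tools_for_intent(intent: str) -> list[str]:
    """Return the tool names to register for one routed intent."""
    return sorted(
        name
        for name, (_version, intents) in TOOL_REGISTRY.items()
        if intent in intents
    )


print(tools_for_intent("refund"))
```

With this layout a refund conversation never sees the address-edit tool, and the version label travels with each schema so it can be logged next to the prompt version.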

Technique decision table

Problem	Prefer structured tool schemas	Prefer unstructured / prompt-only
Calling production APIs with side effects	Yes — validation is mandatory	No
Extracting one field from user text	Often overkill	Yes — use structured output on the reply, not a tool
10+ backend operations per session	Yes — atomic tools + routing	No — model invents parameters
Exploratory analysis (SQL, notebooks)	Meta tools with guardrails	Raw prompt injection of queries is unsafe
FAQ with no backend	No tools needed	RAG or static prompt suffices
Multi-tool parallel reads	Yes — narrow schemas enable parallel calls	N/A

Common pitfalls

Overloaded tools. One function with many optional params hides required combos from the model.
Generic parameter names. id, type, data collide across tools in the same context.
Missing format hints. Models guess date formats; always specify.
Enum without documentation. Raw ["a","b","c"] forces hallucinated mappings.
Too many tools in one prompt. Past roughly 15–20 tools, routing accuracy tends to fall; register intent-based subsets instead of the full catalog.
Schema drift without versioning. Changing enums breaks in-flight sessions; version and deprecate.
Descriptions that document implementation. “Calls Postgres table orders_v2” does not help the model pick arguments.
No server validation. Trusting model JSON is how refunds hit wrong accounts.
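Several of these pitfalls can be caught mechanically before a schema ships. The linter below is a minimal sketch under stated assumptions: the function name, the warning wording, and the rule that an enum value counts as documented when it appears in the property description are all choices made for this example.

```python
# Generic names called out in the pitfalls list above.
GENERIC_NAMES = {"id", "type", "data"}


def lint_tool(tool: dict) -> list[str]:
    """Return warnings for common schema pitfalls in one tool definition."""
    warnings = []
    params = tool.get("parameters", {})
    if params.get("additionalProperties") is not False:
        warnings.append("parameters should set additionalProperties to false")
    for name, spec in params.get("properties", {}).items():
        description = spec.get("description", "")
        if name in GENERIC_NAMES:
            warnings.append(f"{name}: generic parameter name")
        if not description:
            warnings.append(f"{name}: missing description")
        # Assumption: a value is documented if the description mentions it.
        for value in spec.get("enum", []):
            if str(value) not in description:
                warnings.append(f"{name}: enum value {value!r} is not documented")
    return warnings
```

Running a check like this in CI over the tool registry is one way to keep the list from regressing as tools are added.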

Production checklist

Every tool has a verb_noun name and a when-to-call description.
additionalProperties: false on all parameter objects.
Enums documented per value; patterns on IDs and codes.
Atomic writes; composite tools justified and tested.
JSON Schema validation before any side effect.
Structured validation errors returned as tool observations.
Tool list filtered by intent or route where catalog exceeds ~15 tools.
Schemas versioned alongside prompts in a registry.
Golden eval set mapping user utterances to the expected tool and arguments.
Metrics: tool call rate, validation failure rate, retry count per tool.
Idempotency keys on all mutating tools.
Runbook for rolling back schema changes during an incident.
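The golden eval item can be sketched as a small scoring loop. Everything here is illustrative: the two cases, the metric names, and `fake_predict`, which stands in for a real model call and simply always picks get_shipment_status. The argument shape for each tool is also assumed.

```python
# Hypothetical golden cases: utterance -> expected tool and arguments.
GOLDEN_CASES = [
    {
        "utterance": "Where is my package for ORD-00000001?",
        "tool": "get_shipment_status",
        "args": {"order_id": "ORD-00000001"},
    },
    {
        "utterance": "ORD-00000002 arrived broken, I want my money back.",
        "tool": "check_refund_eligibility",
        "args": {"order_id": "ORD-00000002"},
    },
]


def score(cases, predict) -> dict:
    """Compare predicted (tool, args) pairs against the golden set."""
    tool_hits = exact_hits = 0
    for case in cases:
        tool, args = predict(case["utterance"])
        if tool == case["tool"]:
            tool_hits += 1
            if args == case["args"]:
                exact_hits += 1
    total = len(cases) or 1  # avoid division by zero on an empty set
    return {
        "tool_accuracy": tool_hits / total,
        "exact_accuracy": exact_hits / total,
    }


def fake_predict(utterance: str):
    """Stand-in for a model call: extracts the ID, always picks one tool."""
    digits = utterance.split("ORD-")[1][:8]
    return "get_shipment_status", {"order_id": "ORD-" + digits}


print(score(GOLDEN_CASES, fake_predict))
```

Tracking tool accuracy separately from exact-argument accuracy helps distinguish routing problems from parameter-filling problems.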

Key takeaways

Schemas are prompts. Parameter descriptions teach the model what to pass.
Atomic tools beat Swiss Army functions. Split overloaded endpoints.
Validate everything server-side. Models are probabilistic; APIs are not.
Route tools by intent. Smaller catalogs route more accurately.
Measure validation failures. They predict production pain before users do.