Guide

LLM function calling explained

A customer asks Harbor Commerce support: “Where is order HC-88421 and can I change the shipping address?” A plain chat model can only guess. Function calling (also called tool use) lets the model emit a structured request — lookup_order(order_id="HC-88421") — that your application executes and feeds back as a tool result message. The model then composes a grounded answer from real data. This pattern powers order lookup, calendar booking, database queries, code execution, and the action layer behind autonomous agents. This guide covers JSON Schema tool definitions, the multi-turn call loop, parallel versus sequential execution, provider API differences (OpenAI, Anthropic, Google Gemini), strict mode and validation, security guardrails, a Harbor Commerce order assistant worked example, a decision table versus structured outputs and ReAct prompting, common pitfalls, and a production checklist — assuming basic familiarity with prompt engineering.

What function calling is

Function calling is a contract between your application and an LLM: you declare a set of tools (functions the model may invoke), each with a name, natural-language description, and a JSON Schema describing its parameters. At inference time the model can either reply with user-facing text or return a tool call — a JSON object naming the function and supplying arguments that conform to the schema.

Your code — not the model — runs the actual function. You append a tool role message (or provider-specific equivalent) with the result, then call the model again. This loop continues until the model produces a final natural-language answer or you hit a step limit. The model never directly touches your database; it only proposes typed actions your server validates and executes.

Function calling vs structured outputs

Both patterns constrain model output to JSON. The difference is intent:

Structured outputs — the model’s entire response is one JSON document matching a schema (classification labels, extracted entities, report fields). Best when you need a single parsed object per turn.
Function calling — the model chooses which of several registered tools to invoke, with arguments, often across multiple turns. Best when the action space is dynamic and the model must decide what to do next.

Many production systems combine both: function calling for the agent loop, structured outputs inside individual tool handlers that return normalized data.

Designing tool schemas

Schema quality determines reliability. Treat each tool like a public API endpoint: narrow scope, explicit types, and descriptions written for the model, not just developers.

Schema anatomy

name — snake_case identifier; verbs help (get_order_status, not order).
description — when to use this tool, what it returns, and when not to use it. Mention required ID formats.
parameters — JSON Schema object with type, properties, required, and per-field descriptions.

Example tool definition (OpenAI-compatible shape):

{
  "type": "function",
  "function": {
    "name": "lookup_order",
    "description": "Fetch order status, items, and shipping for a Harbor Commerce order ID (format HC-#####). Use when the user asks about a specific order.",
    "parameters": {
      "type": "object",
      "properties": {
        "order_id": {
          "type": "string",
          "pattern": "^HC-\\d{5}$",
          "description": "Harbor order ID, e.g. HC-88421"
        }
      },
      "required": ["order_id"],
      "additionalProperties": false
    }
  }
}

Schema design rules

Prefer fewer, focused tools over one mega-function with twenty optional parameters.
Use enum for closed sets (shipping carriers, ticket priorities).
Set additionalProperties: false when providers support strict mode — reduces hallucinated fields.
Return compact tool results; truncate large lists and paginate via follow-up calls.
Version tools in the name (search_products_v2) when breaking changes ship.

The multi-turn call loop

A minimal production loop looks like this:

Send system prompt + conversation history + tool definitions to the model.
If the response contains tool calls, parse each call’s name and arguments.
Validate arguments against your schema (and business rules) before execution.
Execute handlers server-side with authenticated service credentials.
Append tool result messages and call the model again.
Repeat until the model returns text only, or cap iterations (typically 5–10).

Parallel vs sequential tool calls

Modern models can emit multiple tool calls in one turn when requests are independent — e.g. lookup_order and get_shipping_options for the same user message. Execute independent calls concurrently to reduce latency. Dependent calls (create draft, then confirm) must stay sequential across turns.

Always enforce a max parallel fan-out and total step budget. Unbounded loops are a common source of runaway token bills and accidental API hammering.

Error handling in the loop

When a tool fails, return a structured error string in the tool message ({"error": "order_not_found", "order_id": "HC-99999"}). Models recover well from explicit errors; silent empty results encourage fabrication. Log failures with correlation IDs for debugging.

Provider API differences

The conceptual loop is the same; wire formats differ. Abstract behind a thin adapter if you swap providers.

OpenAI

Chat Completions and the newer Responses API accept a tools array. Set tool_choice to auto, required, or a specific function. Strict mode (strict: true on function definitions) guarantees schema adherence via constrained decoding — use it for production when available. Parallel tool calls are enabled by default; disable with parallel_tool_calls: false when ordering matters within a single turn.

Anthropic

Claude uses a tools array with input_schema (JSON Schema). Tool use blocks appear in the assistant message; you reply with tool_result blocks. tool_choice supports auto, any, or a named tool. Claude is strong at multi-step reasoning; still cap tool iterations.

Google Gemini

Gemini exposes functionDeclarations in tools. The model returns functionCall parts; respond with functionResponse. Mode ANY forces a tool call; AUTO lets the model choose. Validate argument shapes — provider strictness varies by model version.

Frameworks like LangChain normalize these differences behind bind_tools() and agent executors, at the cost of abstraction leakage when providers ship new features first.

Security and guardrails

Function calling turns natural language into executable actions. Treat it as a privilege boundary, not a convenience feature.

Allowlist tools per user role — guests get lookup_order; agents get issue_refund behind approval workflows.
Never trust model-generated SQL or shell — wrap data access in parameterized handlers; reject raw query strings.
Authenticate outbound calls with service accounts, not user-supplied tokens embedded in prompts.
Rate-limit and audit destructive tools (refunds, deletes, sends).
Human-in-the-loop for irreversible actions above a dollar threshold.
Prompt injection defense — untrusted document text in RAG context can trick the model into calling tools; separate system instructions from retrieved content and scan for override attempts.

Pair function calling with input/output filters and structured logging. The model is an untrusted planner; your handlers are the trusted execution layer.

Worked example: Harbor Commerce order assistant

Harbor Commerce ships a support chatbot backed by three tools:

lookup_order(order_id) — returns status, line items, tracking URL.
update_shipping_address(order_id, address) — allowed only before shipped status; requires address validation.
create_return_ticket(order_id, reason) — opens a Zendesk ticket; idempotent on duplicate calls.

User: “HC-88421 hasn’t arrived — can you check and start a return if it’s lost?”

Turn 1: Model calls lookup_order("HC-88421"). Handler returns {"status": "in_transit", "carrier": "UPS", "eta": "2026-06-11", "tracking": "..."}.

Turn 2: Model responds in natural language with ETA and tracking link; does not open a return because status is not delivered/lost. If the user insists, turn 3 may call create_return_ticket after confirming eligibility rules in the system prompt.

Key design choices: shipping updates are blocked at the handler (not just the prompt); return tickets log order_id + conversation_id; tool results strip PII before logging. Average resolution: two model round-trips, ~1,200 input tokens with trimmed order payloads.

Decision table: when to use what

Need	Function calling	Structured outputs	ReAct / text tools
Dynamic tool selection from a registry	Best fit	Poor	OK
Single JSON extraction per turn	Overkill	Best fit	Poor
Multi-step workflows with branching	Best fit	Manual orchestration	OK for prototypes
Provider-native strict schema guarantees	OpenAI strict tools	Structured outputs API	None
Debugging transparency	Typed call logs	Single blob	Parse action lines
Latency-sensitive single lookup	One call OK	Often faster	Slower

Common pitfalls

Tool sprawl — twenty overlapping tools confuse the model; merge or namespace by domain.
Vague descriptions — “search database” causes wrong-tool picks; specify inputs and scope.
Returning huge JSON blobs — blows context and cost; summarize and offer pagination tools.
No server-side validation — models invent plausible but invalid IDs; always validate before IO.
Unbounded loops — model retries the same failed call forever; cap steps and detect repetition.
Mixing user content into system prompts — injection via pasted order notes or email bodies.
Ignoring latency — serial tool calls across five round-trips feel broken in chat; parallelize and stream interim status.

Production checklist

Define tools with JSON Schema, additionalProperties: false, and model-facing descriptions.
Implement validate-then-execute handlers; never eval model output.
Cap tool iterations and parallel fan-out per request.
Return structured errors in tool messages; log with correlation IDs.
Enable provider strict mode where supported for argument conformance.
Role-scope tools; require approval for destructive operations.
Truncate and redact tool results before re-prompting and logging.
Monitor tool call rates, failure types, and token cost per conversation.
Integration-test golden paths (happy path, not found, permission denied).
Document tool registry versions and deprecation windows for downstream agents.

Key takeaways

Function calling lets models propose typed actions; your server executes and returns results.
Invest in schema design and descriptions — they matter more than clever system prompts.
The production loop is: call → validate → execute → append tool results → repeat until done.
Use strict mode and server validation together; neither alone is sufficient for security.
Pair tools with frameworks like LangChain or MCP when the action space grows beyond a handful of functions.