Guide

AI agents and tool use explained

A chatbot answers in one shot. An AI agent runs a loop: observe context, decide what to do next, call external tools (search, databases, APIs, code execution), read the results, and repeat until the task is done. That pattern — often called tool use or function calling — is what powers coding assistants, research bots, customer-support automations, and autonomous operators. This guide explains how the loop works, how models choose tools, where memory fits, what can go wrong, and when a simpler fixed workflow is the better engineering choice.

Agents vs single-turn chat

In a standard chat completion, the model maps one prompt to one response. The model cannot fetch live data, run SQL, or send email unless you pre-inject everything into the prompt — which breaks down as tasks grow.

An agent adds an orchestration layer around the model. The runtime exposes a catalog of tools with JSON schemas describing their parameters. The model outputs structured "call this function with these arguments" messages instead of plain text. Your code executes the tool, appends the result to the conversation history, and asks the model again. The user sees a final answer; behind the scenes there may have been five or fifty internal steps.

"Agentic" does not mean fully autonomous or unsupervised. Production agents almost always have step limits, approval gates for dangerous actions, and human review for high-stakes outputs.

How tool and function calling works

The tool catalog

Each tool is defined with a name, natural-language description, and a parameter schema (usually JSON Schema). Descriptions matter enormously — the model picks tools based on names and docstrings, not magic. Vague descriptions like "search" cause wrong picks; precise ones like "search internal wiki by keyword, returns top 5 markdown snippets" work better.

The call-and-execute cycle

  1. User message arrives with optional system instructions.
  2. Model responds with zero or more tool calls (function name + JSON args).
  3. Runtime validates args, executes tools (often in parallel when independent), catches errors.
  4. Tool results are appended as tool role messages.
  5. Model is invoked again until it returns a final natural-language answer or hits a step cap.

Major APIs (OpenAI, Anthropic, Google Gemini, open-weight stacks via vLLM) all converged on this shape. Frameworks like LangChain, LlamaIndex, AutoGen, and the OpenAI Agents SDK wrap the same loop with retries, tracing, and memory helpers.

Structured outputs vs tool calls

Structured output forces the model to emit JSON matching a schema — useful for extraction and classification. Tool calls imply side effects or external I/O. Confusing the two leads to bugs: do not mark "parse this invoice" as a tool if it is pure transformation; use structured output instead and reserve tools for actions that touch the outside world.

Planning loops: ReAct and beyond

The ReAct pattern (Reason + Act) interleaves chain-of-thought reasoning with tool use: the model writes a short plan, calls a tool, observes the result, updates its plan, and continues. Explicit reasoning traces improve reliability on multi-hop questions ("What was our Q3 revenue in the region where our largest customer is headquartered?").

Plan-and-execute variants

Some systems split roles: a planner model drafts a step list, an executor model carries out each step with tools, and a verifier checks results before the user sees them. This costs more tokens but reduces aimless looping. Other designs use a cheap model for routing and an expensive model only for synthesis.

When loops fail

  • Infinite retry — the model keeps calling the same failing API; fix with max steps and duplicate-call detection.
  • Tool hallucination — the model invents function names not in the catalog; validate strictly and return a clear error message back to the model.
  • Premature answer — the model guesses instead of looking up data; system prompts should require citation or tool evidence for factual claims.

Memory, state, and context

Every tool result and reasoning step consumes context window tokens. Long agent runs fill the window quickly. Common strategies:

  • Sliding window — drop oldest tool traces, keep system prompt and recent turns.
  • Summarization — compress completed sub-tasks into a paragraph before continuing.
  • External memory — write facts to a database or vector store; retrieve only what is relevant for the current step.
  • Scratchpad files — coding agents write notes to disk the same way a human keeps a todo list open.

Session state (user ID, permissions, draft order ID) should live in your application layer, not in the model's weights. Pass it in on each turn as structured metadata the model can reference but not overwrite.

RAG as a retrieval tool

Retrieval-augmented generation fits naturally into agent architectures as one tool among many — search_knowledge_base(query) — rather than stuffing all documents into every prompt. The agent decides when to retrieve, what query to run, and whether results are sufficient before answering.

Agentic RAG goes further: the model may rewrite queries, fetch multiple chunks, compare sources, and call retrieval again if the first pass was weak. That flexibility beats single-shot RAG on complex questions but multiplies latency and cost. See our RAG guide for chunking, hybrid search, and evaluation — agents still need those foundations to work.

Multi-agent and handoff patterns

Instead of one generalist agent with twenty tools, teams often deploy specialist agents: a researcher that only searches, a coder that only edits files, a reviewer that only critiques. A router or supervisor model delegates sub-tasks and merges outputs.

Handoffs work when boundaries are clear. They fail when agents argue, duplicate work, or pass unstructured blobs no downstream agent can parse. Define shared message formats (JSON with fixed fields) and cap delegation depth.

Human-in-the-loop fits here: pause before irreversible actions (payments, deletes, public posts) and resume when approved. Wallet and payment flows especially need explicit user confirmation — autonomy over money without guardrails is a liability.

Safety, permissions, and abuse

Tools are code execution with extra steps. An agent that can read email, browse the web, and run shell commands is a high-value target for prompt injection: untrusted content in a webpage or email can instruct the model to exfiltrate data or misuse tools.

Practical guardrails

  • Least privilege — each tool gets only the scopes it needs; separate read and write tools.
  • Allowlists — HTTP tools fetch only approved domains; SQL tools query views, not raw tables.
  • Human approval — gate sends, purchases, and permission changes.
  • Input sanitization — treat all retrieved text as hostile; never let it override system instructions.
  • Output filtering — block secrets, PII, and policy violations before returning to users.
  • Audit logs — record every tool call with arguments and results for forensics.

Defense in depth beats a single "do not do bad things" system prompt. Models forget; permissions enforced in code do not.

Cost, latency, and reliability

Agents are expensive relative to one-shot chat. Each loop adds a full model inference; tool latency (slow APIs, large PDF parsing) stacks serially unless you parallelize. Empirical studies of agentic software engineering show most token spend lands in refinement loops — re-reading files, re-running tests — not the first draft. Budget accordingly.

  • Set max steps and timeouts per tool.
  • Use smaller models for routing and tool selection; reserve large models for final answers.
  • Cache idempotent reads (search results, API lookups) within a session.
  • Instrument with metrics, logs, and traces — track cost per task, success rate, and which tools fail most.

Reliability improves when you add deterministic fallbacks: if the agent cannot complete a booking in ten steps, hand off to a human queue with full context attached.

When to use agents vs fixed workflows

Not every problem needs autonomy. Prefer a fixed pipeline when:

  • Steps are always the same (ingest CSV, validate schema, load warehouse).
  • Failure modes must be predictable for compliance.
  • Latency and cost budgets are tight.

Prefer an agent when:

  • User intent varies widely and is hard to enumerate upfront.
  • The task requires combining multiple data sources dynamically.
  • You can tolerate occasional failure with graceful escalation.

Many production systems are hybrids: a workflow handles the happy path; an agent handles edge cases and natural-language overrides. That split often delivers 80% of the value at 20% of the risk.

Building checklist

  1. Define success metrics (task completion rate, human takeover rate, cost per task).
  2. Start with the smallest tool set that solves one real user job.
  3. Write tool descriptions as if onboarding a junior developer.
  4. Add step limits, structured error messages, and parallel execution where safe.
  5. Log every turn; replay failures in staging.
  6. Red-team with injection payloads in retrieved content.
  7. Compare against a non-agent baseline — agents should win on outcomes, not hype.

Key takeaways

  • AI agents are loops around a language model, not smarter chatbots.
  • Tool calling lets models act on the world; schemas and descriptions determine reliability.
  • ReAct-style planning and specialist sub-agents help on multi-step tasks but cost tokens.
  • Memory and RAG should be explicit tools, not infinite context.
  • Permissions and injection defenses belong in code, not prompts alone.
  • Use agents where flexibility matters; use workflows where predictability matters.

Related reading