Guide

Pydantic AI fundamentals explained

Most LLM agent code starts as a few strings and dicts, then grows into untyped glue that breaks silently when models return malformed JSON. Pydantic AI is the agent framework from the Pydantic team that treats types as the contract: you declare an Agent with a result schema, inject dependencies through RunContext, register validated tools, and let the library coerce model output into Pydantic models — retrying or surfacing validation errors instead of passing bad data downstream. Where LangChain generalizes chains across the ecosystem and CrewAI models role-based crews, Pydantic AI optimizes for Python services that need reliable structured I/O — the same design philosophy that made Pydantic standard for API request bodies, now applied to agent runs. This guide covers core primitives, dependency injection, tools, result types, model providers, streaming and retries, a Harbor Support ticket classifier worked example, a framework decision table, common pitfalls, and a practitioner checklist.

What Pydantic AI is (and is not)

Pydantic AI is a type-first agent library for Python. You construct an Agent with a system prompt, an optional tool list, and a result_type (a Pydantic BaseModel or a built-in scalar). Each run() or run_sync() call returns a RunResult whose .data field has already been validated against that type (later releases rename these to output_type and .output). If the model's output still fails validation after the configured retries, the call raises an exception you can catch and route to a repair prompt or a fallback.

It is not a retrieval framework, a no-code agent builder, or a full graph orchestrator. Complex multi-day workflows with durable checkpoints and human interrupts may still need LangGraph. Reach for Pydantic AI when you are shipping FastAPI services or background workers that call an LLM once or in a short tool loop, when downstream code expects typed objects rather than raw strings, and when you already use Pydantic for configuration and API schemas in the same codebase.

Core primitives

Agent — configured with model, system prompt, tools, and result type.
RunContext — per-run context carrying typed dependencies (DB sessions, user IDs, config).
Tool — async or sync function registered on the agent; arguments validated via Pydantic.
result_type — Pydantic model the final answer must conform to.
RunResult — wrapper with .data, usage metadata, and message history.
Model — provider abstraction (OpenAI, Anthropic, Gemini, Ollama, etc.) swappable per agent.

Your first agent: structured classification

The canonical pattern is a classifier or extractor with a fixed output schema:

from pydantic import BaseModel, Field
from pydantic_ai import Agent

class TicketCategory(BaseModel):
    category: str = Field(description="billing, technical, or account")
    urgency: int = Field(ge=1, le=5)
    summary: str = Field(max_length=120)

agent = Agent(
    'openai:gpt-4o-mini',
    result_type=TicketCategory,
    system_prompt=(
        'Classify support tickets. '
        'Never invent account details the user did not provide.'
    ),
)

result = agent.run_sync('My invoice is wrong and I need a refund today.')
print(result.data.category)   # validated str
print(result.data.urgency)    # validated int 1–5

The model may emit JSON in various shapes; Pydantic AI parses, validates, and — on failure — can automatically re-prompt the model with validation errors (configurable retry count). This loop is the framework’s main reliability win over hand-rolled json.loads plus hope.
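To make the retry behaviour concrete, here is a hand-rolled sketch of a validate-and-re-prompt loop that uses only Pydantic. It is not Pydantic AI's internal implementation. `fake_model` and `run_with_retries` are hypothetical names, and `fake_model` stands in for an LLM call that returns an out-of-range urgency until it sees error feedback.

```python
from pydantic import BaseModel, Field, ValidationError


class TicketCategory(BaseModel):
    category: str = Field(description="billing, technical, or account")
    urgency: int = Field(ge=1, le=5)
    summary: str = Field(max_length=120)


def fake_model(prompt: str) -> str:
    """Hypothetical LLM stand-in: returns a bad payload until it sees feedback."""
    if "Fix these validation errors" in prompt:
        return '{"category": "billing", "urgency": 4, "summary": "Refund request"}'
    return '{"category": "billing", "urgency": 9, "summary": "Refund request"}'


def run_with_retries(prompt: str, retries: int = 1) -> TicketCategory:
    """Validate model output; on failure, re-prompt with the error text."""
    attempt_prompt = prompt
    for _ in range(retries + 1):
        raw = fake_model(attempt_prompt)
        try:
            return TicketCategory.model_validate_json(raw)
        except ValidationError as exc:
            # Feed the validation errors back so the next attempt can repair them.
            attempt_prompt = f"{prompt}\n\nFix these validation errors:\n{exc}"
    raise RuntimeError("model output failed validation after all retries")


ticket = run_with_retries("My invoice is wrong and I need a refund today.")
print(ticket.urgency)  # 4, returned by the second attempt
```

The first attempt fails the `le=5` constraint, so the loop only succeeds because the error text is appended to the next prompt. With `retries=0` the same call raises instead of returning bad data.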

Dependency injection with RunContext

Production agents need databases, feature flags, and tenant scoping — not global singletons. Pydantic AI passes a RunContext[DepsType] as the first argument to tools and dynamic system prompts:

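from dataclasses import dataclass
from typing import Any, Protocol

from pydantic_ai import Agent, RunContext


class Database(Protocol):
    # Placeholder interface; substitute your real database client.
    async def fetch_order(self, tenant_id: str, order_id: str) -> dict[str, Any]: ...


@dataclass
class SupportDeps:
    tenant_id: str
    db: Database


# Other Agent arguments (system prompt, result_type) are omitted here.
agent = Agent('openai:gpt-4o-mini', deps_type=SupportDeps)


@agent.tool
async def lookup_order(ctx: RunContext[SupportDeps], order_id: str) -> dict[str, Any]:
    """Look up one order by ID, scoped to the current tenant."""
    return await ctx.deps.db.fetch_order(ctx.deps.tenant_id, order_id)


async def main(db: Database) -> None:
    # `await` needs an async context: a FastAPI handler or asyncio.run(main(db)).
    result = await agent.run(
        'Where is order ORD-8821?',
        deps=SupportDeps(tenant_id='harbor', db=db),
    )
    print(result.data)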

Dependencies are explicit per run, which simplifies testing: pass a mock SupportDeps in unit tests without patching imports. Dynamic system prompts use the same pattern — a function (ctx: RunContext[SupportDeps]) -> str registered with @agent.system_prompt can inject tenant-specific policy text at runtime.
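The testing claim can be illustrated without calling a model at all. The sketch below exercises the tool body against an in-memory fake. `FakeDatabase` and `FakeContext` are hypothetical stand-ins invented for this example (the tool only reads `.deps`, so a small dataclass is enough here); a real test suite might build an actual RunContext instead.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


class FakeDatabase:
    """Hypothetical in-memory stand-in for the real database client."""

    def __init__(self, orders: dict[tuple[str, str], dict[str, Any]]) -> None:
        self._orders = orders

    async def fetch_order(self, tenant_id: str, order_id: str) -> dict[str, Any]:
        return self._orders[(tenant_id, order_id)]


@dataclass
class SupportDeps:
    tenant_id: str
    db: FakeDatabase


@dataclass
class FakeContext:
    """Minimal stand-in for RunContext: the tool only reads `.deps`."""

    deps: SupportDeps


async def lookup_order(ctx: FakeContext, order_id: str) -> dict[str, Any]:
    # Same body as the tool above, scoped to the tenant carried in deps.
    return await ctx.deps.db.fetch_order(ctx.deps.tenant_id, order_id)


db = FakeDatabase({("harbor", "ORD-8821"): {"status": "in_transit"}})
ctx = FakeContext(deps=SupportDeps(tenant_id="harbor", db=db))
order = asyncio.run(lookup_order(ctx, "ORD-8821"))
print(order["status"])  # in_transit
```

Because the tenant comes from `ctx.deps` rather than a global, a second test can pass a different `tenant_id` and confirm that the lookup fails for orders owned by another tenant.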

Tools: validated function calling

Tools are plain Python functions decorated with @agent.tool. Parameter types become JSON schemas the model sees; return values are serialized back into the conversation. Follow the same discipline as general function calling:

One responsibility per tool — search_kb not do_everything.
Docstrings become tool descriptions; write them for the model, not for humans skimming code.
Validate side effects inside the tool (auth checks, rate limits) — never trust the LLM to skip them.
Return structured dicts or Pydantic models, not prose paragraphs the model must re-parse.

Pydantic AI runs a tool loop until the model produces a final answer matching result_type or hits its retry limit. Cap iterations in production to control cost, for example with a request limit passed through UsageLimits, and log each tool invocation with tenant_id and latency.
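The rules about validating side effects and returning structured data might look like the following in a tool body. This is a sketch under stated assumptions: `ToolDeps`, `RefundResult`, `issue_refund`, and the `refund:write` permission string are all hypothetical, no payment system is called, and the function takes the deps object directly rather than a RunContext to stay self-contained.

```python
from dataclasses import dataclass, field

from pydantic import BaseModel


class RefundResult(BaseModel):
    """Structured return value the model can read without re-parsing prose."""

    ok: bool
    reason: str


@dataclass
class ToolDeps:
    tenant_id: str
    permissions: set[str] = field(default_factory=set)


def issue_refund(deps: ToolDeps, order_id: str, amount_cents: int) -> RefundResult:
    """Issue a refund for one order. Hypothetical body; nothing is charged."""
    # Enforce authorization here, regardless of what the model asked for.
    if "refund:write" not in deps.permissions:
        return RefundResult(ok=False, reason="caller lacks refund:write permission")
    if amount_cents <= 0:
        return RefundResult(ok=False, reason="amount must be positive")
    return RefundResult(ok=True, reason=f"refund queued for {order_id}")


denied = issue_refund(ToolDeps(tenant_id="harbor"), "ORD-8821", 1500)
print(denied.ok, denied.reason)
```

Returning a refusal as data, rather than raising, lets the model explain the denial to the user. Raising is the better choice when the run itself should stop.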

Model providers and configuration

Models are referenced by string shorthand such as 'openai:gpt-4o', 'anthropic:claude-3-5-sonnet-20241022', 'gemini-1.5-pro', or 'ollama:llama3.2', or constructed explicitly for custom endpoints. Accepted prefixes have changed between releases, so check the model list for your installed version. API keys flow from environment variables per provider convention. Swap models per agent without changing business logic: a cheap model for triage, a capable model for drafting.
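One way to keep the triage/drafting split out of business logic is a small lookup with per-task overrides. This is a minimal sketch: the `AGENT_MODEL_*` environment variable names and the `model_for` helper are assumptions made for this example, not a Pydantic AI convention, and the defaults reuse model strings from this guide.

```python
import os

# Hypothetical defaults; the task names and env variable prefix are illustrative.
DEFAULT_MODELS = {
    "triage": "openai:gpt-4o-mini",
    "drafting": "openai:gpt-4o",
}


def model_for(task: str) -> str:
    """Return the model string for a task, allowing a per-task env override."""
    if task not in DEFAULT_MODELS:
        raise KeyError(f"unknown task: {task}")
    return os.environ.get(f"AGENT_MODEL_{task.upper()}", DEFAULT_MODELS[task])


print(model_for("triage"))
```

The returned string can then be passed as the first argument when constructing each agent, so a model swap becomes a deployment change rather than a code change.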

For observability, enable OpenTelemetry instrumentation (built into recent Pydantic AI releases) or wrap run() calls with your existing tracing middleware. Pair with LLM observability practices: log prompts and completions with PII redaction, attribute cost per tenant, and alert on validation-retry spikes (often a sign of schema drift or model downgrades).
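The logging practices above can be sketched with the standard library alone. The patterns below are deliberately loose and will miss many kinds of PII; the 5% alert threshold and the helper names are assumptions chosen for illustration, not recommended values.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose match: nine or more digits with optional space or dash separators,
# which catches many phone and card numbers but not short order IDs.
LONG_NUMBER_RE = re.compile(r"\b\d(?:[ -]?\d){8,}\b")


def redact(text: str) -> str:
    """Mask emails and long digit runs before a prompt is written to logs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return LONG_NUMBER_RE.sub("[NUMBER]", text)


def retry_rate(runs: int, runs_with_retry: int) -> float:
    """Fraction of runs that needed at least one validation retry."""
    return runs_with_retry / runs if runs else 0.0


def should_alert(runs: int, runs_with_retry: int, threshold: float = 0.05) -> bool:
    # The threshold is a hypothetical default; tune it per deployment.
    return retry_rate(runs, runs_with_retry) > threshold


print(redact("Mail jo@example.com or call 555-123-4567 about ORD-8821"))
```

Regex redaction is a floor, not a guarantee. Teams with strict compliance requirements usually layer a dedicated PII detection step on top.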

Streaming

run_stream() opens a streamed run as an async context manager, and iterating its result yields partial text for UX-heavy surfaces (chat widgets, live drafts). Structured result_type streaming validates incrementally where the provider supports it; for strict schemas, many teams stream prose to the user but validate the final object before persisting it to the database.
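The stream-first, validate-last pattern can be shown without a provider. In this sketch `fake_stream` is a hypothetical stand-in for a provider stream, and `Draft` is an invented schema; the point is only that chunks are surfaced as they arrive while persistence waits for one final validation.

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Draft(BaseModel):
    subject: str = Field(max_length=80)
    body: str


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a provider stream of JSON text chunks."""
    chunks = ['{"subject": "Refund', ' update", "body": ', '"We have issued your refund."}']
    for chunk in chunks:
        yield chunk


async def stream_then_validate() -> Optional[Draft]:
    buffer = ""
    async for chunk in fake_stream():
        buffer += chunk  # a real UI would render each chunk here
    try:
        # Validate once, on the complete payload, before anything is persisted.
        return Draft.model_validate_json(buffer)
    except ValidationError:
        return None  # caller should skip persistence and fall back


draft = asyncio.run(stream_then_validate())
print(draft.subject if draft else "invalid draft")
```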

Worked example: Harbor Support ticket classifier

Harbor Logistics runs a shared inbox for 40 franchise depots. Tickets arrive as unstructured email; routing to billing, dispatch, or compliance teams was manual and slow. A Pydantic AI agent now makes the routing decision:

Schema — RoutingDecision with queue enum, priority, confidence float, and reason string capped at 200 chars.
Deps — HarborDeps with depot_id, read-only CRM client, and escalation policy version.
Tools — get_open_shipments (dispatch context), get_invoice_status (billing), no write tools on the classifier itself.
Dynamic prompt — injects depot SLA table and forbidden auto-actions list from policy version.
Guardrail — if confidence < 0.7, route to human triage regardless of model queue choice.

The FastAPI endpoint calls agent.run(message, deps=...) and persists result.data to Postgres JSONB (for example via result.data.model_dump(mode='json')) with no intermediate parsing step. Validation retries dropped from 12% of runs to under 2% after tightening field descriptions and adding two few-shot examples in the system prompt.
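The schema and guardrail described above could be sketched as follows. The queue names and the 0.7 floor come from the description; the 1 to 5 priority range, the `human_triage` queue value, and the `apply_guardrail` helper are assumptions added for illustration.

```python
from enum import Enum

from pydantic import BaseModel, Field


class Queue(str, Enum):
    BILLING = "billing"
    DISPATCH = "dispatch"
    COMPLIANCE = "compliance"
    HUMAN_TRIAGE = "human_triage"  # assumed name for the fallback queue


class RoutingDecision(BaseModel):
    queue: Queue
    priority: int = Field(ge=1, le=5, description="1 = lowest, 5 = most urgent")
    confidence: float = Field(ge=0.0, le=1.0)
    reason: str = Field(max_length=200)


CONFIDENCE_FLOOR = 0.7


def apply_guardrail(decision: RoutingDecision) -> RoutingDecision:
    """Override the model's queue choice when confidence is below the floor."""
    if decision.confidence < CONFIDENCE_FLOOR:
        return decision.model_copy(update={"queue": Queue.HUMAN_TRIAGE})
    return decision


decision = apply_guardrail(
    RoutingDecision(
        queue=Queue.BILLING,
        priority=3,
        confidence=0.55,
        reason="Mentions an invoice but the amount is unclear.",
    )
)
print(decision.queue.value)  # human_triage
```

Keeping the guardrail in plain code, outside the prompt, means the threshold holds even when the model is confidently wrong about the queue.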

Framework decision table

Need	Pydantic AI	LangChain	CrewAI	LangGraph
Typed agent result objects	Native	Via parsers	Manual	Typed state
DI for DB/config per run	RunContext	Configurable	Limited	State reducers
Role-based multi-agent crews	No	Patterns	Native	Graph nodes
Durable checkpoints / HITL	Limited	LangGraph	Flow	Native
RAG document pipelines	Bring your own	Rich	Tools only	Bring your own
FastAPI microservice fit	Excellent	Good	Moderate	Moderate
YAML non-dev config	No	Some	Strong	Some

Choose Pydantic AI for single-agent or short tool-loop services where Pydantic models are already the lingua franca. Escalate to LangGraph when workflows need branching graphs with interrupts; to CrewAI when personas and task pipelines are the primary abstraction.

Common pitfalls

Over-wide result models — twenty optional fields the model guesses; keep schemas minimal and required fields explicit.
Vague Field descriptions — validation retries burn tokens; descriptions are prompt engineering, not comments.
Write tools without auth — the model will call them; enforce permissions inside every tool using RunContext deps.
Unbounded tool loops — validation retries stack on top of tool iterations, so one stuck run can cost many times a normal one; set caps and log runaway patterns.
Mixing sync and async incorrectly — use run_sync only in scripts; prefer await agent.run() in FastAPI and workers.
Skipping few-shot examples — structured output quality jumps with 2–3 in-prompt examples matching result_type.
No fallback on validation failure — after max retries, route to human review instead of returning partial garbage.
Ignoring TypeScript parity — teams on Node should compare with Zod validation patterns at API boundaries even when agents run in Python.
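For the fallback pitfall in particular, a thin wrapper around the classification call is often enough. In this sketch `classify_or_escalate` and `Outcome` are hypothetical names, and `classify` is any callable that returns a validated payload or raises; a real service would catch the library's specific exceptions rather than bare Exception.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    routed_to: str
    payload: Optional[dict]
    error: Optional[str] = None


def classify_or_escalate(classify: Callable[[str], dict], message: str) -> Outcome:
    """Run the classifier; on any failure, route the ticket to human review."""
    try:
        payload = classify(message)
    except Exception as exc:  # broad on purpose in this sketch
        return Outcome(routed_to="human_review", payload=None, error=str(exc))
    return Outcome(routed_to=payload["queue"], payload=payload)


def always_fails(message: str) -> dict:
    """Stand-in for a run that exhausted its validation retries."""
    raise ValueError("validation failed")


outcome = classify_or_escalate(always_fails, "My invoice is wrong.")
print(outcome.routed_to)  # human_review
```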

Production checklist

Define result_type as a Pydantic model with Field descriptions for every property.
Create a deps_type dataclass for DB clients, tenant scope, and feature flags.
Register tools with narrow signatures; enforce auth inside tool bodies via ctx.deps.
Add 2–3 few-shot examples to the system prompt matching the result schema.
Configure max_retries and tool iteration limits per environment.
Wire OpenTelemetry or existing tracing around run() calls.
Log validation failures separately from model errors; alert on retry-rate spikes.
Implement human fallback when confidence or validation fails after retries.
Unit-test tools with mock RunContext deps; integration-test full runs with recorded fixtures.
Document model string and API key rotation in deployment config.
Redact PII in logged prompts; retention policy aligned with compliance.
Benchmark cost per classified ticket against a rules baseline quarterly.

Key takeaways

Pydantic AI makes Pydantic models the contract for agent inputs, tools, dependencies, and final results.
RunContext dependency injection keeps agents testable and tenant-safe without global state.
Validation retries turn malformed model JSON from silent bugs into recoverable loops.
Best fit: Python microservices needing reliable structured LLM output, not sprawling multi-agent graphs.
Pair with LangGraph or CrewAI when orchestration complexity outgrows a single typed agent.