Pydantic AI fundamentals explained
Most LLM agent code starts as a few strings and dicts, then grows into untyped glue that
breaks silently when models return malformed JSON. Pydantic AI is the agent
framework from the Pydantic team that treats types as the contract: you
declare an Agent with a result schema, inject dependencies through
RunContext, register validated tools, and let the library coerce model output
into Pydantic models — retrying or surfacing validation errors instead of passing bad
data downstream. Where LangChain generalizes chains across the ecosystem and CrewAI
models role-based crews, Pydantic AI optimizes for Python services that need
reliable structured I/O — the same design philosophy that made Pydantic
standard for API request bodies, now applied to agent runs. This guide covers core
primitives, dependency injection, tools, result types, model providers, streaming and retries,
a Harbor Support ticket classifier worked example, a framework decision table, common
pitfalls, and a practitioner checklist.
What Pydantic AI is (and is not)
Pydantic AI is a type-first agent library for Python. You construct an
Agent with a system prompt, optional tool list, and a
result_type (a Pydantic BaseModel or built-in scalar). Each
run() or run_sync() call returns a RunResult whose
.data field is already validated. If the output still fails validation after the
configured retries, the run raises an exception (UnexpectedModelBehavior) that you
can catch and route to a fallback.
It is not a retrieval framework, a no-code agent builder, or a full graph orchestrator. Complex multi-day workflows with durable checkpoints and human interrupts may still need LangGraph. Reach for Pydantic AI when you are shipping FastAPI or background workers that call an LLM once or in a short tool loop, when downstream code expects typed objects not raw strings, and when you already use Pydantic for configuration and API schemas in the same codebase.
Core primitives
- Agent: configured with model, system prompt, tools, and result type.
- RunContext: per-run context carrying typed dependencies (DB sessions, user IDs, config).
- Tool: async or sync function registered on the agent; arguments validated via Pydantic.
- result_type: Pydantic model the final answer must conform to.
- RunResult: wrapper with .data, usage metadata, and message history.
- Model: provider abstraction (OpenAI, Anthropic, Gemini, Ollama, etc.) swappable per agent.
Your first agent: structured classification
The canonical pattern is a classifier or extractor with a fixed output schema:
from pydantic import BaseModel, Field
from pydantic_ai import Agent

class TicketCategory(BaseModel):
    category: str = Field(description="billing, technical, or account")
    urgency: int = Field(ge=1, le=5)
    summary: str = Field(max_length=120)

agent = Agent(
    'openai:gpt-4o-mini',
    result_type=TicketCategory,
    system_prompt=(
        'Classify support tickets. '
        'Never invent account details the user did not provide.'
    ),
)

result = agent.run_sync('My invoice is wrong and I need a refund today.')
print(result.data.category)  # validated str
print(result.data.urgency)   # validated int 1–5
The model may emit JSON in various shapes; Pydantic AI parses, validates, and — on
failure — can automatically re-prompt the model with validation errors (configurable
retry count). This loop is the framework’s main reliability win over hand-rolled
json.loads plus hope.
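The parse, validate, and re-prompt loop described above can be sketched without the framework. This is a minimal illustration, assuming Pydantic v2; it is not Pydantic AI's internal implementation. The names `run_with_retries`, `call_model`, and `fake_model` are hypothetical, and the fake model stands in for a real provider call.

```python
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class TicketCategory(BaseModel):
    category: str = Field(description="billing, technical, or account")
    urgency: int = Field(ge=1, le=5)
    summary: str = Field(max_length=120)


def run_with_retries(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> TicketCategory:
    """Validate raw model text; on failure, re-prompt with the error details."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            return TicketCategory.model_validate_json(raw)
        except ValidationError as exc:
            # Feed the validation errors back so the model can repair its output.
            current_prompt = f"{prompt}\n\nYour last reply was invalid:\n{exc}"
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Hypothetical stand-in for a model: the first reply breaks the schema
# (urgency out of range), the second one is valid.
_replies = iter([
    '{"category": "billing", "urgency": 9, "summary": "Refund request"}',
    '{"category": "billing", "urgency": 4, "summary": "Refund request"}',
])


def fake_model(prompt: str) -> str:
    return next(_replies)


ticket = run_with_retries(fake_model, "My invoice is wrong and I need a refund today.")
print(ticket.category, ticket.urgency)
```

The attempt cap matters as much as the retry itself: without it, a schema the model cannot satisfy turns into an unbounded spend.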
Dependency injection with RunContext
Production agents need databases, feature flags, and tenant scoping — not global
singletons. Pydantic AI passes a RunContext[DepsType] as the first argument
to tools and dynamic system prompts:
from dataclasses import dataclass
from pydantic_ai import Agent, RunContext

@dataclass
class SupportDeps:
    tenant_id: str
    db: Database  # your own async database client type

agent = Agent(
    'openai:gpt-4o-mini',
    deps_type=SupportDeps,
    # ... remaining arguments elided
)

@agent.tool
async def lookup_order(ctx: RunContext[SupportDeps], order_id: str) -> dict:
    # Scope the lookup to the tenant supplied for this run.
    return await ctx.deps.db.fetch_order(ctx.deps.tenant_id, order_id)

# Inside an async function, with db already constructed:
result = await agent.run(
    'Where is order ORD-8821?',
    deps=SupportDeps(tenant_id='harbor', db=db),
)
Dependencies are explicit per run, which simplifies testing: pass a mock SupportDeps
in unit tests without patching imports. Dynamic system prompts use the same pattern —
a function (ctx: RunContext[SupportDeps]) -> str registered with
@agent.system_prompt can inject tenant-specific policy text at runtime.
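To make the testing claim concrete, here is a rough sketch of exercising the tool logic with fake dependencies and no patching. It uses only the standard library. `FakeDatabase` and `FakeContext` are hypothetical stand-ins: `FakeContext` exposes only the `.deps` attribute the tool reads and is not the real RunContext class.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


class FakeDatabase:
    """In-memory stand-in for the real async database client."""

    def __init__(self, orders: dict[tuple[str, str], dict[str, Any]]) -> None:
        self._orders = orders

    async def fetch_order(self, tenant_id: str, order_id: str) -> dict[str, Any]:
        # Lookups are keyed by tenant, so one tenant cannot read another's orders.
        return self._orders.get((tenant_id, order_id), {})


@dataclass
class SupportDeps:
    tenant_id: str
    db: FakeDatabase


@dataclass
class FakeContext:
    """Hypothetical minimal context exposing only .deps."""
    deps: SupportDeps


async def lookup_order(ctx: FakeContext, order_id: str) -> dict[str, Any]:
    return await ctx.deps.db.fetch_order(ctx.deps.tenant_id, order_id)


db = FakeDatabase({("harbor", "ORD-8821"): {"status": "in_transit"}})
found = asyncio.run(lookup_order(FakeContext(SupportDeps("harbor", db)), "ORD-8821"))
other = asyncio.run(lookup_order(FakeContext(SupportDeps("acme", db)), "ORD-8821"))
print(found, other)
```

Because the tenant comes from the dependencies rather than from a model-supplied argument, the second call returns nothing even though the order ID is identical.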
Tools: validated function calling
Tools are plain Python functions decorated with @agent.tool. Parameter types
become JSON schemas the model sees; return values are serialized back into the
conversation. Follow the same discipline as general
function calling:
- One responsibility per tool: search_kb, not do_everything.
- Docstrings become tool descriptions; write them for the model, not for humans skimming code.
- Validate side effects inside the tool (auth checks, rate limits); never trust the LLM to skip them.
- Return structured dicts or Pydantic models, not prose paragraphs the model must re-parse.
Pydantic AI runs a tool loop until the model produces a final answer matching
result_type or hits max_retries. Cap iterations in production
to control cost; log each tool invocation with tenant_id and latency.
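One way to log each tool invocation with tenant_id and latency is a small decorator around the tool body. The sketch below uses only the standard library. The `logged_tool` decorator and the in-memory `call_log` are hypothetical names, and tenant_id is passed as a plain first argument for brevity, whereas in a Pydantic AI tool it would come from ctx.deps.

```python
import asyncio
import functools
import logging
import time
from typing import Any, Awaitable, Callable

logger = logging.getLogger("tools")
call_log: list[dict[str, Any]] = []  # kept in memory so the example is inspectable


def logged_tool(fn: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    """Record tool name, tenant_id, and latency for each invocation."""

    @functools.wraps(fn)
    async def wrapper(tenant_id: str, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return await fn(tenant_id, *args, **kwargs)
        finally:
            # Runs on success and on failure, so slow failing calls are logged too.
            latency_ms = (time.perf_counter() - start) * 1000
            entry = {"tool": fn.__name__, "tenant_id": tenant_id, "latency_ms": latency_ms}
            call_log.append(entry)
            logger.info("tool call %s", entry)

    return wrapper


@logged_tool
async def search_kb(tenant_id: str, query: str) -> dict[str, Any]:
    # Placeholder body; a real tool would query a knowledge base.
    return {"query": query, "hits": []}


result = asyncio.run(search_kb("harbor", "refund policy"))
print(result, call_log[0]["tool"])
```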
Model providers and configuration
Models are referenced by string shorthand — 'openai:gpt-4o',
'anthropic:claude-3-5-sonnet-20241022', 'gemini-1.5-pro',
'ollama:llama3.2' — or constructed explicitly for custom endpoints.
API keys flow from environment variables per provider convention. Swap models per agent
without changing business logic: a cheap model for triage, a capable model for drafting.
For observability, enable OpenTelemetry instrumentation (built into recent Pydantic AI
releases) or wrap run() calls with your existing tracing middleware. Pair with LLM
observability practices: log prompts and completions with PII redaction, attribute cost
per tenant, and
alert on validation-retry spikes (often a sign of schema drift or model downgrades).
Streaming
run_stream() yields partial text for UX-heavy surfaces (chat widgets, live
drafts). Structured result_type streaming validates incrementally where the
provider supports it; for strict schemas, many teams stream prose to the user but validate
the final object before persisting to the database.
Worked example: Harbor Support ticket classifier
Harbor Logistics runs a shared inbox for 40 franchise depots. Tickets arrive as unstructured email; routing to billing, dispatch, or compliance teams was manual and slow. A Pydantic AI agent replaces the rules engine:
- Schema: RoutingDecision with queue enum, priority, confidence float, and reason string capped at 200 chars.
- Deps: HarborDeps with depot_id, read-only CRM client, and escalation policy version.
- Tools: get_open_shipments (dispatch context), get_invoice_status (billing), no write tools on the classifier itself.
- Dynamic prompt: injects depot SLA table and forbidden auto-actions list from policy version.
- Guardrail: if confidence < 0.7, route to human triage regardless of model queue choice.
A FastAPI endpoint calls agent.run(message, deps=...) and persists
result.data directly to Postgres JSONB, with no intermediate parsing step.
Validation retries dropped from 12% of runs to under 2% after tightening field descriptions
and adding two few-shot examples in the system prompt.
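The schema and guardrail from this example could look roughly like the following, assuming Pydantic v2. Several details are assumptions rather than facts from the Harbor setup: the queue member names, the `human_triage` value, the 1 to 5 priority range, and the `apply_guardrail` helper are all illustrative.

```python
from enum import Enum

from pydantic import BaseModel, Field


class Queue(str, Enum):
    BILLING = "billing"
    DISPATCH = "dispatch"
    COMPLIANCE = "compliance"
    HUMAN_TRIAGE = "human_triage"  # assumed name for the human fallback queue


class RoutingDecision(BaseModel):
    queue: Queue
    priority: int = Field(ge=1, le=5)  # assumed scale; the source gives no range
    confidence: float = Field(ge=0.0, le=1.0)
    reason: str = Field(max_length=200)


CONFIDENCE_FLOOR = 0.7


def apply_guardrail(decision: RoutingDecision) -> RoutingDecision:
    """Send low-confidence decisions to human triage, whatever queue the model chose."""
    if decision.confidence < CONFIDENCE_FLOOR:
        return decision.model_copy(update={"queue": Queue.HUMAN_TRIAGE})
    return decision


low = RoutingDecision(queue="billing", priority=2, confidence=0.55, reason="Mentions an invoice.")
high = RoutingDecision(queue="dispatch", priority=4, confidence=0.91, reason="Late shipment.")
print(apply_guardrail(low).queue.value, apply_guardrail(high).queue.value)
```

Keeping the guardrail in ordinary code, outside the prompt, means the threshold holds even when the model is confidently wrong about its own queue choice.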
Framework decision table
| Need | Pydantic AI | LangChain | CrewAI | LangGraph |
|---|---|---|---|---|
| Typed agent result objects | Native | Via parsers | Manual | Typed state |
| DI for DB/config per run | RunContext | Configurable | Limited | State reducers |
| Role-based multi-agent crews | No | Patterns | Native | Graph nodes |
| Durable checkpoints / HITL | Limited | LangGraph | Flow | Native |
| RAG document pipelines | Bring your own | Rich | Tools only | Bring your own |
| FastAPI microservice fit | Excellent | Good | Moderate | Moderate |
| YAML non-dev config | No | Some | Strong | Some |
Choose Pydantic AI for single-agent or short tool-loop services where Pydantic models are already the lingua franca. Escalate to LangGraph when workflows need branching graphs with interrupts; to CrewAI when personas and task pipelines are the primary abstraction.
Common pitfalls
- Over-wide result models: twenty optional fields the model guesses; keep schemas minimal and required fields explicit.
- Vague Field descriptions: validation retries burn tokens; descriptions are prompt engineering, not comments.
- Write tools without auth: the model will call them; enforce permissions inside every tool using RunContext deps.
- Unbounded tool loops: default retries plus tool iterations can 10x cost; set caps and log runaway patterns.
- Mixing sync and async incorrectly: use run_sync only in scripts; prefer await agent.run() in FastAPI and workers.
- Skipping few-shot examples: structured output quality jumps with 2–3 in-prompt examples matching result_type.
- No fallback on validation failure: after max retries, route to human review instead of returning partial garbage.
- Ignoring TypeScript parity: teams on Node should compare with Zod validation patterns at API boundaries even when agents run in Python.
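For the write-tools pitfall, the permission check belongs in the tool body itself. The following standard-library sketch shows the shape of that check. `ToolDeps`, `require_scope`, `issue_refund`, and the `refunds:write` scope string are hypothetical names, and in a Pydantic AI tool the deps object would be read from ctx.deps.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised when the caller's deps do not grant the required scope."""


@dataclass
class ToolDeps:
    tenant_id: str
    scopes: frozenset[str] = field(default_factory=frozenset)


def require_scope(deps: ToolDeps, scope: str) -> None:
    if scope not in deps.scopes:
        raise PermissionDenied(f"tenant {deps.tenant_id} lacks scope {scope!r}")


def issue_refund(deps: ToolDeps, invoice_id: str, amount_cents: int) -> dict:
    # The check runs inside the tool, so it holds no matter what the model asks for.
    require_scope(deps, "refunds:write")
    return {"invoice_id": invoice_id, "refunded_cents": amount_cents}


allowed = issue_refund(ToolDeps("harbor", frozenset({"refunds:write"})), "INV-1", 500)

try:
    issue_refund(ToolDeps("acme"), "INV-1", 500)
    denied = False
except PermissionDenied:
    denied = True

print(allowed, denied)
```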
Production checklist
- Define result_type as a Pydantic model with Field descriptions for every property.
- Create a deps_type dataclass for DB clients, tenant scope, and feature flags.
- Register tools with narrow signatures; enforce auth inside tool bodies via ctx.deps.
- Add 2–3 few-shot examples to the system prompt matching the result schema.
- Configure max_retries and tool iteration limits per environment.
- Wire OpenTelemetry or existing tracing around run() calls.
- Log validation failures separately from model errors; alert on retry-rate spikes.
- Implement human fallback when confidence or validation fails after retries.
- Unit-test tools with mock RunContext deps; integration-test full runs with recorded fixtures.
- Document model string and API key rotation in deployment config.
- Redact PII in logged prompts; retention policy aligned with compliance.
- Benchmark cost per classified ticket against a rules baseline quarterly.
Key takeaways
- Pydantic AI makes Pydantic models the contract for agent inputs, tools, dependencies, and final results.
- RunContext dependency injection keeps agents testable and tenant-safe without global state.
- Validation retries turn malformed model JSON from silent bugs into recoverable loops.
- Best fit: Python microservices needing reliable structured LLM output, not sprawling multi-agent graphs.
- Pair with LangGraph or CrewAI when orchestration complexity outgrows a single typed agent.
Related reading
- Python fundamentals explained — dataclasses, typing, and async patterns agents build on
- LLM function calling explained — designing tools agents invoke safely
- Zod fundamentals explained — runtime validation and type inference in TypeScript stacks
- LangChain fundamentals explained — broader orchestration when Pydantic AI is not enough