DSPy fundamentals explained
Most teams ship LLM features by hand-tuning prompts in a notebook, then watching quality collapse when the model vendor ships an update. DSPy (Declarative Self-improving Python) from Stanford’s NLP group reframes the problem: you write programs composed of typed Signatures and reusable Modules, then let teleprompters (optimizers) search for better prompts, few-shot examples, and even weights — against a metric you define on a dev set. Where LangChain generalizes chains and Pydantic AI enforces typed agent I/O at runtime, DSPy optimizes the prompting strategy itself as a compile step. It pairs naturally with RAG pipelines and sits upstream of fine-tuning when labeled data is scarce but evaluation examples exist. This guide covers Signatures, Modules, LM configuration, teleprompters, metrics and assertions, a Harbor Analytics policy Q&A optimizer worked example, a framework decision table, common pitfalls, and a practitioner checklist.
What DSPy is (and is not)
DSPy is a Python framework for building and optimizing LM pipelines. You declare what a step should do (inputs and outputs) without hard-coding how to phrase the prompt. At compile time, a teleprompter runs your program on training examples, proposes candidate prompts and demonstrations, scores them with your metric, and writes the winning configuration back into the module.
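The compile loop described above can be pictured with a toy in plain Python. This is only a sketch of the idea (score each candidate configuration with a metric, keep the best), not DSPy's actual search. The names `run_program`, `compile_best`, and the keyword-table "candidates" are hypothetical stand-ins for an LM call and for candidate prompts.

```python
from statistics import mean

def run_program(keywords, email):
    """Hypothetical stand-in for an LM call: route by the first matching keyword."""
    text = email.lower()
    for word, queue in keywords.items():
        if word in text:
            return queue
    return "dispatch"  # fallback queue when nothing matches

def exact_match(gold, pred):
    """Metric: 1.0 when the predicted queue equals the gold queue, else 0.0."""
    return float(gold == pred)

def compile_best(candidates, trainset, metric):
    """Score every candidate configuration on the trainset and keep the best one."""
    scored = []
    for name, config in candidates.items():
        score = mean(metric(gold, run_program(config, email)) for email, gold in trainset)
        scored.append((score, name))
    best_score, best_name = max(scored)
    return best_name, best_score

trainset = [
    ("Refund request for invoice 4421", "billing"),
    ("Driver late to depot 7", "dispatch"),
    ("Audit log export needed", "compliance"),
    ("Invoice total looks wrong", "billing"),
]
# Two candidate "configurations"; in DSPy these would be instructions and demonstrations.
candidates = {
    "v1": {"refund": "billing"},
    "v2": {"refund": "billing", "invoice": "billing", "audit": "compliance"},
}
best_name, best_score = compile_best(candidates, trainset, exact_match)
print(best_name, best_score)
```

The point of the toy is the division of labor: the program and the metric are written once, and the winning configuration is selected by measurement rather than by hand.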
It is not a chat UI, a vector database, or a model-hosting platform. It does not replace production serving infrastructure. Complex multi-day agent graphs with human interrupts still belong in LangGraph. Reach for DSPy when you have a repeatable task (classification, extraction, short reasoning, RAG answers) and a dev set with a clear metric — and you are tired of brittle prompt strings that nobody can reproduce.
Core primitives
- Signature: a declarative input/output field schema, written either as an inline string such as "question -> answer" or as a subclass of dspy.Signature.
- Module: a composable building block (Predict, ChainOfThought, ReAct, Retrieve) that implements a Signature.
- Example: a labeled dspy.Example; .with_inputs() marks which fields are inputs, and the remaining fields are treated as labels that the program does not receive at inference.
- LM: a configured language model, created with dspy.LM and registered with dspy.configure(lm=...).
- Teleprompter: an optimizer (BootstrapFewShot, MIPROv2, COPRO) that compiles prompts from data.
- Metric: a Python function that scores a prediction against the gold label and drives optimization.
Signatures: declare the contract
A Signature is the typed boundary between your application and the model. Inline form is concise for prototypes:
```python
import dspy

# Assumes an LM has already been configured with dspy.configure(lm=...).
classify = dspy.Predict("email -> category, confidence")
result = classify(email="Refund request for invoice 4421")
print(result.category, result.confidence)
```
For production, subclass dspy.Signature and add field descriptions. They become part of the compiled prompt and materially affect optimization quality:
```python
class RouteTicket(dspy.Signature):
    """Route Harbor support email to the correct queue."""

    email_body: str = dspy.InputField(desc="Full plaintext email")
    queue: str = dspy.OutputField(desc="billing, dispatch, or compliance")
    reason: str = dspy.OutputField(desc="One sentence justification")
```
Signatures are composable: a RAG pipeline might chain "context, question -> search_query" with "context, question -> answer". Keep output fields minimal; every extra field is another surface for the optimizer and the metric to disagree on.
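To make the inline form concrete, here is a small plain-Python sketch that splits a spec string into input and output field names. It is an illustration of the contract only, not DSPy's own parser; `parse_signature` is a hypothetical helper.

```python
def parse_signature(spec):
    """Split an inline spec such as 'context, question -> answer' into field names."""
    left, arrow, right = spec.partition("->")
    if not arrow:
        raise ValueError(f"missing '->' in signature: {spec!r}")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError(f"signature needs at least one input and one output: {spec!r}")
    return inputs, outputs

inputs, outputs = parse_signature("email -> category, confidence")
print(inputs, outputs)
```

Counting the names on the right-hand side is a quick way to audit how many output fields a metric will have to account for.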
Modules: Predict, ChainOfThought, and beyond
Modules wrap Signatures with prompting strategies. The optimizer tunes the strategy, not just the instructions.
Common modules
- Predict: direct input-to-output mapping; fastest and cheapest when the task is shallow.
- ChainOfThought: adds a reasoning field before the answer; use when intermediate steps improve accuracy on math, policy, or multi-hop questions.
- ReAct: interleaves tool calls and reasoning; pairs with dspy.Tool definitions for search or calculators.
- Retrieve: fetches passages from a retriever you attach; standard building block for RAG programs.
- Program / Module subclass: custom forward() methods composing multiple steps with shared state.
A minimal RAG program might look like:
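```python
class PolicyQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # The retriever backend must be configured separately.
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Fetch the top-k passages, then answer with them as context.
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```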
The retriever (ColBERT, BM25, embedding search) stays pluggable; DSPy optimizes how the generator uses the retrieved context string, including which few-shot demonstrations teach faithful citation behavior.
Configuring language models
DSPy 2.x uses dspy.LM with provider strings such as dspy.LM('openai/gpt-4o-mini') or dspy.LM('anthropic/claude-3-5-sonnet-20241022'). Call dspy.configure(lm=lm) once at process start. For local models, point at an OpenAI-compatible endpoint (Ollama, vLLM) with the api_base and api_key parameters.
Optimization is model-specific: compile against the same model family you deploy, or re-run teleprompters after upgrades. A common pattern is to optimize on a capable teacher model and then evaluate the distilled prompts on a cheaper student, but always verify the metric on the student before shipping. Log token usage during compilation; BootstrapFewShot can burn thousands of calls if your dev set and candidate pool are large.
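One way to keep compile spend visible is a small usage log that accumulates calls and tokens and converts them to a cost estimate. The sketch below is plain Python and does not hook into DSPy; `UsageLog`, the token counts, and the per-million-token prices are all hypothetical placeholders, not real vendor pricing.

```python
from dataclasses import dataclass

@dataclass
class UsageLog:
    """Accumulates call counts and token totals for one compile run."""
    calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, prompt_tokens, completion_tokens):
        """Add one LM call to the running totals."""
        self.calls += 1
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens

    def cost_usd(self, usd_per_1m_prompt, usd_per_1m_completion):
        """Estimate spend from per-million-token prices supplied by the caller."""
        total = (self.prompt_tokens * usd_per_1m_prompt
                 + self.completion_tokens * usd_per_1m_completion)
        return total / 1_000_000

log = UsageLog()
# Hypothetical workload: 200 examples scored against 10 candidates.
for _ in range(200 * 10):
    log.record(prompt_tokens=1500, completion_tokens=200)

estimated = log.cost_usd(usd_per_1m_prompt=1.0, usd_per_1m_completion=4.0)
print(log.calls, round(estimated, 2))
```

Multiplying examples by candidates before launching a compile gives a rough upper bound on calls, which is usually enough to decide whether to cap trials.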
Teleprompters: compile prompts from data
Teleprompters are DSPy’s optimizers. You pass a student module, a metric, and training examples; they return a compiled module with tuned instructions and demonstrations.
Optimizer selection
- BootstrapFewShot — generates and filters few-shot examples; fast baseline when you have 20–200 labeled examples.
- BootstrapFewShotWithRandomSearch — explores multiple demonstration sets; better quality, higher compile cost.
- MIPROv2 — joint instruction and demonstration search; strong default for medium dev sets when budget allows.
- COPRO — coordinate ascent over instructions; useful when examples are scarce but you can iterate on phrasing.
- BootstrapFinetune — collects traces and fine-tunes weights; bridges toward full fine-tuning when API-only prompts plateau.
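```python
from dspy.teleprompt import BootstrapFewShot

def exact_match(example, pred, trace=None):
    # Case-insensitive comparison of the gold queue and the predicted queue.
    return example.queue.lower() == pred.queue.lower()

# RouteModule and train_examples are assumed to be defined elsewhere in the application.
teleprompter = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_router = teleprompter.compile(student=RouteModule(), trainset=train_examples)
```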
Hold out a validation set that is never seen during compile. Overfitting demonstrations to 50 training rows is a common DSPy failure mode: your metric on train hits 95% while production emails from a new depot cluster fail silently.
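A deterministic split makes the holdout reproducible across compile runs. The following is a minimal sketch in plain Python; `split_examples` and `generalization_gap` are hypothetical helpers, and the 25% holdout fraction and the scores passed in at the end are arbitrary illustrations rather than measured results.

```python
import random

def split_examples(examples, holdout_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and return (train, holdout) with no overlap."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def generalization_gap(train_score, holdout_score):
    """A large positive gap suggests the demonstrations overfit the trainset."""
    return train_score - holdout_score

examples = [f"email-{i}" for i in range(50)]
train, holdout = split_examples(examples)
print(len(train), len(holdout))

# Illustrative numbers only: a 0.95 train score against a 0.70 holdout score.
gap = generalization_gap(0.95, 0.70)
```

Compile only on `train`, and report the metric on `holdout` alongside the train score so the gap is visible in every release.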
Metrics, evaluation, and assertions
Your metric is the objective function. Design it to match business harm, not leaderboard aesthetics:
- Classification — exact match, weighted F1, or cost-sensitive scores (false compliance routing > false billing).
- Generation — token F1, ROUGE, or LLM-as-judge with a rubric; see LLM evaluation for caveats on judge bias.
- RAG — answer correctness conditioned on citation overlap with gold spans.
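A cost-sensitive classification score can be sketched as follows. This assumes the penalty depends on which gold queue was missed, so a misrouted compliance ticket costs more than a misrouted billing ticket. The weights in `MISROUTE_COST` and the function name are hypothetical and would need to come from your own incident data.

```python
# Hypothetical penalties for missing each gold queue; higher means more harmful.
MISROUTE_COST = {"compliance": 1.0, "dispatch": 0.5, "billing": 0.3}

def cost_sensitive_score(gold_queue, predicted_queue):
    """Return 1.0 for a correct route, otherwise 1.0 minus the cost of the missed queue."""
    gold = gold_queue.strip().lower()
    pred = predicted_queue.strip().lower()
    if gold == pred:
        return 1.0
    # Unknown gold queues are treated as maximally costly.
    return 1.0 - MISROUTE_COST.get(gold, 1.0)

print(cost_sensitive_score("compliance", "billing"))
print(cost_sensitive_score("billing", "dispatch"))
```

Because the optimizer maximizes the average score, weighting errors this way steers it toward configurations that protect the expensive queue first.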
dspy.Assert and dspy.Suggest add runtime constraints during optimization (e.g. “answer must contain a policy section ID”). Assertions backtrack and retry within a trace; use them sparingly in production inference because retries multiply latency. Prefer baking hard constraints into the metric when possible.
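The latency cost of retrying is easy to see in a plain-Python sketch of a constraint-and-retry loop. This is not how DSPy implements assertions; `answer_with_retries`, the stub generator, and the `SEC-12` style of section ID are all hypothetical.

```python
import re

SECTION_ID = re.compile(r"\bSEC-\d+\b")  # hypothetical policy section ID format

def answer_with_retries(generate, question, max_retries=2):
    """Call generate until the answer contains a section ID or retries run out."""
    feedback = None
    for attempt in range(1, max_retries + 2):
        answer = generate(question, feedback)
        if SECTION_ID.search(answer):
            return answer, attempt
        # Pass the violated constraint back so the next attempt can correct it.
        feedback = "Answer must contain a policy section ID such as SEC-12."
    return answer, attempt

calls = []

def stub_generate(question, feedback):
    """Hypothetical stand-in for an LM: only cites a section once it receives feedback."""
    calls.append(feedback)
    if feedback is None:
        return "Records are retained for seven years."
    return "Records are retained for seven years (SEC-12)."

answer, attempts = answer_with_retries(stub_generate, "How long are records retained?")
print(attempts, answer)
```

Every failed check adds a full model call, so the worst case for one request is `max_retries + 1` calls. That is the reason to move hard constraints into the compile-time metric where possible.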
Worked example: Harbor Analytics policy Q&A optimizer
Harbor Analytics maintains 120 internal compliance policies as markdown files. A naive RAG chatbot hallucinated section numbers on 18% of audit-trail questions. The team rebuilt the flow in DSPy:
- Data: 85 dspy.Example rows from past auditor queries with gold answers and mandatory citation spans; 20 held out for validation.
- Program: PolicyQA module with Retrieve(k=8) over a ColBERTv2 index and a ChainOfThought generator whose Signature requires answer and citations fields.
- Metric: 0.5 × token F1 on answer + 0.5 × recall on citation section IDs; zero score if any cited ID is not in retrieved context.
- Compile: MIPROv2 with num_candidates=10 on GPT-4o-mini; 45-minute compile job, ~$12 API spend.
- Result: validation citation recall 71% → 89%; hallucinated section IDs 18% → 4%.
- Deploy: serialized compiled program (demos + instruction) loaded in FastAPI; re-compile monthly when policies change, not on every request.
The win was not a cleverer prompt written by hand — it was a searchable demonstration set and a metric that penalized unsupported citations. Manual prompt edits had optimized prose fluency while the metric needed faithfulness.
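The metric from the worked example can be sketched in plain Python. The weighting and the zero-score rule come from the description above; whitespace tokenization, the function names, and the sample section IDs are assumptions made for illustration. In a DSPy program this logic would sit inside a metric function that receives the example and the prediction.

```python
from collections import Counter

def token_f1(gold, pred):
    """Token-level F1 over lowercased, whitespace-split tokens (an assumed tokenization)."""
    gold_tokens = gold.lower().split()
    pred_tokens = pred.lower().split()
    if not gold_tokens or not pred_tokens:
        return 0.0
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def citation_recall(gold_ids, cited_ids):
    """Fraction of gold section IDs that the answer cited."""
    gold = set(gold_ids)
    if not gold:
        return 1.0
    return len(gold & set(cited_ids)) / len(gold)

def policy_metric(gold_answer, gold_ids, pred_answer, cited_ids, retrieved_ids):
    """0.5 x token F1 + 0.5 x citation recall; zero if any cited ID was not retrieved."""
    if set(cited_ids) - set(retrieved_ids):
        return 0.0
    return 0.5 * token_f1(gold_answer, pred_answer) + 0.5 * citation_recall(gold_ids, cited_ids)

gold = "records are kept for seven years"
supported = policy_metric(gold, ["4.2", "4.3"], gold, ["4.2"], ["4.2", "4.3", "5.1"])
unsupported = policy_metric(gold, ["4.2", "4.3"], gold, ["9.9"], ["4.2", "4.3", "5.1"])
print(supported, unsupported)
```

The hard zero for unsupported citations is what separates this from a fluency metric: a perfectly worded answer that cites a section outside the retrieved context scores nothing.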
Framework decision table
| Need | DSPy | Manual prompts | LangChain | Pydantic AI |
|---|---|---|---|---|
| Optimize prompts from labeled dev set | Native | Manual | Limited | No |
| Typed runtime validation of outputs | Via Signatures | Manual | Parsers | Native Pydantic |
| RAG pipeline composition | Retrieve modules | DIY | Rich | Bring your own |
| Multi-agent role crews | Composable modules | DIY | Patterns | Single agent |
| Durable graph checkpoints | No | No | LangGraph | Limited |
| FastAPI microservice I/O | Good | Good | Good | Excellent |
| Fine-tune when prompts plateau | BootstrapFinetune | N/A | Separate | Separate |
Choose DSPy when you have labeled examples and a metric but prompts are still moving targets. Pair with Pydantic AI at the API boundary for strict response validation after DSPy generation, or with LangGraph when the workflow needs branching state machines beyond a compile-once program.
Common pitfalls
- Tiny trainsets — optimizing on 10 rows memorizes noise; aim for dozens minimum and always validate holdout.
- Metric mismatch — optimizing fluency while production needs citation fidelity wastes compile budget.
- Compile-serve model skew — GPT-4o demonstrations often fail on GPT-4o-mini without re-compile.
- Unbounded compile cost — MIPRO with large candidate pools on big dev sets; cap trials and log spend.
- No version control for compiled programs — serialize demos and instructions; treat compiles like model artifacts.
- Assertions in hot paths — runtime backtracking spikes latency; prefer metric-driven compile over retry loops.
- Ignoring retriever quality — DSPy cannot optimize away a bad index; fix retrieval recall first.
- Skipping human review on failure slices — inspect validation misses; often reveals missing demonstration categories.
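For the version-control pitfall, one option is to derive an artifact version from the content of the compiled configuration, so that any change to demos, instruction, or model yields a new identifier. The sketch below is plain Python; `artifact_manifest` and the payload layout are hypothetical and are not DSPy's serialization format. DSPy modules have their own save and load methods for the program state itself.

```python
import hashlib
import json

def artifact_manifest(instruction, demos, model):
    """Build a manifest whose version is a content hash of the compiled configuration."""
    payload = {"instruction": instruction, "demos": demos, "model": model}
    # Canonical JSON so the same content always hashes to the same version.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"version": digest[:12], "payload": payload}

demos = [{"email_body": "Refund request for invoice 4421", "queue": "billing"}]
first = artifact_manifest("Route the email to a queue.", demos, "openai/gpt-4o-mini")
same = artifact_manifest("Route the email to a queue.", demos, "openai/gpt-4o-mini")
changed = artifact_manifest("Route the email to a queue.", demos, "openai/gpt-4o")
print(first["version"], changed["version"])
```

Pinning the `version` string in deployment config makes compile-serve model skew detectable: a manifest compiled for one model cannot be loaded under another without the mismatch showing up.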
Practitioner checklist
- Define a Signature with Field descriptions for every input and output.
- Build a Module (or compose Predict / ChainOfThought / Retrieve) matching the task.
- Collect labeled dspy.Example rows; mark inputs with .with_inputs().
- Split train and validation sets; never optimize on validation data.
- Implement a metric aligned with production failure cost.
- Start with BootstrapFewShot; escalate to MIPROv2 if needed.
- Configure dspy.LM to the same model family you will deploy.
- Serialize the compiled module; pin artifact version in deployment config.
- Re-compile when policies, models, or retrieval corpora change materially.
- Monitor validation metric in production via sampled human audit.
- Document compile cost and duration per release for finance review.
- Compare against a manual-prompt baseline before claiming DSPy wins.
Key takeaways
- DSPy treats LM pipelines as optimizable programs, not static prompt strings.
- Signatures declare typed I/O; Modules implement prompting strategies the teleprompter tunes.
- Teleprompters search instructions and few-shot demos against your metric on a dev set.
- Best fit: classification, extraction, and RAG Q&A with labeled examples and clear evaluation.
- Compile artifacts, validate holdout, and re-run optimization when models or corpora change.
Related reading
- RAG explained — retrieval pipelines DSPy modules wrap
- LLM evaluation and benchmarking explained — designing metrics that match production harm
- LLM GEPA prompt optimization explained — reflective instruction evolution and Pareto compile
- LLM fine-tuning explained — when BootstrapFinetune beats prompt-only optimization
- Pydantic AI fundamentals explained — typed runtime validation at the API boundary