DSPy fundamentals explained

Most teams ship LLM features by hand-tuning prompts in a notebook, then watching quality collapse when the model vendor ships an update. DSPy (Declarative Self-improving Python) from Stanford’s NLP group reframes the problem: you write programs composed of typed Signatures and reusable Modules, then let teleprompters (optimizers) search for better prompts, few-shot examples, and even weights — against a metric you define on a dev set. Where LangChain generalizes chains and Pydantic AI enforces typed agent I/O at runtime, DSPy optimizes the prompting strategy itself as a compile step. It pairs naturally with RAG pipelines and sits upstream of fine-tuning when labeled data is scarce but evaluation examples exist. This guide covers Signatures, Modules, LM configuration, teleprompters, metrics and assertions, a Harbor Analytics policy Q&A optimizer worked example, a framework decision table, common pitfalls, and a practitioner checklist.

What DSPy is (and is not)

DSPy is a Python framework for building and optimizing LM pipelines. You declare what a step should do (inputs and outputs) without hard-coding how to phrase the prompt. At compile time, a teleprompter runs your program on training examples, proposes candidate prompts and demonstrations, scores them with your metric, and writes the winning configuration back into the module.
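The compile step described above can be pictured as a small search loop. The following is a toy sketch under stated assumptions, not DSPy's implementation: `fake_lm` is a hypothetical stand-in for a model call, and the candidate instructions and two-row trainset are invented for illustration.

```python
# Hypothetical stand-in for a language model: it only "answers" sensibly
# when the instruction mentions the word "queue". Not a real LM call.
def fake_lm(instruction: str, email: str) -> str:
    if "queue" in instruction:
        return "billing" if "invoice" in email.lower() else "dispatch"
    return "unknown"

def exact_match(gold: str, predicted: str) -> bool:
    return gold == predicted

def compile_best_instruction(candidates, trainset, metric):
    """Score every candidate instruction on the trainset and keep the best."""
    best, best_score = candidates[0], -1.0
    for instruction in candidates:
        hits = sum(metric(gold, fake_lm(instruction, email)) for email, gold in trainset)
        score = hits / len(trainset)
        if score > best_score:
            best, best_score = instruction, score
    return best, best_score

trainset = [
    ("Refund request for invoice 4421", "billing"),
    ("Truck 12 is late to the depot", "dispatch"),
]
candidates = ["Classify the email.", "Pick the support queue for this email."]
winner, score = compile_best_instruction(candidates, trainset, exact_match)
print(winner, score)
```

Real teleprompters search a much larger space (instructions plus demonstrations), but the shape is the same: propose, score with the metric, keep the winner.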

It is not a chat UI, a vector database, or a model-hosting platform. It does not replace production serving infrastructure. Complex multi-day agent graphs with human interrupts still belong in LangGraph. Reach for DSPy when you have a repeatable task (classification, extraction, short reasoning, RAG answers) and a dev set with a clear metric — and you are tired of brittle prompt strings that nobody can reproduce.

Core primitives

Signature — declarative input/output field schema, often as a string like "question -> answer" or a subclass of dspy.Signature.
Module — composable building block (Predict, ChainOfThought, ReAct, Retrieve) that implements a Signature.
Example — labeled dspy.Example; .with_inputs() marks which fields are inputs, and the remaining fields are treated as labels that the program never sees at inference.
LM — configured language model via dspy.LM or dspy.configure(lm=...).
Teleprompter — optimizer (BootstrapFewShot, MIPROv2, COPRO) that compiles prompts from data.
Metric — Python function scoring prediction vs gold label; drives optimization.

Signatures: declare the contract

A Signature is the typed boundary between your application and the model. Inline form is concise for prototypes:

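import dspy

# Configure a model once before calling any module (needs a provider API key).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Inline signature: one input field, two output fields.
classify = dspy.Predict("email -> category, confidence")

result = classify(email="Refund request for invoice 4421")
print(result.category, result.confidence)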

For production, subclass dspy.Signature and add field descriptions — they become part of the compiled prompt and materially affect optimization quality:

class RouteTicket(dspy.Signature):
    """Route Harbor support email to the correct queue."""
    email_body: str = dspy.InputField(desc="Full plaintext email")
    queue: str = dspy.OutputField(desc="billing, dispatch, or compliance")
    reason: str = dspy.OutputField(desc="One sentence justification")

Signatures are composable: a RAG pipeline might chain "context, question -> search_query" with "context, question -> answer". Keep output fields minimal; every extra field is another surface for the optimizer and the metric to disagree on.
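To make the inline form concrete, here is a rough sketch of what a string such as "email -> category, confidence" encodes. `parse_signature` is a hypothetical helper written for this guide, not DSPy's own parser, and it ignores type annotations that the real inline syntax may carry.

```python
def parse_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c, d' into (input field names, output field names)."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return inputs, outputs

inputs, outputs = parse_signature("email -> category, confidence")
print(inputs, outputs)
```

Counting the output list is a quick way to notice when a signature has grown more fields than the metric actually checks.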

Modules: Predict, ChainOfThought, and beyond

Modules wrap Signatures with prompting strategies. The optimizer tunes the strategy, not just the instructions.

Common modules

Predict — direct input-to-output mapping; fastest and cheapest when the task is shallow.
ChainOfThought — adds a reasoning field before the answer; use when intermediate steps improve accuracy on math, policy, or multi-hop questions.
ReAct — interleaves tool calls and reasoning; pairs with dspy.Tool definitions for search or calculators.
Retrieve — fetches passages from a retriever you attach; standard building block for RAG programs.
Program / Module subclass — custom forward() methods composing multiple steps with shared state.

A minimal RAG program might look like:

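class PolicyQA(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module state before adding sub-modules
        # Retrieve uses whichever retriever was set via dspy.configure(rm=...).
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Fetch the top-k passages, then answer with them as context.
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)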

The retriever (ColBERT, BM25, embedding search) stays pluggable; DSPy optimizes how the generator uses the retrieved context string, including which few-shot demonstrations teach faithful citation behavior.

Configuring language models

DSPy 2.5 and later use dspy.LM with provider strings such as dspy.LM('openai/gpt-4o-mini') or dspy.LM('anthropic/claude-3-5-sonnet-20241022'). Call dspy.configure(lm=lm) once at process start. For local models, point at an OpenAI-compatible endpoint (Ollama, vLLM) with the api_base and api_key parameters.

Optimization is model-specific: compile against the same model family you deploy, or re-run teleprompters after upgrades. A common pattern is to optimize on a capable teacher model, then evaluate the distilled prompts on a cheaper student, but always verify the metric on the student before shipping. Log token usage during compilation; BootstrapFewShot can burn thousands of calls if your dev set and candidate pool are large.
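One way to keep compile spend visible is a small budget tracker that you feed with token counts from each call. This is a sketch under stated assumptions: `CompileBudget` is a hypothetical helper, the per-token prices are placeholders rather than any provider's actual rates, and how you obtain token counts depends on your client.

```python
from dataclasses import dataclass

@dataclass
class CompileBudget:
    """Accumulates token usage across compile-time LM calls."""
    max_usd: float
    usd_per_1k_input: float   # placeholder price, check your provider
    usd_per_1k_output: float  # placeholder price, check your provider
    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def spent_usd(self) -> float:
        return (self.input_tokens / 1000 * self.usd_per_1k_input
                + self.output_tokens / 1000 * self.usd_per_1k_output)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Log one call and stop the job once the cap is exceeded."""
        self.calls += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        if self.spent_usd > self.max_usd:
            raise RuntimeError(
                f"compile budget exceeded after {self.calls} calls: ${self.spent_usd:.2f}"
            )

budget = CompileBudget(max_usd=1.00, usd_per_1k_input=0.001, usd_per_1k_output=0.002)
for _ in range(100):  # simulate 100 compile-time calls
    budget.record(input_tokens=2000, output_tokens=500)
print(budget.calls, round(budget.spent_usd, 2))
```

Raising on overrun is a deliberate choice here: a compile job that silently keeps going is harder to explain afterwards than one that stops at the cap.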

Teleprompters: compile prompts from data

Teleprompters are DSPy’s optimizers. You pass a student module, a metric, and training examples; they return a compiled module with tuned instructions and demonstrations.

Optimizer selection

BootstrapFewShot — generates and filters few-shot examples; fast baseline when you have 20–200 labeled examples.
BootstrapFewShotWithRandomSearch — explores multiple demonstration sets; better quality, higher compile cost.
MIPROv2 — joint instruction and demonstration search; strong default for medium dev sets when budget allows.
COPRO — coordinate ascent over instructions; useful when examples are scarce but you can iterate on phrasing.
BootstrapFinetune — collects traces and fine-tunes weights; bridges toward full fine-tuning when API-only prompts plateau.

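import dspy
from dspy.teleprompt import BootstrapFewShot

def exact_match(example, pred, trace=None):
    # Case-insensitive comparison of the gold and predicted queue labels.
    return example.queue.strip().lower() == pred.queue.strip().lower()

# train_examples: a list of dspy.Example rows with inputs marked via .with_inputs("email_body").
teleprompter = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_router = teleprompter.compile(student=dspy.Predict(RouteTicket), trainset=train_examples)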
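The "generate and filter" idea behind bootstrapped demonstrations can be illustrated without a model. The sketch below is a simplification, not the library's implementation: `keyword_teacher` is a hypothetical rule standing in for an LM-backed module, and the three training rows are invented.

```python
def bootstrap_demos(teacher, trainset, metric, max_demos=4):
    """Keep teacher outputs that pass the metric, up to max_demos."""
    demos = []
    for example in trainset:
        prediction = teacher(example["email_body"])
        if metric(example, prediction):
            demos.append({"email_body": example["email_body"], **prediction})
        if len(demos) >= max_demos:
            break
    return demos

# Hypothetical teacher: a keyword rule standing in for an LM-backed module.
def keyword_teacher(email_body):
    queue = "billing" if "invoice" in email_body.lower() else "dispatch"
    return {"queue": queue, "reason": f"matched keyword rule for {queue}"}

def exact_match(example, pred, trace=None):
    return example["queue"].lower() == pred["queue"].lower()

trainset = [
    {"email_body": "Invoice 4421 was charged twice", "queue": "billing"},
    {"email_body": "Driver missed the pickup window", "queue": "dispatch"},
    {"email_body": "Need the audit log for March", "queue": "compliance"},
]
demos = bootstrap_demos(keyword_teacher, trainset, exact_match)
print(len(demos), [d["queue"] for d in demos])
```

Note that the compliance row never becomes a demonstration because the teacher gets it wrong. A teacher that is weak on one category yields a demo set that underrepresents it.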

Hold out a validation set never seen during compile. Overfitting demonstrations to 50 training rows is the most common DSPy failure mode — your metric on train hits 95% while production emails from a new depot cluster fail silently.
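A minimal sketch of the holdout discipline, using only the standard library. The 20% validation fraction and the 0.10 gap tolerance are illustrative assumptions, and `split_examples` and `overfit_gap` are hypothetical helper names.

```python
import random

def split_examples(examples, val_fraction=0.2, seed=0):
    """Shuffle with a fixed seed, then cut off a validation slice."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def overfit_gap(train_score, val_score, tolerance=0.10):
    """Return True when train beats validation by more than the tolerance."""
    return (train_score - val_score) > tolerance

examples = [f"email-{i}" for i in range(50)]
train, val = split_examples(examples)
print(len(train), len(val))
print(overfit_gap(train_score=0.95, val_score=0.70))
```

Fixing the seed keeps the split reproducible across compile runs, so a score change reflects the program and not a reshuffled holdout.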

Metrics, evaluation, and assertions

Your metric is the objective function. Design it to match business harm, not leaderboard aesthetics:

Classification — exact match, weighted F1, or cost-sensitive scores (false compliance routing > false billing).
Generation — token F1, ROUGE, or LLM-as-judge with a rubric; see LLM evaluation for caveats on judge bias.
RAG — answer correctness conditioned on citation overlap with gold spans.
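As an illustration of a cost-sensitive classification score, the sketch below follows the (example, pred, trace=None) metric shape used earlier. The cost table is a made-up assumption reflecting one reading of "false compliance routing > false billing" (misrouting a compliance email is the worst error), and SimpleNamespace objects stand in for DSPy examples and predictions.

```python
from types import SimpleNamespace

# Hypothetical misrouting costs: (gold queue, predicted queue) -> penalty in [0, 1].
# Sending a compliance email elsewhere is treated as the worst mistake.
MISROUTE_COST = {
    ("compliance", "billing"): 1.0,
    ("compliance", "dispatch"): 1.0,
    ("billing", "compliance"): 0.3,
    ("dispatch", "compliance"): 0.3,
}
DEFAULT_COST = 0.6  # any other wrong queue

def cost_sensitive_score(example, pred, trace=None):
    """Return 1.0 for a correct queue, otherwise 1 minus the misrouting cost."""
    gold, guess = example.queue.strip().lower(), pred.queue.strip().lower()
    if gold == guess:
        return 1.0
    return 1.0 - MISROUTE_COST.get((gold, guess), DEFAULT_COST)

gold = SimpleNamespace(queue="compliance")
print(cost_sensitive_score(gold, SimpleNamespace(queue="Compliance")))
print(cost_sensitive_score(gold, SimpleNamespace(queue="billing")))
```

The costs should come from whoever owns the downstream harm (audit, finance, operations), not from the person tuning the optimizer.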

dspy.Assert and dspy.Suggest add runtime constraints during optimization (e.g. “answer must contain a policy section ID”). Assertions backtrack and retry within a trace; use them sparingly in production inference because retries multiply latency. Prefer baking hard constraints into the metric when possible. Newer DSPy releases move this role to dspy.Refine and dspy.BestOfN, so check which constructs your installed version supports.
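The latency cost of retry-style constraints is easy to see in a plain-Python sketch. This is not how DSPy implements assertions. `generate_with_retries` and `fake_generate` are hypothetical, the section-ID format (such as "SEC-4.2") is an assumed convention, and the sample answer text is invented.

```python
import re

# Assumed ID format for illustration, e.g. "SEC-4.2".
SECTION_ID = re.compile(r"\b[A-Z]{2,}-\d+(\.\d+)*\b")

def generate_with_retries(generate, question, max_retries=2):
    """Call generate until the answer contains a section ID or retries run out.

    Returns (answer, calls_made). Each retry is one more model call, so
    worst-case latency grows linearly with max_retries.
    """
    feedback = None
    answer = ""
    for attempt in range(1 + max_retries):
        answer = generate(question, feedback)
        if SECTION_ID.search(answer):
            return answer, attempt + 1
        feedback = "Answer must contain a policy section ID."
    return answer, 1 + max_retries

# Hypothetical generator: only cites a section once it has seen the feedback.
def fake_generate(question, feedback):
    if feedback is None:
        return "Records are kept for seven years."
    return "Records are kept for seven years (SEC-4.2)."

answer, calls = generate_with_retries(fake_generate, "How long are records kept?")
print(calls, answer)
```

Here a single failed check doubles the number of model calls for the request, which is why constraints that can be enforced at compile time through the metric are cheaper to serve.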

Worked example: Harbor Analytics policy Q&A optimizer

Harbor Analytics maintains 120 internal compliance policies as markdown files. A naive RAG chatbot hallucinated section numbers on 18% of audit-trail questions. The team rebuilt the flow in DSPy:

Data — 85 dspy.Example rows from past auditor queries with gold answers and mandatory citation spans; 20 held out for validation.
Program — PolicyQA module with Retrieve(k=8) over a ColBERTv2 index and ChainOfThought generator Signature requiring answer and citations fields.
Metric — 0.5 × token F1 on answer + 0.5 × recall on citation section IDs; zero score if any cited ID is not in retrieved context.
Compile — MIPROv2 with num_candidates=10 on GPT-4o-mini; 45-minute compile job, ~$12 API spend.
Result — validation citation recall 71% → 89%; hallucinated section IDs 18% → 4%.
Deploy — serialized compiled program (demos + instruction) loaded in FastAPI; re-compile monthly when policies change, not on every request.

The win was not a cleverer prompt written by hand — it was a searchable demonstration set and a metric that penalized unsupported citations. Manual prompt edits had optimized prose fluency while the metric needed faithfulness.
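The metric from the worked example can be sketched as follows. The weighting and the zero-score gate come from the description above; everything else is an assumption for illustration, including the `retrieved_ids` field name, whitespace tokenization for F1, the sample section IDs, and the use of SimpleNamespace in place of DSPy objects.

```python
from collections import Counter
from types import SimpleNamespace

def token_f1(gold: str, predicted: str) -> float:
    """Bag-of-words F1 over lowercased whitespace tokens."""
    gold_tokens, pred_tokens = gold.lower().split(), predicted.lower().split()
    if not gold_tokens or not pred_tokens:
        return 0.0
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def policy_qa_metric(example, pred, trace=None):
    """0.5 * answer token F1 + 0.5 * citation recall, zeroed on unsupported citations."""
    cited, gold_ids = set(pred.citations), set(example.citations)
    # Hard gate: any cited section ID missing from the retrieved context scores zero.
    if not cited <= set(pred.retrieved_ids):
        return 0.0
    recall = len(cited & gold_ids) / len(gold_ids) if gold_ids else 1.0
    return 0.5 * token_f1(example.answer, pred.answer) + 0.5 * recall

example = SimpleNamespace(answer="retain logs for seven years", citations=["4.2", "4.3"])
good = SimpleNamespace(answer="retain logs for seven years", citations=["4.2"],
                       retrieved_ids=["4.2", "4.3", "9.1"])
bad = SimpleNamespace(answer="retain logs for seven years", citations=["7.7"],
                      retrieved_ids=["4.2", "4.3"])
print(policy_qa_metric(example, good), policy_qa_metric(example, bad))
```

The second prediction has a perfect answer string but still scores zero, which is the behavior that pushes the optimizer toward faithful citations over fluent prose.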

Framework decision table

Need	DSPy	Manual prompts	LangChain	Pydantic AI
Optimize prompts from labeled dev set	Native	Manual	Limited	No
Typed runtime validation of outputs	Via Signatures	Manual	Parsers	Native Pydantic
RAG pipeline composition	Retrieve modules	DIY	Rich	Bring your own
Multi-agent role crews	Composable modules	DIY	Patterns	Single agent
Durable graph checkpoints	No	No	LangGraph	Limited
FastAPI microservice I/O	Good	Good	Good	Excellent
Fine-tune when prompts plateau	BootstrapFinetune	N/A	Separate	Separate

Choose DSPy when you have labeled examples and a metric but prompts are still moving targets. Pair with Pydantic AI at the API boundary for strict response validation after DSPy generation, or with LangGraph when the workflow needs branching state machines beyond a compile-once program.

Common pitfalls

Tiny trainsets — optimizing on 10 rows memorizes noise; aim for dozens of examples at minimum and always validate on a holdout set.
Metric mismatch — optimizing fluency while production needs citation fidelity wastes compile budget.
Compile-serve model skew — demonstrations compiled on GPT-4o often underperform on GPT-4o-mini unless you re-compile for the serving model.
Unbounded compile cost — MIPROv2 with large candidate pools on big dev sets gets expensive quickly; cap trials and log spend.
No version control for compiled programs — serialize demos and instructions; treat compiles like model artifacts.
Assertions in hot paths — runtime backtracking spikes latency; prefer metric-driven compile over retry loops.
Ignoring retriever quality — DSPy cannot optimize away a bad index; fix retrieval recall first.
Skipping human review on failure slices — inspect validation misses; often reveals missing demonstration categories.
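For the versioning pitfall, one possible approach is to wrap the compiled state in a JSON artifact with a content hash. This is a sketch, assuming you can extract the instruction and demos from your compiled program; `build_artifact` and `verify_artifact` are hypothetical helpers, and DSPy modules also have their own save and load methods that you may prefer.

```python
import hashlib
import json

def _digest(payload: dict) -> str:
    """Short SHA-256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def build_artifact(instruction: str, demos: list, model: str) -> dict:
    """Bundle compiled state with a content hash that serves as its version."""
    payload = {"model": model, "instruction": instruction, "demos": demos}
    return {"version": _digest(payload), **payload}

def verify_artifact(artifact: dict) -> bool:
    """Recompute the hash to detect edits made after compile time."""
    payload = {k: v for k, v in artifact.items() if k != "version"}
    return _digest(payload) == artifact["version"]

artifact = build_artifact(
    instruction="Route the email to billing, dispatch, or compliance.",
    demos=[{"email_body": "Invoice 4421 was charged twice", "queue": "billing"}],
    model="openai/gpt-4o-mini",
)
print(artifact["version"], verify_artifact(artifact))
```

Pinning the version string in deployment config makes it possible to tell which compile produced a given production answer.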

Practitioner checklist

Define a Signature with field descriptions (desc=...) for every input and output.
Build a Module (or compose Predict / ChainOfThought / Retrieve) matching the task.
Collect labeled dspy.Example rows; mark inputs with .with_inputs().
Split train and validation sets; never optimize on validation data.
Implement a metric aligned with production failure cost.
Start with BootstrapFewShot; escalate to MIPROv2 if needed.
Configure dspy.LM to the same model family you will deploy.
Serialize the compiled module; pin artifact version in deployment config.
Re-compile when policies, models, or retrieval corpora change materially.
Monitor the same metric in production on sampled traffic, backed by periodic human audit.
Document compile cost and duration per release for finance review.
Compare against a manual-prompt baseline before claiming DSPy wins.

Key takeaways

DSPy treats LM pipelines as optimizable programs, not static prompt strings.
Signatures declare typed I/O; Modules implement prompting strategies the teleprompter tunes.
Teleprompters search instructions and few-shot demos against your metric on a dev set.
Best fit: classification, extraction, and RAG Q&A with labeled examples and clear evaluation.
Version compiled artifacts, validate on a holdout set, and re-run optimization when models or corpora change.