Guide

LLM few-shot learning explained

Harbor Legal's contract clause extractor asked GPT-4o to return JSON with indemnity caps, liability floors, and governing-law fields. Engineers pasted six “examples” pulled at random from past tickets. Three exemplars used cap_amount as a string; two used integers; one nested caps under a liability object the schema no longer allowed. Parse failures hit 19% on production PDFs, and indemnity-cap F1 stalled at 0.71. Replacing the random set with six curated, schema-identical exemplars — one per edge case (uncapped, tiered cap, mutual indemnity, carve-out list, missing clause, OCR garble) — lifted F1 to 0.89 and cut repair-loop retries by 74%. No weight update, no new training run.

Few-shot learning in LLM applications means teaching a task by showing labeled input-output pairs inside the prompt instead of updating model weights. The model infers the pattern from exemplars at inference time — researchers call this in-context learning. It is the fastest way to prototype classifiers, extractors, and format converters, but it is also easy to waste tokens on noisy examples that teach the wrong habit. This guide covers zero/one/few-shot taxonomy, exemplar selection and ordering, system prompt integration and delimiter patterns, token budget tradeoffs, when few-shot beats fine-tuning or RAG, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist.

Shot-count taxonomy

“Shots” are completed demonstrations: input text plus the exact output you want on similar inputs.

Mode	What you show	Typical use
Zero-shot	Task description only; no labeled pairs	General tasks the base model already knows (summarize, translate common languages)
One-shot	Single input-output pair	Output format anchoring (JSON shape, tone, citation style)
Few-shot	Two to ~12 curated pairs	Domain classifiers, extractors, style matchers, brittle format rules
Many-shot	Dozens to hundreds of pairs in context	Emerging with long-context models; overlaps with retrieval-augmented exemplar banks

More shots are not monotonically better. Duplicate or contradictory exemplars dilute the signal; each extra pair consumes context window that could hold user documents. Start with three to six diverse shots; add only when eval shows a specific failure mode unaddressed.

What makes a good exemplar

Exemplars are training data you ship on every request. Treat curation like a labeled dataset, not copy-paste from logs.

Coverage over volume

Each exemplar should teach one distinct decision boundary: negation, empty input, multi-label output, ambiguous phrasing, OCR noise, edge-case units. Two examples of “happy path” invoices teach less than one happy path plus one credit memo with negative amounts.

Output fidelity

Every shot must match the production schema byte-for-byte: same field names, nesting, null conventions, and enum values. Models mimic surface form before semantics. Mixed true/"true" booleans in shots guarantee mixed booleans in production.

Input realism

Use text that resembles live traffic length, vocabulary, and noise. Synthetic one-liners teach oversimplified patterns that break on real paragraphs. Redact PII but preserve structure.

Dynamic vs static banks

Static sets are versioned, reviewed, and cached — ideal for stable extractors. Dynamic retrieval embeds the user input, pulls the k nearest labeled rows from a store, and injects them as shots. That scales coverage but needs freshness controls so retrieved shots do not contradict the current schema. Hybrid patterns (four static anchor shots plus two retrieved neighbors) often outperform either alone for intent routing.

Prompt structure and ordering

Few-shot blocks live inside the broader context engineering stack: system rules, shots, optional retrieved docs, then the live user turn.

Delimiter patterns

Wrap each shot in explicit markers so the model does not merge example output with instructions:

<example>
<input>…user text…</input>
<output>…canonical answer…</output>
</example>

XML-style tags, Markdown headings (### Example 1), or JSON arrays of {"input","output"} objects all work if used consistently. Never rely on bare Q/A lines without separators — models confuse the last shot's answer with the task preamble.

Ordering effects

Models exhibit primacy and recency bias (see lost in the middle). Place the hardest or most format-critical exemplar first and repeat the output schema reminder after the shot block. For classification, shuffle shot order across requests in eval to detect ordering luck; fix prompts that only work with one permutation.

Separate instructions from demonstrations

System prompt: role, safety, global rules, output schema. Few-shot block: only pairs. User message: only the live input. Mixing rules inside <example> tags teaches the model to ignore the system layer.

Token budget and latency

Few-shot cost is linear in shot tokens on every request unless you use prompt caching for static prefixes. Estimate:

Shot block size — sum tokens of all exemplars plus delimiters; target under 15% of window for RAG-heavy apps.
Cache hit rate — identical shot prefixes should hit provider cache; version shots in the cache key.
Truncation policy — if user docs compete with shots, drop retrieved neighbors before dropping anchor exemplars.

Log (shot_version, shot_token_count, parse_success, task_metric) per request. When shot tokens exceed 2,000 and accuracy plateaus, you are in fine-tune territory.

Harbor Legal clause extractor refactor

Harbor's v3 prompt ships six static shots in a cached prefix:

Uncapped mutual indemnity — returns null cap with explicit uncapped: true flag.
Tiered cap — general aggregate plus per-claim sub-cap.
Carve-out list — cap applies except fraud, willful misconduct, IP.
Missing clause — all fields null, confidence low.
OCR garble — partial extraction with needs_review.
Governing law only — single-field extraction without hallucinated caps.

A shot_version: 2026-06-03 tag sits in logs. Outputs pass through schema validation; failures trigger one repair turn with the validator error, not a full re-prompt. Dynamic retrieval adds two neighbor clauses from an approved exemplar index when confidence from the static set is borderline. Weekly legal review approves shot changes; anything else is blocked at CI.

Technique decision table

Scenario	Preferred approach	Avoid
Prototype new extractor or classifier	4–8 curated few-shot shots + structured output	Immediate LoRA run on 40 examples
High-volume stable task, tight latency	Fine-tuned small model or rules; drop shots from hot path	12-shot GPT-4 call per row
Rapidly shifting label schema	Versioned few-shot set + prompt registry	Stale fine-tune weights
Long-tail entity variants	Static anchors + retrieved dynamic shots	100-shot static prompt
Knowledge-heavy Q&A with citations	RAG corpus; zero-shot or one-shot format only	Shots containing factual paragraphs that go stale
Safety-critical classification	Supervised classifier + few-shot only in abstain band	Prompt-only policy with random log shots

Common pitfalls

Random log sampling — production errors become tomorrow's exemplars; curate intentionally.
Schema drift in shots — one old field name teaches persistent JSON bugs.
Label leakage — shots that include the answer to the live input (duplicate ticket IDs).
Contradictory demonstrations — two shots map the same input pattern to different labels.
Overfitting to shots — model quotes exemplar text instead of analyzing user input; diversify phrasing.
Ignoring eval splits — tuning shots on the test set you report; hold out fresh rows monthly.
No versioning — prompt changes untracked; impossible to bisect regressions.

Production checklist

Define shot-count budget and max tokens for the exemplar block.
Curate 4–8 shots covering distinct edge cases; no duplicates.
Every shot output validates against the production JSON schema in CI.
Use explicit delimiters; separate system rules from demonstrations.
Version the shot set (shot_version) in logs and prompt registry.
Enable prompt caching for static shot prefixes where the provider supports it.
Eval with shuffled shot order; accuracy variance < 2% across permutations.
Hold out a labeled set never used during shot selection.
Revisit shots when task metric drops or schema changes — not on every model upgrade.
Document when to graduate to fine-tuning (volume, latency, or plateaued few-shot F1).

Key takeaways

Few-shot learning teaches tasks via in-prompt exemplars — no weight update required.
Exemplar quality and schema consistency matter more than shot count.
Delimiter structure and ordering reduce format drift and lost-in-the-middle failures.
Harbor Legal lifted indemnity-cap F1 from 0.71 to 0.89 by replacing random shots with six curated edge cases.
Graduate to fine-tuning or classifiers when shot tokens, volume, or latency outgrow prompt-only economics.