Guide
LLM few-shot learning explained
Harbor Legal's contract clause extractor asked GPT-4o to return JSON with
indemnity caps, liability floors, and governing-law fields. Engineers pasted six
“examples” pulled at random from past tickets. Three exemplars used
cap_amount as a string; two used integers; one nested caps under a
liability object the schema no longer allowed. Parse failures hit
19% on production PDFs, and indemnity-cap F1 stalled at 0.71. Replacing the random
set with six curated, schema-identical exemplars — one per edge case
(uncapped, tiered cap, mutual indemnity, carve-out list, missing clause, OCR
garble) — lifted F1 to 0.89 and cut repair-loop retries by 74%. No weight
update, no new training run.
Few-shot learning in LLM applications means teaching a task by showing labeled input-output pairs inside the prompt instead of updating model weights. The model infers the pattern from exemplars at inference time — researchers call this in-context learning. It is the fastest way to prototype classifiers, extractors, and format converters, but it is also easy to waste tokens on noisy examples that teach the wrong habit. This guide covers zero/one/few-shot taxonomy, exemplar selection and ordering, system prompt integration and delimiter patterns, token budget tradeoffs, when few-shot beats fine-tuning or RAG, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist.
Shot-count taxonomy
“Shots” are completed demonstrations: input text plus the exact output you want on similar inputs.
| Mode | What you show | Typical use |
|---|---|---|
| Zero-shot | Task description only; no labeled pairs | General tasks the base model already knows (summarize, translate common languages) |
| One-shot | Single input-output pair | Output format anchoring (JSON shape, tone, citation style) |
| Few-shot | Two to ~12 curated pairs | Domain classifiers, extractors, style matchers, brittle format rules |
| Many-shot | Dozens to hundreds of pairs in context | Emerging with long-context models; overlaps with retrieval-augmented exemplar banks |
More shots are not monotonically better. Duplicate or contradictory exemplars dilute the signal; each extra pair consumes context window that could hold user documents. Start with three to six diverse shots; add only when eval shows a specific failure mode unaddressed.
What makes a good exemplar
Exemplars are training data you ship on every request. Treat curation like a labeled dataset, not copy-paste from logs.
Coverage over volume
Each exemplar should teach one distinct decision boundary: negation, empty input, multi-label output, ambiguous phrasing, OCR noise, edge-case units. Two examples of “happy path” invoices teach less than one happy path plus one credit memo with negative amounts.
Output fidelity
Every shot must match the production schema byte-for-byte: same field names,
nesting, null conventions, and enum values. Models mimic surface form before
semantics. Mixed true/"true" booleans in
shots guarantee mixed booleans in production.
Input realism
Use text that resembles live traffic length, vocabulary, and noise. Synthetic one-liners teach oversimplified patterns that break on real paragraphs. Redact PII but preserve structure.
Dynamic vs static banks
Static sets are versioned, reviewed, and cached — ideal for stable extractors. Dynamic retrieval embeds the user input, pulls the k nearest labeled rows from a store, and injects them as shots. That scales coverage but needs freshness controls so retrieved shots do not contradict the current schema. Hybrid patterns (four static anchor shots plus two retrieved neighbors) often outperform either alone for intent routing.
Prompt structure and ordering
Few-shot blocks live inside the broader context engineering stack: system rules, shots, optional retrieved docs, then the live user turn.
Delimiter patterns
Wrap each shot in explicit markers so the model does not merge example output with instructions:
<example>
<input>…user text…</input>
<output>…canonical answer…</output>
</example>
XML-style tags, Markdown headings (### Example 1), or JSON arrays
of {"input","output"} objects all work if used
consistently. Never rely on bare Q/A lines without separators — models
confuse the last shot's answer with the task preamble.
Ordering effects
Models exhibit primacy and recency bias (see lost in the middle). Place the hardest or most format-critical exemplar first and repeat the output schema reminder after the shot block. For classification, shuffle shot order across requests in eval to detect ordering luck; fix prompts that only work with one permutation.
Separate instructions from demonstrations
System prompt: role, safety, global rules, output schema. Few-shot block: only
pairs. User message: only the live input. Mixing rules inside
<example> tags teaches the model to ignore the system layer.
Token budget and latency
Few-shot cost is linear in shot tokens on every request unless you use prompt caching for static prefixes. Estimate:
- Shot block size — sum tokens of all exemplars plus delimiters; target under 15% of window for RAG-heavy apps.
- Cache hit rate — identical shot prefixes should hit provider cache; version shots in the cache key.
- Truncation policy — if user docs compete with shots, drop retrieved neighbors before dropping anchor exemplars.
Log (shot_version, shot_token_count, parse_success, task_metric)
per request. When shot tokens exceed 2,000 and accuracy plateaus, you are in
fine-tune territory.
Harbor Legal clause extractor refactor
Harbor's v3 prompt ships six static shots in a cached prefix:
- Uncapped mutual indemnity — returns
nullcap with explicituncapped: trueflag. - Tiered cap — general aggregate plus per-claim sub-cap.
- Carve-out list — cap applies except fraud, willful misconduct, IP.
- Missing clause — all fields
null, confidence low. - OCR garble — partial extraction with
needs_review. - Governing law only — single-field extraction without hallucinated caps.
A shot_version: 2026-06-03 tag sits in logs. Outputs pass through
schema
validation; failures trigger one repair turn with the validator error, not a
full re-prompt. Dynamic retrieval adds two neighbor clauses from an approved
exemplar index when confidence from the static set is borderline. Weekly legal
review approves shot changes; anything else is blocked at CI.
Technique decision table
| Scenario | Preferred approach | Avoid |
|---|---|---|
| Prototype new extractor or classifier | 4–8 curated few-shot shots + structured output | Immediate LoRA run on 40 examples |
| High-volume stable task, tight latency | Fine-tuned small model or rules; drop shots from hot path | 12-shot GPT-4 call per row |
| Rapidly shifting label schema | Versioned few-shot set + prompt registry | Stale fine-tune weights |
| Long-tail entity variants | Static anchors + retrieved dynamic shots | 100-shot static prompt |
| Knowledge-heavy Q&A with citations | RAG corpus; zero-shot or one-shot format only | Shots containing factual paragraphs that go stale |
| Safety-critical classification | Supervised classifier + few-shot only in abstain band | Prompt-only policy with random log shots |
Common pitfalls
- Random log sampling — production errors become tomorrow's exemplars; curate intentionally.
- Schema drift in shots — one old field name teaches persistent JSON bugs.
- Label leakage — shots that include the answer to the live input (duplicate ticket IDs).
- Contradictory demonstrations — two shots map the same input pattern to different labels.
- Overfitting to shots — model quotes exemplar text instead of analyzing user input; diversify phrasing.
- Ignoring eval splits — tuning shots on the test set you report; hold out fresh rows monthly.
- No versioning — prompt changes untracked; impossible to bisect regressions.
Production checklist
- Define shot-count budget and max tokens for the exemplar block.
- Curate 4–8 shots covering distinct edge cases; no duplicates.
- Every shot output validates against the production JSON schema in CI.
- Use explicit delimiters; separate system rules from demonstrations.
- Version the shot set (
shot_version) in logs and prompt registry. - Enable prompt caching for static shot prefixes where the provider supports it.
- Eval with shuffled shot order; accuracy variance < 2% across permutations.
- Hold out a labeled set never used during shot selection.
- Revisit shots when task metric drops or schema changes — not on every model upgrade.
- Document when to graduate to fine-tuning (volume, latency, or plateaued few-shot F1).
Key takeaways
- Few-shot learning teaches tasks via in-prompt exemplars — no weight update required.
- Exemplar quality and schema consistency matter more than shot count.
- Delimiter structure and ordering reduce format drift and lost-in-the-middle failures.
- Harbor Legal lifted indemnity-cap F1 from 0.71 to 0.89 by replacing random shots with six curated edge cases.
- Graduate to fine-tuning or classifiers when shot tokens, volume, or latency outgrow prompt-only economics.
Related reading
- LLM system prompt design explained — modular instructions separate from shots
- LLM fine-tuning vs RAG explained — when to move beyond in-context learning
- Context engineering explained — assembling the full prompt stack
- LLM output parsing and validation explained — enforcing schemas after generation