Few-shot learning explained

Few-shot learning is the discipline of building models that generalize from very small labeled datasets, often just one to ten examples per class. In a 5-way 3-shot task, the model sees three labeled examples from each of five categories (15 in total) and must classify new items from those same categories. This matters because real products rarely start with millions of labels: a new fraud pattern, a rare defect on a factory line, or a support queue that splits into fresh ticket types all arrive with thin evidence. Models trained from scratch tend to overfit badly at that scale; few-shot methods borrow structure from related tasks, pretrained representations, or carefully designed prompts. Modern large language models extend the idea through in-context learning: placing exemplars in the prompt, with no gradient updates. This guide covers notation, metric and meta-learning algorithms, the LLM few-shot workflow, a Harbor support-ticket classifier worked example, a technique decision table, guidance on designing support examples, pitfalls, and a practitioner checklist.

What few-shot learning is — and how it differs from related ideas

Standard supervised training assumes abundant labels: thousands of cat and dog photos, millions of click logs. Few-shot learning explicitly studies the low-label regime where collecting more annotations is slow, expensive, or impossible.

Key terminology:

N-way — the number of classes in the episode (e.g., five ticket categories).
K-shot — K labeled examples per class in the support set.
Support set — the labeled examples shown at inference or inner-loop training time.
Query set — the items the model must classify; during training and evaluation their labels are known but withheld from the model.
Episode — one sampled N-way K-shot task; meta-learners train on thousands of episodes.
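
To make the terminology concrete, here is a minimal sketch of episode sampling. The function name sample_episode, the n_query parameter, and the toy pool are illustrative assumptions rather than part of any particular library.

```python
import random
from collections import defaultdict


def sample_episode(labeled_items, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode from (item, label) pairs.

    Returns (support, query): two lists of (item, label) pairs with no overlap.
    """
    by_class = defaultdict(list)
    for item, label in labeled_items:
        by_class[label].append(item)

    # Only classes with enough items for both support and query are eligible.
    eligible = sorted(
        c for c, items in by_class.items() if len(items) >= k_shot + n_query
    )
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query items")

    support, query = [], []
    for label in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy pool: 6 classes with 10 items each (hypothetical data).
pool = [(f"item-{c}-{i}", f"class-{c}") for c in range(6) for i in range(10)]
support, query = sample_episode(
    pool, n_way=5, k_shot=3, n_query=2, rng=random.Random(0)
)
print(len(support), len(query))  # 15 10
```

Drawing support and query items from a single shuffled sample per class keeps the two sets disjoint by construction, which is the property episode-based evaluation depends on.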

Few-shot learning sits between full supervision and zero labels:

Zero-shot — no labeled examples; rely on class names, descriptions, or pretrained knowledge (common with LLMs).
One-shot — exactly one example per class (a special case of few-shot).
Few-shot — a small handful per class (typically 1–10).
Many-shot / full supervision — enough data for conventional training.

It overlaps with, but is not identical to, transfer learning (reusing a pretrained backbone) and semi-supervised learning (many unlabeled points plus some labels). Few-shot pipelines often combine transfer learning with episode-based training or prompting.

Classical approaches: metric learning and meta-learning

Metric-learning methods

Instead of learning a classifier head with thousands of weights per class, metric methods learn an embedding space where same-class points cluster and different-class points separate. At inference, compare the query embedding to the support embeddings (or to one prototype per class) and pick the class of the nearest match.

Siamese networks — twin encoders trained with contrastive loss so same-class pairs are close and different-class pairs are far.
Prototypical networks — compute each class centroid (prototype) as the mean embedding of its K support examples; classify queries by nearest prototype. Simple, strong baseline for vision and tabular embeddings.
Matching networks — classify a query by an attention-weighted vote over the support labels, with attention computed from embedding similarity; useful when prototypes are noisy.
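
As a rough illustration of the attention-weighted idea, the sketch below scores a query against fixed support embeddings and sums softmax weights per label. It assumes the embeddings already exist (no encoder is trained here), and the temperature value and the name attention_vote are illustrative choices.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def attention_vote(query, support, temperature=0.1):
    """Matching-network-style prediction over fixed embeddings.

    `support` is a list of (embedding, label) pairs. Each support item gets a
    softmax attention weight from its cosine similarity to the query, and the
    weights are summed per label.
    """
    sims = [cosine(query, emb) / temperature for emb, _ in support]
    top = max(sims)
    weights = [math.exp(s - top) for s in sims]  # numerically stable softmax
    total = sum(weights)
    scores = {}
    for w, (_, label) in zip(weights, support):
        scores[label] = scores.get(label, 0.0) + w / total
    return max(scores, key=scores.get), scores


# Toy 2-dimensional embeddings (hypothetical data).
support = [
    ([1.0, 0.1], "A"), ([0.9, 0.2], "A"),
    ([0.1, 1.0], "B"), ([0.2, 0.9], "B"),
]
label, scores = attention_vote([0.8, 0.3], support)
print(label)  # A
```

Because every support item votes individually, one outlier in a class pulls the decision less than it would pull a mean prototype.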

Meta-learning (learning to learn)

Meta-learners optimize for fast adaptation on new tasks:

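MAML (Model-Agnostic Meta-Learning) — find initial weights such that a few gradient steps on a new support set yield good query accuracy. Backpropagating through the inner loop requires second-order gradients and is expensive, but the method generalizes across task families.
Reptile — a lighter first-order alternative to MAML that moves the initial weights toward the weights obtained after a few SGD steps on each sampled task; easier to implement at scale.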
Meta-learning for hyperparameters — learn learning rates, data augmentation policies, or class weighting rules that transfer to new N-way tasks.
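
The Reptile update is small enough to sketch on a toy problem. The example below assumes a deliberately simple task family (fit y = slope * x with a single weight, slopes drawn between 2 and 4); the learning rates, step counts, and function names are illustrative assumptions, not tuned values.

```python
import random


def inner_sgd(w, slope, xs, lr=0.1, steps=5):
    """A few gradient steps on one task: fit y = slope * x with one weight."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - slope * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w


def reptile(meta_iters=1000, meta_lr=0.1, seed=0):
    rng = random.Random(seed)
    xs = [-1.0, -0.5, 0.5, 1.0]
    w_meta = 0.0
    for _ in range(meta_iters):
        slope = rng.uniform(2.0, 4.0)          # sample a task
        w_task = inner_sgd(w_meta, slope, xs)  # adapt to that task
        w_meta += meta_lr * (w_task - w_meta)  # move the init toward it
    return w_meta, xs


w_meta, xs = reptile()

# Adapt to an unseen task from the meta-learned init and from a cold start.
err_meta = abs(inner_sgd(w_meta, 3.5, xs) - 3.5)
err_cold = abs(inner_sgd(0.0, 3.5, xs) - 3.5)
print(round(w_meta, 2), err_meta < err_cold)
```

After meta-training, the initial weight sits near the center of the task family, so the same five inner steps land closer to a new task's optimum than they do from a cold start.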

In practice, start with a strong pretrained encoder (vision transformer, sentence transformer) plus prototypical classification before reaching for full MAML.

Few-shot with large language models: in-context learning

Frontier LLMs exhibit in-context learning (ICL): performance improves when you prepend labeled examples in the prompt, even though no weight updates occur. This is few-shot learning at inference time.

A typical pattern:

Define the task and output format (JSON, single token, yes/no).
Select K diverse, representative examples per class — quality beats quantity.
Order examples carefully; models can be sensitive to recency and label imbalance.
Append the query and parse the completion.
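
The four steps above can be sketched as a prompt builder and a defensive parser. This is a minimal sketch: the prompt layout, the JSON answer format, and the names build_prompt and parse_label are illustrative assumptions, and the call to an actual model is left out because it depends on the provider.

```python
import json


def build_prompt(task, labels, examples, query):
    """Assemble a few-shot classification prompt.

    `examples` is a list of (text, label) pairs in the order they should
    appear; the layout is an illustrative choice, not a required convention.
    """
    lines = [
        task,
        "Allowed labels: " + ", ".join(labels),
        'Answer with JSON like {"label": "<one allowed label>"}.',
        "",
    ]
    for text, label in examples:
        lines += [f"Ticket: {text}", "Answer: " + json.dumps({"label": label}), ""]
    lines += [f"Ticket: {query}", "Answer:"]
    return "\n".join(lines)


def parse_label(completion, labels):
    """Return the predicted label, or None if the completion is malformed."""
    try:
        label = json.loads(completion).get("label")
    except (json.JSONDecodeError, AttributeError):
        return None
    return label if label in labels else None


labels = ["Billing", "Account access"]
examples = [
    ("I was charged twice this month.", "Billing"),
    ("I cannot reset my password.", "Account access"),
]
prompt = build_prompt(
    "Classify the support ticket.", labels, examples, "My invoice total is wrong."
)
print(prompt)
print(parse_label('{"label": "Billing"}', labels))  # Billing
```

Returning None for anything outside the allowed label set gives the caller an explicit signal to retry or escalate instead of acting on a malformed answer.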

Once you have hundreds to thousands of labels and need stable production behavior, ICL competes with parameter-efficient fine-tuning (LoRA, adapters). ICL wins on iteration speed and needs no retraining; fine-tuning wins on cost per token, latency, and consistency at scale. Hybrid pipelines use few-shot prompting to bootstrap labels, then fine-tune once volume grows, a pattern that fits naturally with active learning loops.

Harbor worked example: routing support tickets with five examples per category

Harbor Software runs a B2B help desk. A new product line introduces four ticket types: Billing, Integration API, Data export, and Account access. Only five resolved tickets exist per type (20 labels total). A bag-of-words model trained with full supervision on those labels scores 41% accuracy on a held-out set; the team needs 85%+ before auto-routing goes live.

Step 1 — Embed with a pretrained model. Each ticket body is encoded with a sentence transformer into a 384-dimensional vector. No task-specific training yet; transfer learning supplies semantic structure.

Step 2 — Prototypical classifier. For each type, average the five support embeddings into a prototype. A new ticket is classified by cosine similarity to the nearest prototype. On a 20% holdout, accuracy reaches 78% — a large jump from 41%.
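
A minimal sketch of the prototype step follows. Toy 3-dimensional vectors stand in for the 384-dimensional sentence embeddings, and the helper names (build_prototypes, rank_labels) are illustrative assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_prototypes(support):
    """support: dict mapping label -> list of embedding vectors."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def rank_labels(embedding, prototypes):
    """Return (label, cosine similarity) pairs, best match first."""
    scored = [(label, cosine(embedding, proto)) for label, proto in prototypes.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real pipeline would use the encoder output.
support = {
    "Billing": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Data export": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}
prototypes = build_prototypes(support)
ranking = rank_labels([0.7, 0.3, 0.0], prototypes)
print(ranking[0][0])  # Billing
```

Returning the full ranking rather than only the winner keeps the runner-up score available for confidence checks.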

Step 3 — LLM few-shot fallback for ambiguous cases. Tickets whose top-two prototype scores differ by less than 0.08 are escalated to a GPT-class model with three curated examples per type in the prompt. Combined pipeline: 89% accuracy, routing 62% of volume automatically without human review.
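
The escalation rule in Step 3 reduces to a margin check on the top two scores. The sketch below assumes the classifier returns (label, score) pairs sorted best first; the function name route and the return shape are illustrative.

```python
MARGIN_THRESHOLD = 0.08  # the top-two gap used in the Harbor example


def route(ranking, threshold=MARGIN_THRESHOLD):
    """Decide whether a ticket can be auto-routed.

    `ranking` is a list of (label, score) pairs sorted best first.
    Returns ("auto", label) or ("escalate", None).
    """
    (best_label, best), (_, runner_up) = ranking[0], ranking[1]
    if best - runner_up < threshold:
        return ("escalate", None)
    return ("auto", best_label)


# Hypothetical scores for two tickets.
print(route([("Billing", 0.81), ("Account access", 0.52)]))  # ('auto', 'Billing')
print(route([("Billing", 0.64), ("Account access", 0.60)]))  # ('escalate', None)
```

Keeping the threshold as a parameter makes it easy to tune against a held-out set: raising it sends more tickets to the fallback, lowering it auto-routes more.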

Step 4 — Close the loop. Misrouted tickets reviewed by agents become new support examples; every 50 new labels triggers a LoRA fine-tune on the encoder. Within six weeks Harbor has 200 labels per class and retires the LLM fallback for cost reasons — the few-shot phase bridged the cold-start gap.

Technique decision table

Situation	Recommended approach	Why
1–10 labels per class, structured text or images	Pretrained encoder + prototypical networks	Fast, cheap, no LLM token cost; strong when embeddings align with task
Complex language task, <20 total labels, exploratory	LLM in-context few-shot prompting	No training pipeline; iterate examples in minutes
Many related tasks, need fast on-device adaptation	MAML / Reptile meta-training	Weights primed for few-step fine-tune on new classes
100+ labels per class appearing steadily	LoRA fine-tune or full supervised head	ICL plateaus; dedicated weights beat prompt stuffing
Abundant unlabeled data, scarce labels	Semi-supervised pretrain, then few-shot head	Unlabeled corpus sharpens embeddings before the low-label stage
Classes defined only by names (no examples)	Zero-shot LLM with class descriptions	Few-shot needs at least one exemplar per class

Designing good support examples

Whether you store embeddings or paste text into a prompt, example curation dominates results:

Cover intra-class variance — for “Billing,” include a refund, an invoice mismatch, and a plan upgrade; one narrow example teaches the wrong boundary.
Balance classes — equal K per class in the support set avoids majority-class bias.
Avoid near-duplicates — five paraphrases of the same ticket waste shots; diversity matters more than count.
Match production distribution — support examples drawn from polished internal samples fail on messy, typo-ridden user text.
Keep prompts within context limits — long examples crowd out room for instructions; summarize when needed.
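
One way to enforce the near-duplicate rule is a greedy filter over candidate embeddings. This is a sketch under assumptions: the 0.95 similarity cutoff, the toy vectors, and the name select_diverse are all illustrative.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_diverse(candidates, k, max_similarity=0.95):
    """Greedily keep up to k (text, embedding) pairs, skipping any candidate
    whose embedding is too similar to one that was already kept."""
    kept = []
    for text, emb in candidates:
        if all(cosine(emb, kept_emb) <= max_similarity for _, kept_emb in kept):
            kept.append((text, emb))
        if len(kept) == k:
            break
    return [text for text, _ in kept]


# Hypothetical Billing candidates with toy embeddings.
candidates = [
    ("Refund for duplicate charge", [1.0, 0.0, 0.1]),
    ("Please refund my double charge", [0.99, 0.01, 0.1]),  # near-duplicate
    ("Invoice total does not match quote", [0.2, 1.0, 0.0]),
    ("Upgrade from Starter to Pro plan", [0.1, 0.2, 1.0]),
]
chosen = select_diverse(candidates, k=3)
print(chosen)
```

With these toy vectors the second refund ticket is dropped, leaving one refund, one invoice mismatch, and one plan upgrade as the three shots.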

Common pitfalls

Support-set leakage in evaluation — testing on the same K examples you used as support inflates metrics; hold out separate queries per episode.
Ignoring base-rate priors — if 90% of tickets are Billing, nearest-prototype classifiers need calibrated thresholds or cost-sensitive rules.
Assuming ICL is deterministic — temperature, example order, and phrasing shift LLM few-shot outputs; log and version prompts.
Skipping confidence estimation — low margin between top classes should trigger human review, not auto-action.
Never graduating past few-shot — continuing to pay per-token ICL after thousands of labels accumulate is a budget leak; plan the fine-tune milestone.
Weak embeddings for the domain — general sentence models miss jargon; domain-adaptive pretraining or glossary injection helps.
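
For the base-rate pitfall, one simple option is to add a log prior to a scaled similarity score. This is only one possible rule and rests on an assumption: that similarity divided by a temperature behaves roughly like a log-likelihood. The temperature and the prior values below are hypothetical and would need calibration on real data.

```python
import math


def adjust_for_priors(similarities, priors, temperature=0.1):
    """Combine prototype similarities with class base rates.

    Treats similarity / temperature as a log-likelihood (an assumption) and
    adds the log prior of each class.
    """
    return {
        label: sim / temperature + math.log(priors[label])
        for label, sim in similarities.items()
    }


# Hypothetical base rates: 90% of traffic is Billing.
priors = {
    "Billing": 0.90, "Integration API": 0.04,
    "Data export": 0.03, "Account access": 0.03,
}

near_tie = {"Billing": 0.60, "Integration API": 0.62,
            "Data export": 0.20, "Account access": 0.10}
clear_case = {"Billing": 0.40, "Integration API": 0.95,
              "Data export": 0.20, "Account access": 0.10}

adjusted_tie = adjust_for_priors(near_tie, priors)
adjusted_clear = adjust_for_priors(clear_case, priors)
print(max(adjusted_tie, key=adjusted_tie.get))      # Billing
print(max(adjusted_clear, key=adjusted_clear.get))  # Integration API
```

The prior only tips decisions that were close to begin with; a query that sits clearly nearest a rare class still goes to that class.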

Practitioner checklist

Define N classes and minimum acceptable precision/recall per class before choosing a method.
Split data into support and query sets with no overlap; use k-fold episode evaluation for small pools.
Baseline with zero-shot (class names only) to quantify the value of each added shot.
Try prototypical networks on pretrained embeddings before complex meta-learning.
For LLM few-shot, version prompts in git and A/B example sets on a fixed eval harness.
Monitor margin scores and route low-confidence predictions to humans or a larger model.
Track label accumulation; schedule fine-tuning when per-class count crosses your threshold.
Document failure modes per class — they guide the next examples you add.
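
The no-overlap split and the k-fold episode evaluation from the checklist can be sketched as follows. The fold assignment by index, the toy embeddings, and the name kfold_episode_accuracy are illustrative assumptions; a nearest-prototype classifier is used as the model under test.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kfold_episode_accuracy(data, k):
    """Evaluate a nearest-prototype classifier with k disjoint folds.

    `data` maps label -> list of embeddings (at least 2 per class, and at
    least k per class so every fold has queries). In fold i, items whose
    index % k == i are queries; the rest form the support set.
    """
    correct = total = 0
    for fold in range(k):
        prototypes, queries = {}, []
        for label, vecs in data.items():
            support = [v for i, v in enumerate(vecs) if i % k != fold]
            prototypes[label] = mean_vector(support)
            queries += [(v, label) for i, v in enumerate(vecs) if i % k == fold]
        for vec, label in queries:
            predicted = max(prototypes, key=lambda c: cosine(vec, prototypes[c]))
            correct += predicted == label
            total += 1
    return correct / total


# Hypothetical, well-separated toy embeddings: 4 items per class.
data = {
    "A": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.8, 0.1]],
    "B": [[0.1, 1.0], [0.2, 0.9], [0.0, 1.0], [0.1, 0.8]],
}
accuracy = kfold_episode_accuracy(data, k=4)
print(accuracy)  # 1.0
```

Every item serves as a query exactly once and is never part of the prototype it is scored against, which makes the estimate usable even with a handful of labels per class.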

Key takeaways

Few-shot learning targets regimes with only a handful of labels per class, formalized as N-way K-shot episodes.
Metric methods like prototypical networks embed support items and classify by nearest prototype — strong with good encoders.
Meta-learning (MAML, Reptile) optimizes for fast adaptation when many related tasks exist.
LLM in-context few-shot is inference-time learning via exemplars in the prompt — fast to ship, costly at scale.
Example quality and diversity matter more than raw shot count; plan a path to fine-tuning as labels grow.