Guide
Semi-supervised learning explained
Most real-world datasets are largely unlabeled. A support inbox might hold two million tickets, of which humans have tagged only eight hundred. A product catalog might list fifty thousand SKUs with category labels on five hundred. Semi-supervised learning (SSL) trains on both pools at once: a small supervised signal from labeled examples plus structure extracted from the unlabeled majority. Unlike pure self-supervised learning, which invents pretext tasks without any labels, SSL assumes you have some ground truth and asks how to propagate it. Unlike active learning, which chooses what to label next, SSL uses the unlabeled mass you already have. Methods range from simple pseudo-labeling to consistency regularization and graph propagation. This guide covers the assumptions SSL relies on, the major techniques, a worked example on Harbor support-ticket routing, a technique decision table, common pitfalls, and a production checklist.
Where semi-supervised learning sits in the ML landscape
Machine learning paradigms differ by how much supervision you can afford:
- Supervised learning — every training example has a label. Accurate but expensive at scale.
- Unsupervised learning — no labels; discover clusters, anomalies, or latent structure.
- Semi-supervised learning — a labeled subset plus a large unlabeled set drawn from the same distribution.
- Self-supervised learning — no manual labels, but automatic targets from the data itself (predict masked tokens, match augmented views).
- Active learning — iteratively choose which unlabeled points to send for human annotation.
In practice these blend. A team might pre-train with self-supervision, fine-tune on 1,000 labels with FixMatch-style SSL, then run active learning to label the next thousand highest-value examples. The unifying goal is the same: reach supervised accuracy with less human effort.
Assumptions: when unlabeled data actually helps
SSL is not magic. Gains appear only when unlabeled examples share structure with labeled ones. Classical theory names three assumptions:
Smoothness assumption
Nearby points in input space should share labels. Small perturbations (crop, paraphrase, noise) should not flip the prediction. Consistency regularization methods encode this directly.
Cluster assumption
Data form clusters, and points in the same cluster likely share a class. If unlabeled mass sits between labeled clusters of different classes, pseudo-labels in that region tend to be wrong and self-reinforcing.
Manifold assumption
High-dimensional data lie on a lower-dimensional manifold. Decision boundaries should cut across low-density regions, not through dense clouds of unlabeled points. Graph-based SSL exploits this by propagating labels along edges connecting similar examples.
When these fail — heavy covariate shift between labeled and unlabeled pools, adversarial unlabeled data, or labels that do not correlate with geometry — SSL can hurt accuracy versus supervised-only baselines. Always benchmark against a strong supervised model trained on labels alone before trusting SSL gains.
Pseudo-labeling and self-training
The oldest SSL recipe is pseudo-labeling (also called self-training):
- Train a model on labeled data only.
- Run inference on unlabeled examples.
- Assign the argmax class as a pseudo-label for confident predictions (e.g. probability above 0.9).
- Retrain on labeled + pseudo-labeled data combined.
- Repeat until convergence or a fixed number of rounds.
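The inference and thresholding steps above can be sketched as a small filter. This is a minimal illustration rather than a full training loop; the function name `select_pseudo_labels` and the toy probabilities are hypothetical, and the 0.9 threshold simply mirrors the example value in the list.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (index, class) pairs for predictions above the threshold.

    probabilities: one list of per-class probabilities per unlabeled example.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        # argmax over classes
        best_class = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_class] > threshold:
            selected.append((index, best_class))
    return selected


# Hypothetical model outputs for three unlabeled examples.
unlabeled_probs = [
    [0.95, 0.03, 0.02],  # confident: kept as class 0
    [0.50, 0.45, 0.05],  # ambiguous: skipped
    [0.04, 0.05, 0.91],  # confident: kept as class 2
]
pseudo = select_pseudo_labels(unlabeled_probs, threshold=0.9)
print(pseudo)  # [(0, 0), (2, 2)]
```

Only the selected pairs would be merged with the labeled set before retraining; the skipped examples stay in the unlabeled pool for a later round.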
The recipe is simple and surprisingly strong when classes are separable and the initial model is decent, which is often the case after transfer learning from a pre-trained backbone. The failure mode is equally simple: a wrong high-confidence prediction enters the training set as if it were ground truth, and the retrained model then mislabels similar unlabeled points with even more confidence in the next round. Mitigations include confidence thresholds, class-balanced sampling of pseudo-labels, and stopping after one or two iterations instead of looping until collapse.
Co-training and multi-view learning
When each example has two independent views (web page text + HTML metadata, image + caption), train separate classifiers per view. Each view labels the other’s unlabeled data where it is confident. Co-training works when views are conditionally independent given the class — a strong assumption but effective on classic text+hyperlink tasks.
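One exchange round of co-training might look like the sketch below. It assumes each view's classifier has already produced class probabilities for the same unlabeled examples; the function name, the 0.9 threshold, and the toy numbers are illustrative assumptions, and retraining each classifier on the exchanged labels is left out.

```python
def cotraining_exchange(probs_view_a, probs_view_b, threshold=0.9):
    """Each view proposes labels for the other view where it is confident.

    Returns (labels_for_a, labels_for_b) as dicts: example index -> class.
    """
    def confident_class(probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        return best if probs[best] >= threshold else None

    labels_for_a, labels_for_b = {}, {}
    for i, (pa, pb) in enumerate(zip(probs_view_a, probs_view_b)):
        label_from_a = confident_class(pa)
        label_from_b = confident_class(pb)
        if label_from_a is not None:
            labels_for_b[i] = label_from_a  # view A teaches view B
        if label_from_b is not None:
            labels_for_a[i] = label_from_b  # view B teaches view A
    return labels_for_a, labels_for_b


# Hypothetical outputs: a text classifier and a metadata classifier.
text_probs = [[0.97, 0.03], [0.55, 0.45], [0.20, 0.80]]
meta_probs = [[0.60, 0.40], [0.05, 0.95], [0.48, 0.52]]
labels_for_text, labels_for_meta = cotraining_exchange(text_probs, meta_probs)
print(labels_for_text)  # {1: 1}
print(labels_for_meta)  # {0: 0}
```

Each classifier receives labels only on examples where the other view was confident, which is where the conditional-independence assumption does its work: the teaching view's errors should not line up with the learning view's blind spots.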
Consistency regularization
Modern deep SSL rarely trusts a single forward pass. Instead, the model must output consistent predictions under stochastic perturbations of the same input:
- Π-model: the same network processes two stochastic views of each input (different augmentation and dropout), and the loss penalizes disagreement between the two predictions.
- Mean Teacher: the student network sees one augmentation; a teacher network, whose weights are an exponential moving average (EMA) of the student's, sees a different augmentation. Loss = supervised loss on labeled batch + MSE or KL between student and teacher predictions on unlabeled batch.
- FixMatch: weak augmentation (flip, crop) produces a pseudo-label if confidence exceeds a threshold; the prediction under strong augmentation (RandAugment, Cutout) must match that pseudo-label. It became a standard vision SSL baseline because it is simple and reproducible.
- UDA, MixMatch, ReMixMatch: combine consistency with sharpening, mixing unlabeled pairs, and entropy minimization in various combinations.
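The EMA teacher in Mean Teacher is the least obvious moving part, so here is a minimal sketch of the weight update. Weights are plain lists of floats for illustration, and the `decay` values are assumptions; in a real framework the same rule would be applied to every parameter tensor after each optimizer step.

```python
def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight toward the student's:
    teacher = decay * teacher + (1 - decay) * student
    """
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


teacher = [0.0, 1.0]
student = [1.0, 1.0]
# decay=0.5 is unrealistically low; it is used here so the drift is visible.
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
print(teacher)  # [0.875, 1.0]
```

The teacher never receives gradients. It trails the student smoothly, which tends to make its predictions on unlabeled data more stable than the student's own.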
The supervised term anchors the model to ground truth; the consistency term spreads label signal across the unlabeled manifold. Hyperparameters that matter: confidence threshold (too low → noise; too high → unused data), loss weight between supervised and unsupervised terms, and augmentation strength.
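To make the two loss terms and the hyperparameters concrete, the sketch below computes a FixMatch-style loss from already-computed class probabilities. It is an illustration under assumptions: the function names, the 0.95 threshold, and the unit loss weight are placeholders, and the model forward passes and augmentations are not shown.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability of the target class (clipped for safety)."""
    return -math.log(max(probs[target], 1e-12))


def fixmatch_style_loss(labeled, unlabeled, threshold=0.95, unsup_weight=1.0):
    """labeled: list of (probs, true_label).
    unlabeled: list of (weak_view_probs, strong_view_probs) per example.
    Returns (total, supervised_term, unsupervised_term).
    """
    sup = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)

    unsup_total = 0.0
    for weak_probs, strong_probs in unlabeled:
        pseudo = max(range(len(weak_probs)), key=weak_probs.__getitem__)
        if weak_probs[pseudo] >= threshold:
            # Strong view must agree with the weak view's pseudo-label.
            unsup_total += cross_entropy(strong_probs, pseudo)
    # Examples below the threshold contribute zero but still count in the mean.
    unsup = unsup_total / len(unlabeled)

    return sup + unsup_weight * unsup, sup, unsup


labeled_batch = [([0.7, 0.3], 0)]
unlabeled_batch = [
    ([0.98, 0.02], [0.6, 0.4]),  # confident weak view: used
    ([0.60, 0.40], [0.1, 0.9]),  # unconfident weak view: masked out
]
total, sup, unsup = fixmatch_style_loss(labeled_batch, unlabeled_batch)
print(round(total, 3), round(sup, 3), round(unsup, 3))
```

Raising `threshold` masks more unlabeled examples (less noise, less data used), while `unsup_weight` trades the supervised anchor against the consistency signal.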
Entropy minimization and graph-based methods
Entropy minimization adds a loss encouraging sharp (low-entropy) predictions on unlabeled data — the model should commit, not hedge at 0.51/0.49 on every point. Combined with consistency, this pushes decision boundaries away from dense regions. Alone, it can collapse to predicting one class everywhere; never use it without a supervised anchor.
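A quick calculation shows what the entropy term rewards and why it needs an anchor. The sketch uses Shannon entropy in nats; the probability vectors are toy values chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


hedged = entropy([0.51, 0.49])     # close to the two-class maximum, ln(2)
committed = entropy([0.99, 0.01])  # much lower
collapsed = entropy([1.0, 0.0])    # zero, whether or not the class is right

print(committed < hedged)  # True
```

The `collapsed` case is the failure mode described above: predicting one class with full confidence on every input minimizes this loss perfectly, which is why the term is only useful alongside a supervised loss.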
Graph-based SSL builds a k-nearest-neighbor graph over embeddings, then propagates labels via label spreading or harmonic functions. Strong when data are moderately sized and similarity is meaningful (document TF-IDF, shallow embeddings). Weak at web scale without approximate nearest neighbor indexes; deep consistency methods usually scale better for images and text with transformers.
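The propagation idea can be sketched on a toy graph. The code below is a simplified clamped-averaging scheme in the spirit of harmonic-function label propagation, not a reproduction of any specific library's algorithm. It assumes the neighbor graph has already been built (for example from k-nearest neighbors over embeddings) and that every node has at least one neighbor.

```python
def propagate_labels(neighbors, seed_labels, num_classes, iterations=50):
    """neighbors: dict node -> list of adjacent nodes (undirected graph).
    seed_labels: dict node -> class for labeled nodes; these stay clamped.
    Returns dict node -> predicted class.
    """
    scores = {node: [0.0] * num_classes for node in neighbors}
    for node, label in seed_labels.items():
        scores[node][label] = 1.0

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seed_labels:
                updated[node] = scores[node]  # keep ground truth fixed
            else:
                # Average the class scores of adjacent nodes.
                updated[node] = [
                    sum(scores[m][c] for m in nbrs) / len(nbrs)
                    for c in range(num_classes)
                ]
        scores = updated

    return {
        node: max(range(num_classes), key=s.__getitem__)
        for node, s in scores.items()
    }


# A chain of six nodes with one labeled node at each end.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
predicted = propagate_labels(chain, seed_labels={0: 0, 5: 1}, num_classes=2)
print(predicted)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Each unlabeled node ends up with the class of the seed it is closer to along the graph, which is the manifold assumption in miniature.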
Worked example: Harbor support ticket routing
Harbor Ops runs an internal help desk. Annotators have labeled 600 tickets into five queues (billing, technical, account, abuse, other). The warehouse holds 180,000 historical tickets with no labels. Goal: route new tickets automatically with at least 88% macro-F1.
Baseline
Fine-tune a small text encoder on labeled tickets only: 79% macro-F1 on a held-out 120-ticket test set. Classes like abuse (40 training examples) drag the average.
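Macro-F1 is the unweighted mean of per-class F1 scores, which is why a weak rare class pulls it down more than it pulls down accuracy. The sketch below shows the computation on made-up predictions; the ten-ticket sample is illustrative and is not Harbor's data.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy case: the model predicts the majority class for everything.
y_true = ["billing"] * 8 + ["abuse"] * 2
y_pred = ["billing"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, classes=["billing", "abuse"])
print(accuracy)         # 0.8
print(round(score, 3))  # 0.444
```

Accuracy looks acceptable at 0.8, but the missed rare class halves the macro average.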
SSL pipeline
- Embed all tickets with the same encoder (frozen first epoch, then joint).
- Train with FixMatch-style loss: labeled batch uses true labels; unlabeled batch uses paraphrase + dropout as weak/strong views.
- Confidence threshold 0.85 for pseudo-labels; cap pseudo-labels per class per epoch to prevent majority-class flooding.
- Mix 10% random unlabeled examples each epoch (no pseudo-label) for exploration.
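The per-class cap in the pipeline above could be implemented along these lines. This is a sketch under assumptions: the tuple layout, the function name `cap_per_class`, and the ticket IDs are hypothetical, and candidates are assumed to have already passed the confidence threshold.

```python
def cap_per_class(candidates, max_per_class):
    """candidates: list of (example_id, predicted_class, confidence).
    Keeps the most confident candidates, at most max_per_class per class.
    """
    kept = []
    counts = {}
    ranked = sorted(candidates, key=lambda item: item[2], reverse=True)
    for example_id, cls, confidence in ranked:
        if counts.get(cls, 0) < max_per_class:
            kept.append((example_id, cls, confidence))
            counts[cls] = counts.get(cls, 0) + 1
    return kept


candidates = [
    ("t1", "billing", 0.99),
    ("t2", "billing", 0.97),
    ("t3", "billing", 0.91),
    ("t4", "abuse", 0.88),
    ("t5", "billing", 0.86),
]
kept = cap_per_class(candidates, max_per_class=2)
print([example_id for example_id, _, _ in kept])  # ['t1', 't2', 't4']
```

Without the cap, four of the five pseudo-labels would be billing; with it, the rare abuse example keeps a meaningful share of the batch.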
Result
After three epochs, macro-F1 rises to 86%, still short of the 88% target. The team layers on active learning: 200 additional human labels on high-entropy abuse and billing tickets, then one more SSL training run, which reaches 89% macro-F1. Total human labels: 800 (600 + 200), compared with the roughly 2,400 that purely supervised training would have needed for the same score (estimated from learning curves).
Lesson: SSL is a multiplier on scarce labels, not a replacement for fixing class imbalance or ambiguous annotation guidelines.
Choosing a semi-supervised technique
| Scenario | Start here | Why |
|---|---|---|
| Vision, >10k unlabeled, pre-trained CNN/ViT | FixMatch or Mean Teacher | Consistency + augmentations are mature and well-benchmarked |
| Text classification, transformer backbone | Pseudo-label + consistency (UDA-style) | Dropout and paraphrase views are cheap |
| Two natural views per example | Co-training | Independent views reduce confirmation bias |
| <5k total points, interpretable similarity | Graph label spreading | Simple, no deep training loop required |
| Zero labels initially | Self-supervised pre-training first | Semi-supervised methods need anchor labels; pre-train without labels, then apply SSL once a seed set is annotated |
| Labels expensive, unlabeled pool still growing | SSL training + active learning | SSL squeezes the existing unlabeled mass; active learning picks the next labels |
Semi-supervised learning vs related paradigms
| Paradigm | Labeled data needed | Unlabeled data role | Best when |
|---|---|---|---|
| Supervised only | Large | Ignored | Labels are cheap or dataset is small |
| Semi-supervised | Small seed | Direct training signal via pseudo-labels / consistency | Same distribution, abundant unlabeled data |
| Self-supervised | None for pretext | Creates pretext targets | Pre-train backbone before any downstream labels |
| Active learning | Grows iteratively | Queried for labeling | Label budget is the bottleneck, not compute |
Anti-patterns
- Confidence threshold too low — pseudo-label noise compounds each self-training round until the model drifts.
- Ignoring class imbalance — majority-class pseudo-labels swamp rare classes; use per-class caps and weighted supervised loss.
- Train/test leakage — unlabeled pool must not include test examples; SSL amplifies leakage into inflated metrics.
- Distribution mismatch — unlabeled data from 2024 product lines with labels only from 2022 SKUs; SSL propagates stale decisions.
- Skipping supervised baseline — report gain over labeled-only training, not only over random guessing.
- Infinite self-training loops — two rounds with monitoring often beat ten rounds that amplify errors.
- No calibration check — high softmax confidence does not mean correct; track calibration on a labeled validation slice.
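For the calibration check in the last item, one common summary is expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its accuracy. The sketch below is a minimal version under assumptions; the bin count and the sample values are illustrative, and the inputs are assumed to come from a labeled validation slice.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """confidences: max predicted probability per example, in (0, 1].
    correct: 1 if the prediction was right, else 0.
    Returns the bin-size-weighted gap between confidence and accuracy.
    """
    total = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        low, high = b / num_bins, (b + 1) / num_bins
        in_bin = [i for i, c in enumerate(confidences) if low < c <= high]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / total * abs(avg_conf - accuracy)
    return ece


# Four very confident predictions that are right only half the time,
# plus two moderately confident ones.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 0, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 2))  # 0.35
```

A large gap in the high-confidence bins is a warning that a threshold such as 0.9 is admitting many wrong pseudo-labels.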
Production checklist
- Verify labeled and unlabeled pools share the same distribution (or document shift).
- Establish a supervised-only baseline on the current label set.
- Start from a transfer-learned or self-supervised backbone when possible.
- Use confidence thresholds and per-class pseudo-label quotas.
- Combine consistency loss with a strong supervised term — never unlabeled-only.
- Hold out a frozen test set; monitor macro-F1 per class, not accuracy alone.
- Limit self-training iterations; watch for pseudo-label error rate drift.
- Plan hybrid with active learning if SSL plateaus below target.
- Log pseudo-label acceptance rates and version models for reproducibility.
Key takeaways
- Semi-supervised learning trains on a small labeled set plus a large unlabeled pool from the same distribution.
- Gains depend on smoothness, cluster, and manifold assumptions — violated assumptions make SSL harmful.
- Pseudo-labeling is the simple baseline; consistency regularization (FixMatch, Mean Teacher) is the modern deep-learning default.
- SSL pairs naturally with transfer learning, self-supervision, and active learning in production pipelines.
- Always benchmark against supervised-only training and guard against confirmation bias in self-training loops.
Related reading
- Machine learning fundamentals explained — supervised, unsupervised, and reinforcement learning basics
- Self-supervised learning explained — pretext tasks and foundation-model pre-training without labels
- Active learning explained — choosing which examples to label when SSL plateaus
- Transfer learning explained — pre-trained backbones that make pseudo-labels meaningful from round one