
Semi-supervised learning explained

Most real-world datasets are largely unlabeled. A support inbox might hold two million tickets but only eight hundred that humans have tagged. A product catalog might have fifty thousand SKUs with category labels on five hundred. Semi-supervised learning (SSL) trains on both pools at once: a small supervised signal from labeled examples plus structure extracted from the unlabeled majority. Unlike pure self-supervised learning, which invents pretext tasks without any labels, SSL assumes you have some ground truth and asks how to propagate it. Unlike active learning, which chooses what to label next, SSL uses the unlabeled mass you already have. Methods range from simple pseudo-labeling to consistency regularization and graph propagation. This guide covers the assumptions SSL relies on, the major techniques, a worked Harbor support-ticket routing example, a technique decision table, common pitfalls, and a production checklist.

Where semi-supervised learning sits in the ML landscape

Machine learning paradigms differ by how much supervision you can afford:

Supervised learning — every training example has a label. Accurate but expensive at scale.
Unsupervised learning — no labels; discover clusters, anomalies, or latent structure.
Semi-supervised learning — a labeled subset plus a large unlabeled set drawn from the same distribution.
Self-supervised learning — no manual labels, but automatic targets from the data itself (predict masked tokens, match augmented views).
Active learning — iteratively choose which unlabeled points to send for human annotation.

In practice these blend. A team might pre-train with self-supervision, fine-tune on 1,000 labels with FixMatch-style SSL, then run active learning to label the next thousand highest-value examples. The shared goal is to approach fully supervised accuracy with less human effort.

Assumptions: when unlabeled data actually helps

SSL is not magic. Gains appear only when unlabeled examples share structure with labeled ones. Classical theory names three assumptions:

Smoothness assumption

Nearby points in input space should share labels. Small perturbations (crop, paraphrase, noise) should not flip the prediction. Consistency regularization methods encode this directly.

Cluster assumption

Data form clusters and points in the same cluster likely share a class. If unlabeled mass sits between labeled clusters of different classes, pseudo-labels will be wrong and self-reinforcing.

Manifold assumption

High-dimensional data lie on or near a lower-dimensional manifold, and labels vary smoothly along it. Together with the cluster assumption, this implies that decision boundaries should cut across low-density regions, not through dense clouds of unlabeled points. Graph-based SSL exploits the manifold structure by propagating labels along edges connecting similar examples.

When these fail — heavy covariate shift between labeled and unlabeled pools, adversarial unlabeled data, or labels that do not correlate with geometry — SSL can hurt accuracy versus supervised-only baselines. Always benchmark against a strong supervised model trained on labels alone before trusting SSL gains.

Pseudo-labeling and self-training

The oldest SSL recipe is pseudo-labeling (also called self-training):

Train a model on labeled data only.
Run inference on unlabeled examples.
Assign the argmax class as a pseudo-label for confident predictions (e.g. probability above 0.9).
Retrain on labeled + pseudo-labeled data combined.
Repeat until convergence or a fixed number of rounds.

Simple and surprisingly strong when classes are separable and the initial model is decent, which is often the case after transfer learning from a pre-trained backbone. Failure modes are equally simple: one wrong high-confidence prediction can poison many similar unlabeled points in the next round, because the retrained model now treats that error as ground truth. Mitigations include confidence thresholds, class-balanced sampling of pseudo-labels, and stopping after one or two iterations instead of looping until collapse.
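The loop above can be sketched in a few lines. This is a minimal illustration, not a production recipe: it assumes 1-D features and a nearest-centroid classifier standing in for a real model, and the names `self_train`, `centroids`, and `predict_proba` are hypothetical. The threshold of 0.9 and the two-round cap come from the text.

```python
import math

def centroids(points, labels):
    """Mean of each class's points (1-D features keep the sketch short)."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict_proba(x, cents):
    """Softmax over negative squared distances to each class centroid."""
    scores = {c: -(x - m) ** 2 for c, m in cents.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.9, max_rounds=2):
    """Pseudo-label confident points, retrain, and stop after max_rounds."""
    train_x, train_y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        cents = centroids(train_x, train_y)
        remaining = []
        for x in pool:
            probs = predict_proba(x, cents)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                train_x.append(x)
                train_y.append(label)
            else:
                remaining.append(x)  # too uncertain; leave unlabeled
        if len(remaining) == len(pool):
            break  # nothing new was accepted this round
        pool = remaining
    return centroids(train_x, train_y), pool

# Toy data (made up for illustration): two tight groups plus one midpoint.
model, still_unlabeled = self_train(
    labeled_x=[0.0, 0.2, 4.0, 4.2],
    labeled_y=["a", "a", "b", "b"],
    unlabeled_x=[0.1, 0.3, 3.9, 4.4, 2.1],
)
```

The point halfway between the two groups never clears the threshold, so it stays in `still_unlabeled` instead of being forced into a class.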

Co-training and multi-view learning

When each example has two independent views (web page text + HTML metadata, image + caption), train separate classifiers per view. Each view labels the other’s unlabeled data where it is confident. Co-training works when views are conditionally independent given the class — a strong assumption but effective on classic text+hyperlink tasks.
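A toy sketch of the idea, under heavy simplifying assumptions: each view is a single number, each per-view classifier is a nearest-centroid rule, and "confident" means the nearest class beats the runner-up by a fixed margin. All function names and the data are hypothetical. When both views could label an example, this sketch lets view A win, which a real implementation would handle more carefully.

```python
def fit_centroids(xs, ys):
    """Per-class mean of a single 1-D view."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def confident_label(x, cents, margin):
    """Nearest class if it beats the runner-up by `margin`, else None.

    Assumes at least two classes are present.
    """
    ranked = sorted(cents, key=lambda c: abs(x - cents[c]))
    best, second = ranked[0], ranked[1]
    if abs(x - cents[second]) - abs(x - cents[best]) >= margin:
        return best
    return None

def co_train(labeled, unlabeled, margin=1.0, rounds=2):
    """labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b)."""
    train = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        labels = [y for _, y in train]
        model_a = fit_centroids([v[0] for v, _ in train], labels)
        model_b = fit_centroids([v[1] for v, _ in train], labels)
        remaining = []
        for views in pool:
            # Either view may label the example; the other view then trains on it.
            label = confident_label(views[0], model_a, margin)
            if label is None:
                label = confident_label(views[1], model_b, margin)
            if label is None:
                remaining.append(views)
            else:
                train.append((views, label))
        pool = remaining
    return train, pool

train, leftover = co_train(
    labeled=[((0.0, 10.0), "spam"), ((6.0, 20.0), "ham")],
    unlabeled=[(0.5, 15.0), (3.0, 19.0)],
)
```

Here view A is ambiguous about the second example (it sits midway between the view-A centroids), but view B is not, so the example still enters the training pool and view A learns from it in the next round.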

Consistency regularization

Modern deep SSL rarely trusts a single forward pass. Instead, the model must output consistent predictions under stochastic perturbations of the same input:

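Π-model / Mean Teacher — the Π-model passes two stochastic augmentations of the same input through one network and penalizes disagreement between the two outputs. Mean Teacher replaces the second pass with a teacher network whose weights are an exponential moving average (EMA) of the student's. In both, loss = supervised loss on the labeled batch + MSE or KL between the two predictions on the unlabeled batch.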
FixMatch — weak augmentation (flip, crop) produces pseudo-label if confidence exceeds threshold; strong augmentation (RandAugment, Cutout) must match that pseudo-label. Became a standard vision SSL baseline because it is simple and reproducible.
UDA, MixMatch, ReMixMatch — combine consistency with sharpening, mixing unlabeled pairs, and entropy minimization in various combinations.
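The EMA teacher and the MSE consistency term can be illustrated with plain lists standing in for weight tensors. This is a sketch only: `ema_update` and `consistency_mse` are hypothetical names, and the decay of 0.5 is chosen to make the arithmetic easy to follow (real training typically uses a decay much closer to 1).

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def consistency_mse(student_probs, teacher_probs):
    """Mean squared difference between two probability vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)

teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
# The teacher has moved 0 -> 0.5 -> 0.75 -> 0.875 of the way to the student.

penalty = consistency_mse([0.9, 0.1], [0.7, 0.3])
```

Because the teacher lags the student, its predictions change more slowly from step to step, which is what makes it a steadier target than the student's own output.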

The supervised term anchors the model to ground truth; the consistency term spreads label signal across the unlabeled manifold. Hyperparameters that matter: confidence threshold (too low → noise; too high → unused data), loss weight between supervised and unsupervised terms, and augmentation strength.
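To make the two terms and the role of the threshold concrete, here is a framework-free sketch of a FixMatch-style loss on raw logits. It assumes the model has already produced logits for the weak and strong views; the function names, the default threshold of 0.95, and the toy logits are illustrative assumptions, not values from this guide.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class index."""
    return -math.log(softmax(logits)[target])

def fixmatch_loss(labeled, unlabeled, threshold=0.95, unsup_weight=1.0):
    """labeled: list of (logits, true_class).
    unlabeled: list of (weak_view_logits, strong_view_logits).
    Returns (total_loss, number_of_unlabeled_examples_used).
    """
    sup = sum(cross_entropy(lg, y) for lg, y in labeled) / len(labeled)
    unsup_total, used = 0.0, 0
    for weak_logits, strong_logits in unlabeled:
        weak_probs = softmax(weak_logits)
        pseudo = max(range(len(weak_probs)), key=weak_probs.__getitem__)
        if weak_probs[pseudo] >= threshold:
            # The strong view must agree with the weak view's pseudo-label.
            unsup_total += cross_entropy(strong_logits, pseudo)
            used += 1
    # Average over the whole unlabeled batch, so masked examples contribute zero.
    unsup = unsup_total / len(unlabeled)
    return sup + unsup_weight * unsup, used

labeled = [([2.0, 0.0], 0)]
unlabeled = [([5.0, 0.0], [1.0, 0.0]), ([0.2, 0.0], [0.0, 3.0])]
strict_loss, strict_used = fixmatch_loss(labeled, unlabeled, threshold=0.95)
loose_loss, loose_used = fixmatch_loss(labeled, unlabeled, threshold=0.5)
```

With the strict threshold only the first unlabeled example contributes. Lowering the threshold admits the second one, whose weak and strong views disagree, and the loss jumps: a small picture of how a low threshold lets noisy pseudo-labels into training.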

Entropy minimization and graph-based methods

Entropy minimization adds a loss encouraging sharp (low-entropy) predictions on unlabeled data — the model should commit, not hedge at 0.51/0.49 on every point. Combined with consistency, this pushes decision boundaries away from dense regions. Alone, it can collapse to predicting one class everywhere; never use it without a supervised anchor.
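A quick numeric check of what "sharp" means, using Shannon entropy in nats. The `entropy` helper is a hypothetical name for this sketch.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

hedged = entropy([0.51, 0.49])     # close to log(2), the two-class maximum
committed = entropy([0.99, 0.01])  # close to zero
collapsed = entropy([1.0, 0.0])    # exactly zero
```

The last line shows why the term needs a supervised anchor: a model that outputs the same one-hot prediction for every input also achieves zero entropy.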

Graph-based SSL builds a k-nearest-neighbor graph over embeddings, then propagates labels via label spreading or harmonic functions. Strong when data are moderately sized and similarity is meaningful (document TF-IDF, shallow embeddings). Weak at web scale without approximate nearest neighbor indexes; deep consistency methods usually scale better for images and text with transformers.
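The graph approach can be sketched without any library. The version below is a simplified label propagation with clamped seeds (each unlabeled node repeatedly takes the average of its neighbors' class scores), not a faithful implementation of any specific published algorithm. Function names and the toy points are assumptions for illustration.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbor adjacency sets using squared Euclidean distance."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)  # keep the graph undirected
    return neighbors

def propagate_labels(points, seeds, num_classes, k=2, iterations=50):
    """seeds maps point index -> class index; returns a predicted class per point."""
    neighbors = knn_graph(points, k)
    scores = [[0.0] * num_classes for _ in points]
    for i, c in seeds.items():
        scores[i][c] = 1.0
    for _ in range(iterations):
        updated = []
        for i, nbrs in enumerate(neighbors):
            if i in seeds:
                updated.append(scores[i])  # clamp labeled points to their true label
            else:
                updated.append([
                    sum(scores[j][c] for j in nbrs) / len(nbrs)
                    for c in range(num_classes)
                ])
        scores = updated
    return [max(range(num_classes), key=row.__getitem__) for row in scores]

# Two well-separated groups with one labeled seed each (toy data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
predicted = propagate_labels(points, seeds={0: 0, 3: 1}, num_classes=2)
```

The all-pairs distance computation in `knn_graph` is quadratic in the number of points, which is the scaling problem the paragraph above mentions; at larger sizes an approximate nearest-neighbor index would replace it.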

Worked example: Harbor support ticket routing

Harbor Ops runs an internal help desk. Annotators have labeled 600 tickets into five queues (billing, technical, account, abuse, other). The warehouse holds 180,000 historical tickets with no labels. Goal: route new tickets automatically with at least 88% macro-F1.

Baseline

Fine-tune a small text encoder on labeled tickets only: 79% macro-F1 on a held-out 120-ticket test set. Small classes such as abuse (40 training examples) drag the average down.
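Macro-F1 is the unweighted mean of per-class F1, which is why one weak minority class pulls the headline number down even when accuracy looks healthy. The sketch below uses made-up toy labels, not Harbor's data, and `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    return sum(per_class.values()) / len(per_class), per_class

# Ten toy tickets: one of the two abuse tickets is misrouted to billing.
y_true = ["billing"] * 8 + ["abuse"] * 2
y_pred = ["billing"] * 8 + ["billing", "abuse"]
score, per_class = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Accuracy on this toy set is 0.9, yet a single abuse error drops that class's F1 to 2/3 and the macro average to roughly 0.80.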

SSL pipeline

Embed all tickets with the same encoder (frozen first epoch, then joint).
Train with FixMatch-style loss: labeled batch uses true labels; unlabeled batch uses paraphrase + dropout as weak/strong views.
Confidence threshold 0.85 for pseudo-labels; cap pseudo-labels per class per epoch to prevent majority-class flooding.
Mix 10% random unlabeled examples each epoch (no pseudo-label) for exploration.
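The thresholding and per-class cap in this pipeline might look like the sketch below. The 0.85 threshold comes from the steps above; the cap of 2, the ticket IDs, and the function name `select_pseudo_labels` are hypothetical.

```python
from collections import defaultdict

def select_pseudo_labels(predictions, threshold=0.85, cap_per_class=2):
    """predictions: list of (example_id, class_name, confidence).

    Drops predictions below the threshold, then keeps at most
    cap_per_class examples per class, most confident first.
    """
    by_class = defaultdict(list)
    for example_id, cls, conf in predictions:
        if conf >= threshold:
            by_class[cls].append((conf, example_id))
    selected = {}
    for cls, items in by_class.items():
        items.sort(reverse=True)
        for _, example_id in items[:cap_per_class]:
            selected[example_id] = cls
    return selected

predictions = [
    ("t1", "billing", 0.99),
    ("t2", "billing", 0.97),
    ("t3", "billing", 0.91),  # confident, but over the billing cap
    ("t4", "abuse", 0.88),
    ("t5", "abuse", 0.60),    # below the threshold
]
selected = select_pseudo_labels(predictions)
```

Logging `len(selected)` per class each epoch gives the acceptance-rate signal that makes majority-class flooding visible early.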

Result

After three epochs, macro-F1 rises to 86%, still short of the 88% target. The team layers in active learning: 200 additional human labels on high-entropy abuse and billing tickets, then one more SSL training run, which reaches 89% macro-F1. Total human labels: 800 (600 + 200), versus the roughly 2,400 that purely supervised training would have needed for the same score (estimated from learning curves).

Lesson: SSL is a multiplier on scarce labels, not a replacement for fixing class imbalance or ambiguous annotation guidelines.

Choosing a semi-supervised technique

Scenario	Start here	Why
Vision, >10k unlabeled, pre-trained CNN/ViT	FixMatch or Mean Teacher	Consistency + augmentations are mature and well-benchmarked
Text classification, transformer backbone	Pseudo-label + consistency (UDA-style)	Dropout and paraphrase views are cheap
Two natural views per example	Co-training	Independent views reduce confirmation bias
<5k total points, interpretable similarity	Graph label spreading	Simple, no deep training loop required
Zero labels initially	Self-supervised pre-training first	Semi-supervised training needs some anchor labels; pre-train, collect a labeled seed set, then apply SSL
Labels expensive, unlabeled pool still growing	SSL + active learning	SSL squeezes the existing unlabeled mass; active learning picks the next labels

Semi-supervised learning vs related paradigms

Paradigm	Labeled data needed	Unlabeled data role	Best when
Supervised only	Large	Ignored	Labels are cheap or dataset is small
Semi-supervised	Small seed	Direct training signal via pseudo-labels / consistency	Same distribution, abundant unlabeled data
Self-supervised	None for pretext	Creates pretext targets	Pre-train backbone before any downstream labels
Active learning	Grows iteratively	Queried for labeling	Label budget is the bottleneck, not compute

Anti-patterns

Confidence threshold too low — pseudo-label noise compounds each self-training round until the model drifts.
Ignoring class imbalance — majority-class pseudo-labels swamp rare classes; use per-class caps and weighted supervised loss.
Train/test leakage — unlabeled pool must not include test examples; SSL amplifies leakage into inflated metrics.
Distribution mismatch — for example, unlabeled data from 2024 product lines paired with labels drawn only from 2022 SKUs; SSL then propagates stale decisions onto the new data.
Skipping supervised baseline — report gain over labeled-only training, not only over random guessing.
Infinite self-training loops — two rounds with monitoring often beat ten rounds that amplify errors.
No calibration check — high softmax confidence does not mean correct; track calibration on a labeled validation slice.
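For the calibration check in the last item, one common summary is expected calibration error (ECE): bin predictions by confidence, then compare average confidence with accuracy inside each bin. The sketch below is a minimal version with equal-width bins; the function name and toy inputs are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy within each bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Overconfident: 95% confidence but only half the predictions are right.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# Well calibrated: 75% confidence and three of four predictions are right.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

A model like the first one would pass a 0.9 pseudo-label threshold on every example while being wrong half the time, which is why the check belongs on a labeled validation slice before self-training starts.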

Production checklist

Verify labeled and unlabeled pools share the same distribution (or document shift).
Establish a supervised-only baseline on the current label set.
Start from a transfer-learned or self-supervised backbone when possible.
Use confidence thresholds and per-class pseudo-label quotas.
Combine consistency loss with a strong supervised term — never unlabeled-only.
Hold out a frozen test set; monitor macro-F1 and per-class F1, not accuracy alone.
Limit self-training iterations; watch for pseudo-label error rate drift.
Plan hybrid with active learning if SSL plateaus below target.
Log pseudo-label acceptance rates and version models for reproducibility.

Key takeaways

Semi-supervised learning trains on a small labeled set plus a large unlabeled pool from the same distribution.
Gains depend on the smoothness, cluster, and manifold assumptions; when they are violated, SSL can hurt accuracy relative to a supervised-only baseline.
Pseudo-labeling is the simple baseline; consistency regularization (FixMatch, Mean Teacher) is the modern deep-learning default.
SSL pairs naturally with transfer learning, self-supervision, and active learning in production pipelines.
Always benchmark against supervised-only training and guard against confirmation bias in self-training loops.