UMAP and t-SNE explained

Harbor Analytics trained a sentence-transformer encoder on support-ticket text and produced 768-dimensional embeddings for 42,000 active customers. Product wanted a slide showing whether churn risk clustered by industry. Running raw embeddings through PCA to two components produced an oval smear with no separation. t-SNE with perplexity=30 revealed structure but split one known enterprise segment into nine islands that did not match account labels. UMAP with n_neighbors=30 and min_dist=0.1 preserved global spacing, collapsed the false splits, and surfaced three coherent cohorts that aligned with ground-truth retention tiers.

Neither method is magic: both are nonlinear manifold learning algorithms that map high-dimensional neighborhoods into 2D or 3D for human inspection. They do not replace modeling, but they are a fast way to sanity-check whether your embeddings, sensor readings, or gene expression matrices contain separable structure before you commit to a classifier.

This guide covers how t-SNE and UMAP work, how to tune perplexity and n_neighbors, when to PCA-preprocess, Python patterns with scikit-learn and umap-learn, a Harbor Analytics worked example, a method decision table, common pitfalls, and a production checklist.

What manifold learning assumes

Real data often lies on or near a lower-dimensional manifold embedded in a high-dimensional space. Customer behavior might be described by dozens of latent factors even when you measure hundreds of features. Manifold learning algorithms try to recover that latent geometry: points that are close in the original space should stay close in the plot, and points far apart should not be artificially pulled together.

Unlike linear PCA, t-SNE and UMAP can unfold curved, clustered manifolds — which is why they dominate embedding visualization in NLP, genomics, and computer vision. The trade-off is that distances between clusters in a 2D plot are often meaningless: only local neighborhoods are trustworthy. Treat the output as exploratory, not as features you feed directly into a production classifier without validation.
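The "neighbors should stay neighbors" idea can be measured with scikit-learn's trustworthiness score. The sketch below is illustrative only: the swiss-roll dataset, the sample size, and the choice of 10 neighbors are assumptions, not part of Harbor's pipeline.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

# Synthetic curved manifold in 3D (an assumption chosen for illustration).
X, _ = make_swiss_roll(n_samples=400, noise=0.05, random_state=0)

# Linear baseline: project to the two directions of highest variance.
coords_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# Trustworthiness lies in [0, 1]. It drops when points that were NOT
# neighbors in the original space show up as neighbors in the 2D layout.
score_pca = trustworthiness(X, coords_pca, n_neighbors=10)
print(f"PCA trustworthiness: {score_pca:.3f}")
```

The same call works on t-SNE or UMAP coordinates, which gives a single number for comparing layouts of the same data.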

t-SNE step by step

t-Distributed Stochastic Neighbor Embedding (t-SNE), introduced by van der Maaten and Hinton (2008), converts high-dimensional pairwise similarities into a 2D map by minimizing the divergence between two probability distributions: one over pairs of points in the original space and one over the same pairs in the map.

High-dimensional similarities

For each point i, t-SNE centers a Gaussian on i and computes the conditional probability p_{j|i} that point j is a neighbor of i. The bandwidth of that Gaussian is set per point so that each point has roughly the same effective number of neighbors. That number is controlled by perplexity, typically 5–50. Perplexity is the knob everyone tunes first: too low and you see tiny fragmented clusters; too high and local structure washes out into a blob.
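The bandwidth search can be sketched in NumPy. This is a simplified illustration of the idea, not scikit-learn's implementation; the function name conditional_probs, the tolerance, and the random test data are all assumptions.

```python
import numpy as np

def conditional_probs(sq_dists_i, perplexity, tol=1e-5, max_iter=100):
    """Return p_{j|i} for one point i, given squared distances to all others.

    Bisects on the Gaussian precision beta = 1 / (2 * sigma_i**2) until the
    perplexity 2**H(P_i) matches the requested value.
    """
    target_entropy = np.log2(perplexity)
    beta, beta_lo, beta_hi = 1.0, 0.0, np.inf
    for _ in range(max_iter):
        # Subtracting the minimum is for numerical stability only;
        # it cancels out when p is normalized.
        p = np.exp(-beta * (sq_dists_i - sq_dists_i.min()))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:
            # Too many effective neighbors: narrow the Gaussian.
            beta_lo = beta
            beta = beta * 2 if np.isinf(beta_hi) else (beta + beta_hi) / 2
        else:
            # Too few effective neighbors: widen the Gaussian.
            beta_hi = beta
            beta = (beta + beta_lo) / 2
    return p

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # hypothetical data
i = 0
others = np.delete(X, i, axis=0)
sq_d = np.sum((X[i] - others) ** 2, axis=1)

p = conditional_probs(sq_d, perplexity=30)
effective_neighbors = 2 ** (-np.sum(p * np.log2(p + 1e-12)))
print(f"effective neighbors: {effective_neighbors:.2f}")
```

Because the search runs once per point, dense regions end up with narrow Gaussians and sparse regions with wide ones, which is why perplexity reads as a neighbor count rather than a distance.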

Low-dimensional map

In 2D, t-SNE uses a Student-t distribution (heavier tails than a Gaussian) to compute the map similarities q_{ij}. The heavy tails reduce the “crowding problem,” where moderately distant high-dimensional points get crushed together in a low-dimensional layout. Optimization minimizes the Kullback-Leibler divergence KL(P||Q) with gradient descent, usually for 1,000 or more iterations. Early exaggeration (multiplying P by 12 for the first 250 iterations) helps separated clusters form before fine-tuning.
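A minimal NumPy sketch of the two quantities in this step, assuming a dense pairwise computation that is only practical for small N. The function names are hypothetical, and the stand-in P below comes from a second random layout purely so the example runs on its own.

```python
import numpy as np

def student_t_affinities(Y):
    """Joint probabilities q_ij for a low-dimensional layout Y of shape (n, d)."""
    sq_d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq_d)       # Student-t kernel with one degree of freedom
    np.fill_diagonal(num, 0.0)     # q_ii is defined as 0
    return num / num.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over all pairs; pairs with p_ij = 0 contribute nothing."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 2))                         # candidate 2D layout
Q = student_t_affinities(Y)
P = student_t_affinities(rng.normal(size=(50, 2)))   # stand-in for the real P

print(f"KL(P||Q) = {kl_divergence(P, Q):.4f}")
```

In a real run, P comes from the perplexity-calibrated Gaussians in the original space, and gradient descent moves Y to shrink this number.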

scikit-learn usage:

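from sklearn.manifold import TSNE
# init='pca' starts from a PCA layout; random_state fixes the stochastic steps
tsne = TSNE(n_components=2, perplexity=30, learning_rate='auto',
            init='pca', random_state=42)
coords = tsne.fit_transform(X)  # X: array of shape (n_samples, n_features)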

Always set random_state for reproducibility. t-SNE layouts vary run-to-run because initialization and stochastic optimization differ.

UMAP step by step

Uniform Manifold Approximation and Projection (UMAP) (McInnes, Healy, Melville, 2018) builds a weighted k-nearest-neighbor graph in the original space, treats it as a fuzzy topological representation of the manifold, and optimizes a low-dimensional layout with cross-entropy loss.
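The graph-building step can be sketched in NumPy. This is a simplified, brute-force illustration of the idea and not umap-learn's code: the function name fuzzy_knn_graph is hypothetical, and the real library uses approximate nearest neighbors and sparse matrices.

```python
import numpy as np

def fuzzy_knn_graph(X, k=15, n_iter=64):
    """Return (directed weights A, symmetrized weights W) for a small dataset."""
    n = X.shape[0]
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbor
    nn_idx = np.argsort(D, axis=1)[:, :k]         # k nearest neighbors per point
    nn_dist = np.take_along_axis(D, nn_idx, axis=1)
    rho = nn_dist[:, 0]                           # distance to the closest neighbor
    target = np.log2(k)

    A = np.zeros((n, n))
    for i in range(n):
        shifted = np.maximum(nn_dist[i] - rho[i], 0.0)
        lo, hi, sigma = 0.0, np.inf, 1.0
        # Bisect on sigma so the k weights for point i sum to log2(k).
        for _ in range(n_iter):
            total = np.exp(-shifted / sigma).sum()
            if abs(total - target) < 1e-5:
                break
            if total > target:                    # weights too large: shrink sigma
                hi = sigma
                sigma = (lo + hi) / 2
            else:                                 # weights too small: grow sigma
                lo = sigma
                sigma = sigma * 2 if np.isinf(hi) else (lo + hi) / 2
        A[i, nn_idx[i]] = np.exp(-shifted / sigma)

    # Fuzzy set union makes the graph symmetric: w = a + b - a * b.
    W = A + A.T - A * A.T
    return A, W

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))                     # hypothetical data
A, W = fuzzy_knn_graph(X, k=15)
```

Each point's nearest neighbor always gets weight 1, so no point is left disconnected. The layout step then tries to reproduce the weights in W in 2D.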

Key hyperparameters

  • n_neighbors (default 15) — analogous to perplexity. Lower values emphasize fine local structure; higher values preserve more global topology. Harbor's customer embeddings used 30.
  • min_dist (default 0.1) — minimum separation between points in the embedding. Near 0 packs clusters tightly; toward 1 spreads points more uniformly.
  • metric — defaults to Euclidean; cosine distance is standard for text embeddings and often matches semantic similarity better.
  • n_components — 2 for slides, 3 for interactive rotation, occasionally 10–50 as a nonlinear preprocessing step (use with caution).

umap-learn usage:

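import umap  # installed as the umap-learn package
# cosine suits text embeddings; random_state makes the layout repeatable
reducer = umap.UMAP(n_neighbors=30, min_dist=0.1, metric='cosine',
                    random_state=42)
coords = reducer.fit_transform(X)  # X: array of shape (n_samples, n_features)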

UMAP is typically faster than t-SNE on large datasets (100k+ rows). If the nearest-neighbor search runs out of memory, the low_memory=True option trades some speed for a smaller memory footprint. UMAP also supports transform() on held-out points after fit(), which scikit-learn's t-SNE does not, and that is useful when you want a fixed projection for new arrivals.
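A minimal sketch of the fit-then-transform pattern, assuming umap-learn is installed. The helper name fit_then_project and its defaults are hypothetical. The import sits inside the function only so that the definition loads on machines without umap-learn.

```python
import numpy as np

def fit_then_project(X_train, X_new, n_neighbors=30, min_dist=0.1, seed=42):
    """Fit UMAP on X_train, then place X_new into the same 2D layout."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                        metric="cosine", random_state=seed)
    train_coords = reducer.fit_transform(X_train)
    # transform() positions new rows relative to the fitted layout;
    # the training coordinates are not recomputed.
    new_coords = reducer.transform(X_new)
    return train_coords, new_coords
```

Called as fit_then_project(X_train, X_new) with two arrays that share a feature dimension, it returns one (n, 2) array per input. Projected points are only as reliable as the overlap between the new data and the training distribution.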

t-SNE vs UMAP in practice

| Aspect | t-SNE | UMAP |
| --- | --- | --- |
| Local cluster detail | Excellent; tight, well-separated islands | Very good; slightly less aggressive separation |
| Global structure | Poor: cluster distances and sizes are unreliable | Better: relative spacing more meaningful |
| Speed at 100k+ points | Slow; Barnes-Hut helps but still costly | Faster with approximate nearest neighbors |
| Out-of-sample projection | Not supported (refit or use parametric variants) | transform() on new points after fit |
| Typical use | Publication-quality cluster plots, small N | Exploratory dashboards, larger N, topology-aware layouts |

Run both on a stratified subsample when stakes are high. If they disagree, the structure may be weak or hyperparameters need tuning — not every dataset has clean clusters.
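One way to draw the stratified subsample is scikit-learn's train_test_split with its stratify argument. In this sketch the helper name, the synthetic labels, and the class proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_subsample(X, labels, n_keep, seed=42):
    """Return n_keep rows whose label proportions mirror the full set."""
    X_sub, _, y_sub, _ = train_test_split(
        X, labels, train_size=n_keep, stratify=labels, random_state=seed)
    return X_sub, y_sub

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))                      # hypothetical features
labels = rng.choice(["active", "at-risk", "churned"],
                    size=5000, p=[0.7, 0.2, 0.1])   # hypothetical labels

X_sub, y_sub = stratified_subsample(X, labels, n_keep=500)
for name in np.unique(labels):
    print(name, round(float(np.mean(y_sub == name)), 3))
```

Feeding the same X_sub to both t-SNE and UMAP keeps the comparison fair, because rare classes are not dropped by chance.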

Preprocessing that actually matters

Manifold methods are sensitive to scale. Standardize continuous features (zero mean, unit variance) unless you are using the cosine metric on already-normalized embeddings.

  • PCA to 50 components first — common pipeline for t-SNE/UMAP on 768-dim embeddings. Removes noise dimensions and speeds optimization without destroying neighborhood structure.
  • Subsampling — t-SNE on 500k points is painful. Plot a random 10k–50k subsample; verify cluster labels on the full set with clustering metrics separately.
  • Remove duplicates — identical rows create degenerate k-NN graphs and phantom micro-clusters.
  • Do not use 2D coordinates as model features without cross-validation — t-SNE/UMAP leak structure from the full dataset into the layout. Fit on train only if you must reduce dimension for modeling.
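The first three items above can be combined into one helper. This is a sketch under stated assumptions: the function name prepare_for_layout is hypothetical, duplicates are matched exactly, and the standardization step should be skipped if you plan to use the cosine metric on normalized embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def prepare_for_layout(X, n_pca=50, seed=42):
    """Drop exact duplicate rows, standardize, then reduce with PCA."""
    # return_index gives the first occurrence of each unique row.
    _, keep_idx = np.unique(X, axis=0, return_index=True)
    keep_idx = np.sort(keep_idx)                 # keep the original row order
    X_scaled = StandardScaler().fit_transform(X[keep_idx])
    # PCA cannot produce more components than min(n_samples, n_features).
    n_pca = min(n_pca, X_scaled.shape[0], X_scaled.shape[1])
    X_reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(X_scaled)
    return X_reduced, keep_idx

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 768))               # hypothetical embeddings
X = np.vstack([base, base[:10]])                 # append 10 duplicate rows

X_reduced, keep_idx = prepare_for_layout(X)
print(X.shape, "->", X_reduced.shape)
```

Returning keep_idx lets you carry labels and metadata through the same filter, for example labels[keep_idx].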

Harbor Analytics customer embedding visualization

Harbor's pipeline: MiniLM sentence encoder on the last five support tickets per customer, mean-pooled to 768 dimensions, 42,000 rows. Goal: check whether churn labels (active, at-risk, churned) showed any separable structure in the embeddings before building a gradient-boosted classifier.

PCA (2 components): explained variance 18% + 9%; silhouette score on churn labels 0.04 — no visible separation.

t-SNE (perplexity=30, PCA-50 init): twelve visually distinct islands; adjusted Rand index vs churn labels 0.31. Manual review showed islands split by ticket length, not business segment — a textbook case of t-SNE over-splitting on nuisance variation.

UMAP (n_neighbors=30, min_dist=0.1, cosine, PCA-50): three broad regions with fuzzy boundaries; adjusted Rand index 0.52; at-risk customers formed a bridge between active and churned zones, matching the product hypothesis. The team used UMAP for stakeholder slides and trained the classifier on raw 768-dim embeddings (where AUC was higher than on 2D coords). The plot saved two weeks of arguing about whether churn was predictable at all.
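The comparison between full-dimensional features and 2D coordinates can be sketched with cross-validation. This example makes several substitutions, all of them assumptions: the data are synthetic, logistic regression stands in for Harbor's gradient-boosted classifier, and PCA stands in for UMAP as the 2D reducer so the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic binary problem (not Harbor's data).
X, y = make_classification(n_samples=600, n_features=40,
                           n_informative=10, random_state=0)

clf = LogisticRegression(max_iter=1000)

# AUC with all features.
auc_full = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# AUC with a 2D projection. The reducer sits inside the pipeline, so it is
# refit on each training fold and never sees the held-out rows.
reduced_clf = make_pipeline(PCA(n_components=2), clf)
auc_2d = cross_val_score(reduced_clf, X, y, cv=5, scoring="roc_auc").mean()

print(f"full features AUC: {auc_full:.3f}")
print(f"2D projection AUC: {auc_2d:.3f}")
```

Keeping the reducer inside the pipeline is the design choice that matters here: fitting it on all rows first would leak held-out structure into the 2D coordinates.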

Method decision table

| Method | Preserves | Best when |
| --- | --- | --- |
| PCA | Global variance, linear structure | Preprocessing, linear correlations, baseline scatter |
| t-SNE | Local neighborhoods | Small N, cluster visualization, publication figures |
| UMAP | Local + some global topology | Larger N, dashboards, out-of-sample transform needed |
| Autoencoder bottleneck | Learned nonlinear compression | Same model family as downstream deep net; GPU available |
| PaCMAP / TriMAP | Alternative topology balance | t-SNE and UMAP disagree; research comparisons |

For modeling pipelines, prefer PCA or supervised methods. For anomaly detection on embeddings, combine UMAP plots with isolation scores rather than eyeballing outliers alone.

Common pitfalls

  • Reading cluster distance as similarity. A wide gap between t-SNE blobs does not mean the classes are far apart in the original space.
  • Default perplexity on tiny data. Perplexity must be less than n_samples; use 5–10 on hundreds of points, 30–50 on tens of thousands.
  • Skipping PCA on high-dim embeddings. Running t-SNE directly on 768 dimensions adds noise and runtime with little benefit.
  • Different random seeds, different stories. Fix random_state and document hyperparameters in every figure caption.
  • Coloring by a nuisance variable. If islands align with sequence length, token count, or batch ID, you are visualizing artifacts.
  • Using 2D coords in train/test splits incorrectly. Fit manifold on training data only; projecting test data is UMAP-only and still requires care.
  • Expecting linearly separable classes to appear. Overlapping manifolds stay overlapping in 2D — the plot confirms overlap, not a bug.
  • Replacing quantitative metrics with plots. Always pair visuals with silhouette, ARI, or downstream classifier AUC.
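The last pitfall has a cheap remedy: compute the scores in the original feature space next to every plot. The sketch below uses synthetic blobs and k-means as assumptions. It is not how Harbor computed its reported numbers, which the case study does not specify.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic 20-dimensional data with three overlapping groups.
X, labels = make_blobs(n_samples=600, centers=3, n_features=20,
                       cluster_std=4.0, random_state=0)

# Silhouette of the known labels: how well separated are they in X itself?
sil = silhouette_score(X, labels)

# ARI: do unsupervised clusters found in X agree with the known labels?
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(labels, clusters)

print(f"silhouette (labels, original space): {sil:.3f}")
print(f"ARI (k-means vs labels): {ari:.3f}")
```

If these scores are near zero while the 2D plot shows crisp islands, the islands are more likely a layout artifact than a property of the data.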

Production checklist

  • Standardize features, or use the cosine metric where it suits the embedding type.
  • Optionally PCA to 50 components when input dimension exceeds 100.
  • Subsample to 50k rows for interactive exploration if N is larger.
  • Grid-search perplexity (t-SNE) or n_neighbors (UMAP) on {5, 15, 30, 50}.
  • Run both t-SNE and UMAP; note agreement and disagreement in the report.
  • Color by known labels and by suspected nuisance covariates (batch, length).
  • Compute ARI or silhouette on a held-out sample in the original or PCA-reduced space, not on the 2D layout alone.
  • Fix random seeds; export hyperparameters alongside every PNG/SVG.
  • Do not ship 2D coordinates as production features without CV validation.
  • Archive the subsample, parameters, and library versions for reproducibility.

Key takeaways

  • t-SNE optimizes local neighborhoods via KL divergence; perplexity controls effective neighbor count.
  • UMAP preserves more global topology and scales better; n_neighbors and min_dist are the main tuning knobs.
  • PCA preprocessing to 50 components is standard for high-dimensional embeddings before either method.
  • Plots are for exploration — cluster spacing is not a distance metric; validate with quantitative scores.
  • Harbor's workflow — UMAP for stakeholder trust, raw embeddings for the actual classifier — is the right split.
