Unsupervised learning and clustering explained

A fraud team receives millions of transactions but only a few thousand confirmed fraud labels. A product manager wants to group users by behavior before anyone has tagged personas. A data scientist needs to compress fifty correlated sensor columns into a handful of interpretable axes. Unsupervised learning solves problems where the target variable is unknown, expensive to label, or does not exist yet — and clustering is its most widely deployed technique: partition data so that points within a group are more similar to each other than to points in other groups. Unlike supervised learning, there is no ground-truth answer to optimize against, which makes algorithm choice, preprocessing, and evaluation subtler. This guide covers K-means, hierarchical and density-based clustering, how to pick the number of clusters, dimensionality reduction with PCA, real production use cases, and the failure modes that turn pretty scatter plots into bad business decisions.

Supervised vs unsupervised learning

In supervised learning you learn a mapping f(features) → label from labeled examples: spam vs not spam, click vs no click, price tomorrow. The loss function compares predictions to known answers. Unsupervised learning instead searches for structure in X alone: clusters, latent factors, anomalies, or compressed representations.

Common unsupervised tasks beyond clustering include:

Dimensionality reduction — PCA, t-SNE, UMAP compress high-dimensional data while preserving variance or neighborhood structure.
Anomaly detection — isolation forests, autoencoders, and DBSCAN noise points flag outliers without a labeled "fraud" class.
Association rules — market-basket analysis finds items frequently bought together.
Representation learning — word and image embeddings (LLM embeddings) learn dense vectors where semantic similarity is geometric proximity.

Clustering is often the first exploratory step: segment users, discover document topics, group inventory SKUs, or pre-label data for a later supervised model. It also supports retrieval systems built on vector databases: many approximate nearest-neighbor indexes partition the vectors with K-means so that a query only has to scan the closest partitions.

K-means clustering step by step

K-means is the workhorse algorithm: fast, interpretable, and easy to implement. Given n points in d dimensions and a chosen cluster count K, it iterates:

Initialize K centroids — random data points, k-means++, or domain-informed seeds.
Assign each point to the nearest centroid (usually Euclidean distance).
Update each centroid to the mean of its assigned points.
Repeat assign/update until assignments stabilize or a max iteration count is reached.

K-means minimizes the within-cluster sum of squares (inertia): the total squared distance from each point to its assigned centroid. Lower inertia means tighter clusters, but the best achievable inertia never increases as K grows (with K = n, every point is its own centroid and inertia is zero), so you cannot pick K by inertia alone.
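The assign/update loop and the inertia it reports can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the `initial_centroids` parameter, and the toy points are assumptions made for the example, and a tuned library would normally be used in practice.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    # Step 1: initialize from caller-supplied seeds, else k random data points.
    centroids = [list(c) for c in (initial_centroids or rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # Step 4: assignments stabilized.
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia

# Toy data: two well-separated groups (made up for illustration).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids, inertia = kmeans(points, k=2, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels, round(inertia, 3))
```

Calling `kmeans(points, k=len(points))` drives inertia to zero on this data, which is the reason inertia alone cannot choose K.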

Strengths: each run costs O(n · K · d · iterations) time, and the outputs are simple (centroid profiles you can name and act on). Weaknesses: it assumes roughly spherical, equal-variance clusters; it is sensitive to feature scale; it requires you to specify K upfront; and it struggles with non-convex shapes and varying density.

Always standardize numeric features before K-means; otherwise a column measured in dollars dominates one measured in clicks. Feature engineering (log transforms, one-hot encoding categoricals and treating the indicators as numeric, or embedding categoricals) can change cluster quality dramatically.
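A small sketch of z-score standardization shows why scaling matters. The helper name `standardize` and the dollar/click values are hypothetical. The function returns the fitted means and standard deviations so the same transform could be reused on new rows later.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column; returns (scaled_rows, means, stds)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead of 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    scaled = [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds

# Hypothetical rows: (spend in dollars, clicks).
rows = [(100.0, 1.0), (200.0, 2.0), (300.0, 3.0)]
scaled, means, stds = standardize(rows)
print(scaled[0])  # both columns now sit on the same scale
```

Before scaling, the dollar column spans 200 units and the click column spans 2, so Euclidean distance would be driven almost entirely by dollars. After scaling both columns contribute equally.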

Choosing K and evaluating without labels

Without ground-truth cluster IDs, evaluation is heuristic but not arbitrary. Common approaches:

Elbow method

Plot inertia against K. The "elbow", where the marginal improvement flattens, suggests a reasonable K. The elbow is subjective on noisy data, so use it as a starting point, not a verdict.
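One way to make the elbow less of an eyeball judgment is to look for the K with the sharpest bend in the inertia curve. The sketch below uses the discrete second difference, which is only one of several possible heuristics. The function name is hypothetical and the inertia values are invented for illustration, not measured.

```python
def elbow_by_second_difference(inertia_by_k):
    """Return the K with the largest bend; expects consecutive K values."""
    ks = sorted(inertia_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Large positive value: steep drop before K, shallow drop after it.
        bend = inertia_by_k[prev_k] - 2 * inertia_by_k[k] + inertia_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Illustrative numbers only (not real measurements).
illustrative_inertia = {1: 1000.0, 2: 520.0, 3: 110.0, 4: 95.0, 5: 85.0}
print(elbow_by_second_difference(illustrative_inertia))
```

On noisy curves this heuristic can jump between neighboring K values, which is consistent with treating the elbow as a starting point.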

Silhouette score

For each point, silhouette compares a, the mean distance to the other points in its own cluster, with b, the mean distance to the points in the nearest other cluster: s = (b - a) / max(a, b). The score is averaged across points and ranges over [-1, 1]; higher is better. Compare silhouette across candidate K values and algorithms. Unlike inertia, silhouette penalizes overlapping clusters.
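The silhouette computation can be written out directly. This is a brute-force O(n²) sketch for small data, using the common per-point definition s = (b - a) / max(a, b). Scoring singleton clusters as 0 is a convention assumed here, and the toy points are made up.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette over all points (brute force, small n only)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so dividing by len(own) - 1 excludes the point).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            mean(dist(p, q) for q in members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # matches the visible groups
bad = silhouette_score(points, [0, 1, 0, 1])   # splits each group in half
print(round(good, 3), round(bad, 3))
```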

Business interpretability

The best K is often the one stakeholders can act on. Five customer segments with distinct average order values and channel mix beat twelve micro-clusters that no marketing campaign can target. Profile each cluster: mean/median of key metrics, top categorical modes, sample size.
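Profiling can be as simple as grouping rows by cluster ID and summarizing a few fields. In the sketch below, the field names `order_value` and `channel` and the sample rows are hypothetical stand-ins for whatever metrics matter to the business.

```python
from collections import Counter, defaultdict
from statistics import mean, median

def profile_clusters(rows, labels, numeric_key, categorical_key):
    """Per-cluster size, mean/median of one metric, and top categorical mode."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    profile = {}
    for lab, members in sorted(groups.items()):
        values = [m[numeric_key] for m in members]
        top_mode = Counter(m[categorical_key] for m in members).most_common(1)[0][0]
        profile[lab] = {
            "size": len(members),
            "mean": mean(values),
            "median": median(values),
            "top_mode": top_mode,
        }
    return profile

# Hypothetical customers with cluster IDs already assigned.
rows = [
    {"order_value": 20.0, "channel": "email"},
    {"order_value": 30.0, "channel": "email"},
    {"order_value": 200.0, "channel": "ads"},
]
profile = profile_clusters(rows, [0, 0, 1], "order_value", "channel")
print(profile)
```

Including the size in every profile makes it easy to spot segments that are too small to act on.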

Stability checks

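Re-run K-means with different random seeds. Cluster IDs are arbitrary and can be permuted between runs, so compare runs with a permutation-invariant measure such as the adjusted Rand index rather than by matching raw IDs. If assignments still swing wildly, the clusters are not robust: reduce dimensions, try another algorithm, or gather more data. This mirrors cross-validation thinking: a structure that only appears once is not trustworthy.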
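Comparing two runs needs a measure that ignores how clusters happen to be numbered. The adjusted Rand index (ARI) is one such measure. The sketch below computes it from pair counts, and the label lists stand in for the outputs of hypothetical runs with different seeds.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings: 1.0 means the same partition, ~0 means chance."""
    n = len(labels_a)
    # Pairs of points that share a cluster in both labelings.
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_in_a * same_in_b / comb(n, 2)
    maximum = (same_in_a + same_in_b) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both runs
        return 1.0
    return (same_in_both - expected) / (maximum - expected)

run_seed_1 = [0, 0, 0, 1, 1, 1]
run_seed_2 = [1, 1, 1, 0, 0, 0]  # same grouping, IDs swapped
run_seed_3 = [0, 1, 0, 1, 0, 1]  # a genuinely different grouping
print(adjusted_rand_index(run_seed_1, run_seed_2))
print(round(adjusted_rand_index(run_seed_1, run_seed_3), 3))
```

A stability check could average the pairwise ARI over several seeds and flag the clustering when the average is low. What counts as low is a judgment call for the dataset at hand.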

Hierarchical clustering

Agglomerative hierarchical clustering starts with each point as its own cluster and repeatedly merges the two closest clusters until one remains. The result is a dendrogram — a tree showing merge order. Cut the tree at a height to get any K without re-running the algorithm.

Linkage rules define "closest cluster":

Single linkage — minimum distance between any pair; prone to chaining elongated clusters.
Complete linkage — maximum pairwise distance; favors compact clusters.
Ward linkage — merges clusters that minimize increase in total variance; often best for Euclidean data.
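To make the linkage rules concrete, here is a naive agglomerative sketch that supports single and complete linkage only (Ward is omitted to keep it short). It recomputes every cluster-pair distance on each merge, so it is suitable only for tiny inputs. The function name and toy points are assumptions for the example.

```python
from math import dist

def agglomerative(points, k, linkage="single"):
    """Naive agglomerative clustering; returns (clusters, merge_heights)."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    combine = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merge_heights = []  # distance at which each merge happened
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance = min (single) or max (complete) pairwise distance.
                d = combine(
                    dist(points[a], points[b])
                    for a in clusters[i] for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
        merge_heights.append(d)
    return clusters, merge_heights

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
single_clusters, single_heights = agglomerative(points, 2, "single")
complete_clusters, complete_heights = agglomerative(points, 2, "complete")
print(single_clusters, single_heights)
print(complete_clusters, complete_heights)
```

The recorded merge heights are the values a dendrogram would plot on its vertical axis. Stopping the loop at k clusters corresponds to cutting the tree.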

Complexity is roughly O(n²) memory and O(n³) time for naive implementations — fine for thousands of rows, impractical for millions without approximation. Use hierarchical clustering when you need the dendrogram for exploration, have small n, or want to avoid committing to K before inspecting the tree.

DBSCAN: density-based clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely packed and marks sparse points as noise. Two hyperparameters:

eps — neighborhood radius.
min_samples — minimum points within eps to form a core point.

DBSCAN discovers clusters of arbitrary shape, does not need a cluster count up front (the count falls out of eps and min_samples), and explicitly labels outliers as noise, which is valuable for geospatial data, network intrusion, and sensor anomalies. It fails when clusters have very different densities (HDBSCAN is a common alternative) or when eps is hard to tune in high dimensions.
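The core-point and noise logic can be sketched with a brute-force neighbor search. This illustration assumes a point counts toward its own neighborhood when checking `min_samples`, which is a convention choice. The `NOISE` constant and the toy points are made up for the example.

```python
from math import dist

NOISE = -1  # label used here for points that belong to no cluster

def dbscan(points, eps, min_samples):
    """Brute-force DBSCAN; returns one label per point, NOISE for outliers."""
    n = len(points)
    # Neighborhoods include the point itself (assumed convention).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue  # already claimed, or not dense enough to start a cluster
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] is None:
                labels[j] = cluster_id
                if is_core[j]:  # only core points keep expanding the cluster
                    frontier.extend(neighbors[j])
        cluster_id += 1
    return [NOISE if lab is None else lab for lab in labels]

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))
```

The number of clusters is never passed in. It emerges from how many dense regions `eps` and `min_samples` carve out, and the isolated point ends up labeled as noise.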

The curse of dimensionality hurts all distance-based methods: as dimensionality grows, pairwise distances concentrate, so a point's nearest and farthest neighbors end up almost equally far away. Reduce dimensions with PCA or use embeddings before clustering text or images.
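The concentration effect is easy to observe on random data. The sketch below measures the relative contrast (farthest minus nearest distance, divided by nearest) from one query point to the rest. Uniform random points are an assumption chosen for simplicity, and real data with low intrinsic dimension may behave better.

```python
import random
from math import dist

def distance_contrast(n_points, n_dims, seed=0):
    """(max - min) / min of distances from one query point to all others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, rest = points[0], points[1:]
    distances = [dist(query, p) for p in rest]
    return (max(distances) - min(distances)) / min(distances)

low_d = distance_contrast(100, 2)
high_d = distance_contrast(100, 500)
print(round(low_d, 2), round(high_d, 2))
```

A contrast near zero means "nearest" and "farthest" are barely distinguishable, which is what undermines eps tuning and centroid assignment in high dimensions.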

PCA: preprocessing for clustering

Principal Component Analysis finds orthogonal axes that capture maximum variance. Projecting 50 correlated features onto the top 5–10 principal components often improves clustering by removing noise and collinearity.

Workflow:

Standardize features.
Fit PCA on training data; retain components explaining ~85–95% variance or use scree plot elbow.
Cluster in PCA space; interpret clusters by inspecting loadings (which original features drive each component).
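The workflow above can be illustrated with a small PCA built on power iteration and deflation. This is a teaching sketch in plain Python that assumes the features are already standardized or on comparable scales. A numerical library's SVD-based PCA would be the usual choice in practice. All function names and the toy rows are hypothetical.

```python
import random
from math import sqrt

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def top_eigenpair(cov, iterations=200, seed=0):
    """Power iteration for the largest eigenvalue and its unit eigenvector."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in cov]
    norm = sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return eigenvalue, v

def pca(rows, n_components):
    """Returns (projected_rows, components, explained_variance_ratios)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[a][a] for a in range(d))
    components, ratios = [], []
    for _ in range(n_components):
        eigenvalue, v = top_eigenpair(cov)
        components.append(v)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    projected = [[sum(x * w for x, w in zip(c, comp)) for comp in components]
                 for c in centered]
    return projected, components, ratios

# Two perfectly correlated columns: one component should explain everything.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
projected, components, ratios = pca(rows, n_components=1)
print([round(r, 3) for r in ratios], [round(w, 3) for w in components[0]])
```

The entries of each component are the loadings: large absolute values indicate which original features drive that axis. The sign of a component is arbitrary.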

PCA is linear — non-linear structure may need UMAP or autoencoder bottlenecks. For visualization only, t-SNE and UMAP are popular but distances in 2D plots are not faithful to high-D geometry; do not cluster on t-SNE coordinates for production decisions.

Production use cases

Customer segmentation — RFM (recency, frequency, monetary) features clustered into personas for email and pricing tests.
Document and support ticket grouping — embed text, cluster embeddings, route to specialized teams or auto-draft FAQs.
Image deduplication — perceptual hashes or CNN embeddings clustered to find near-duplicate assets.
Semi-supervised bootstrapping — cluster unlabeled data, manually label one exemplar per cluster, train a classifier.
Anomaly monitoring — DBSCAN noise flags or small distant clusters trigger alerts before a supervised fraud model retrains.

Cluster IDs are not permanent product attributes unless you version and retrain on a schedule. User behavior drifts; centroids from last quarter may mis-segment this quarter. Treat cluster assignment as a batch pipeline with documented refresh cadence and backward-compatible ID mapping when centroids shift.
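One possible way to keep IDs backward compatible after a refresh is to pair each new centroid with the closest old one. The greedy one-to-one matching below is a sketch of that idea under simple assumptions (few clusters, modest drift). The function name, the segment IDs, and the centroid values are all hypothetical.

```python
from math import dist

def map_cluster_ids(old_centroids, new_centroids):
    """Return {new_id: old_id}, pairing closest centroids greedily, one-to-one."""
    pairs = sorted(
        (dist(old, new), old_id, new_id)
        for old_id, old in old_centroids.items()
        for new_id, new in new_centroids.items()
    )
    mapping, used_old = {}, set()
    for _, old_id, new_id in pairs:
        if new_id not in mapping and old_id not in used_old:
            mapping[new_id] = old_id
            used_old.add(old_id)
    # New clusters left out of the mapping have no old counterpart
    # and would need fresh IDs.
    return mapping

# Hypothetical centroids from last quarter and from this quarter's refresh.
old_centroids = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
new_centroids = {0: (9.5, 10.2), 1: (0.3, -0.1)}
print(map_cluster_ids(old_centroids, new_centroids))
```

Logging the matched distances alongside the mapping gives a simple drift signal: a pair that matched only because nothing closer existed probably deserves a new ID rather than an inherited one.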

Failure modes

Unscaled features — one dominant column defines all clusters.
Arbitrary K without validation — pretty charts, useless segments.
High-dimensional sparse data — Euclidean distance breaks; use cosine on normalized vectors or specialized methods.
Leaking future information — clustering on post-outcome features (e.g., lifetime value including a campaign you are evaluating) contaminates analysis.
Over-interpreting noise — small clusters may be artifacts; enforce minimum cluster size for action.
Ignoring class imbalance in downstream models — rare clusters need oversampling or separate models when you later predict within segment.
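The advice about cosine on normalized vectors rests on a simple identity: for unit-length vectors, squared Euclidean distance equals 2 - 2 · cosine similarity. The sketch below checks it on made-up count vectors. The helper names are hypothetical.

```python
from math import sqrt

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Same direction, very different magnitude (e.g. a short and a long document).
short_doc = [3.0, 0.0, 4.0]
long_doc = [30.0, 0.0, 40.0]
other_doc = [0.0, 5.0, 0.0]

raw_gap = squared_euclidean(short_doc, long_doc)
unit_gap = squared_euclidean(l2_normalize(short_doc), l2_normalize(long_doc))
print(raw_gap, unit_gap, cosine_similarity(short_doc, other_doc))
```

Raw Euclidean distance treats the short and long documents as far apart even though they point the same way. After L2 normalization the gap disappears, so distance-based algorithms run on normalized vectors effectively rank by cosine.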

Production checklist

Define the business question — segmentation, exploration, or anomaly — before picking an algorithm.
Audit and engineer features; standardize or normalize numeric columns consistently in train and serve.
Try PCA or embeddings when d > 20 (a rough rule of thumb), when features are strongly correlated, or when the inputs are text or images.
Compare K-means, hierarchical (small n), and DBSCAN with silhouette and stability across seeds.
Profile each cluster with interpretable aggregates; reject clusters too small to operationalize.
Document K, hyperparameters, random seed, and training data snapshot version.
Schedule periodic re-clustering; monitor cluster size drift and silhouette over time.
If cluster ID feeds a downstream model, treat reassignment as a data migration with A/B validation.
For text/image, prefer K-means on L2-normalized embeddings (so that Euclidean distance tracks cosine similarity) over raw bag-of-words.
Validate with held-out labeled data when any labels exist; even 1% labeled rows are enough to compute normalized mutual information between cluster IDs and labels.
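For the last checklist item, normalized mutual information (NMI) can be computed on whatever labeled subset exists. The sketch below normalizes by the arithmetic mean of the two entropies, which is one of several conventions. The function names and the tiny label lists are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(cluster_ids, true_labels):
    """NMI in [0, 1], normalized by the mean of the two entropies."""
    n = len(cluster_ids)
    joint = Counter(zip(cluster_ids, true_labels))
    count_c = Counter(cluster_ids)
    count_t = Counter(true_labels)
    mutual_info = sum(
        (c / n) * log(c * n / (count_c[a] * count_t[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(cluster_ids) + entropy(true_labels)) / 2
    if mean_entropy == 0:  # both sides have a single group
        return 1.0
    return mutual_info / mean_entropy

# Hypothetical labeled subset: cluster IDs next to known labels.
aligned = normalized_mutual_information([0, 0, 1, 1], ["fraud", "fraud", "ok", "ok"])
unrelated = normalized_mutual_information([0, 1, 0, 1], ["fraud", "fraud", "ok", "ok"])
print(round(aligned, 3), round(unrelated, 3))
```

NMI ignores how clusters are numbered, so it can be tracked across re-clustering runs without first aligning IDs.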

Key takeaways

Unsupervised learning finds structure without labels — clustering is the most common entry point.
K-means is fast and interpretable but needs scaled features, a chosen K, and roughly spherical clusters.
Hierarchical clustering explores merge trees; DBSCAN handles arbitrary shapes and noise.
Evaluate with silhouette, stability, and business actionability — not inertia alone.
PCA and embeddings are preprocessing steps, not optional decoration, in high dimensions.