Guide

Topic modeling and LDA explained

Harbor Support's knowledge base had about 14,000 articles written over eight years by dozens of authors. Nobody maintained a taxonomy. When the routing team asked “which themes spike after a deploy?” there were no labels to query. A data scientist ran latent Dirichlet allocation (LDA) on article titles and bodies and, in an afternoon, surfaced twelve coherent themes, among them password resets, webhook retries, and invoice disputes. No hand-labeling was required.

That is topic modeling: unsupervised discovery of recurring themes in a document collection. LDA, introduced by Blei, Ng, and Jordan in 2003, remains the most widely taught probabilistic formulation. This guide covers the generative story behind LDA, Dirichlet priors, inference (Gibbs sampling and variational Bayes), a practical sklearn workflow, choosing the number of topics, coherence evaluation, a worked example (Harbor Support KB discovery), a method decision table, common pitfalls, and a production checklist. Topic modeling complements TF-IDF feature weighting, supervised text classification, and semantic search; it does not replace labeled training when you already have categories.

What topic modeling does

Given a corpus of documents, topic modeling assigns each document a distribution over latent topics and each topic a distribution over words. Topics are not predefined labels; they are statistical clusters of co-occurring terms. A topic might look like {refund: 0.08, chargeback: 0.06, invoice: 0.05, …} even though no human named it “billing disputes.”

Topic models assume a bag-of-words representation: word order is discarded, and each document is a multiset of tokens (often after lowercasing, stop-word removal, and stemming or lemmatization). That simplification hurts tasks needing syntax but is surprisingly effective for theme discovery across thousands of short-to-medium texts.
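To make the bag-of-words assumption concrete, here is a minimal sketch using only the standard library. The stop list and the two sample sentences are hypothetical, chosen for illustration; a real pipeline would use a larger, domain-aware stop list.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "to", "and", "of", "is", "after", "for"}

def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenize on alphanumerics, drop stop words, and count tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc_a = "Reset the password after the account is locked"
doc_b = "Account locked after the password reset"

# Word order differs, but both documents reduce to the same multiset of tokens.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

Both sentences collapse to the same counts, which is exactly the information a topic model sees.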

Topic modeling vs classification vs clustering

Supervised classification needs labeled examples per class. Use it when categories are known and stable.
Hard clustering (k-means on TF-IDF) assigns each document to one cluster. Fast but brittle when documents genuinely span multiple themes.
Topic modeling (LDA, NMF) allows soft membership: a ticket can be 60% “login errors” and 40% “MFA setup.”

The LDA generative process

LDA is a generative model: it describes how documents might have been created if an author first picked a topic mixture, then sampled words from each topic's word distribution. Inference runs the process backward — given observed words, infer likely topic mixtures.

Plate notation in plain language

Choose the number of topics K (hyperparameter you set).
For each topic k, draw a word distribution φ_k ~ Dirichlet(β) over the vocabulary.
For each document d, draw a topic mixture θ_d ~ Dirichlet(α) over K topics.
For each word position in document d:
- Draw a topic assignment z ~ Multinomial(θ_d).
- Draw a word w ~ Multinomial(φ_z).

α controls document-topic sparsity: small α pushes each document toward few dominant topics; large α spreads mass evenly. β (topic_word_prior in sklearn, eta in Gensim) controls topic-word sparsity: small β yields peaked topics with a handful of signature words; large β produces diffuse topics. Defaults often work; tune when topics look too generic or too fragmented.
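The generative steps above can be simulated in a few lines. This is a sketch under assumptions: symmetric priors, a toy eight-word vocabulary, and helper names (sample_dirichlet, generate_corpus, mean_max_theta) invented for illustration. Dirichlet draws are produced by normalizing Gamma draws, a standard construction.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    """Follow the LDA generative story with symmetric priors alpha and beta."""
    rng = random.Random(seed)
    # One word distribution phi_k per topic.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    docs, thetas = [], []
    for _ in range(n_docs):
        theta = sample_dirichlet([alpha] * n_topics, rng)  # topic mixture theta_d
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]   # topic assignment
            words.append(rng.choices(vocab, weights=phi[z])[0])  # word drawn from topic z
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi

def mean_max_theta(alpha, n_topics=3, n_draws=500, seed=0):
    """Average weight of the dominant topic; values near 1 mean sparse mixtures."""
    rng = random.Random(seed)
    draws = (sample_dirichlet([alpha] * n_topics, rng) for _ in range(n_draws))
    return sum(max(theta) for theta in draws) / n_draws

# Toy vocabulary, hypothetical.
vocab = ["password", "reset", "login", "webhook", "retry", "invoice", "refund", "billing"]
docs, thetas, phi = generate_corpus(vocab, n_topics=3, n_docs=5, doc_len=12, alpha=0.1, beta=0.1)
print(docs[0])
# Small alpha concentrates mass on one topic; large alpha spreads it out.
print(mean_max_theta(0.1), mean_max_theta(10.0))
```

Comparing the two printed averages shows the sparsity effect of α directly: the dominant topic carries far more weight when α is small.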

What you get after inference

Document-topic matrix θ (shape: num_docs × K) — how much each document belongs to each topic.
Topic-word matrix φ (shape: K × vocab_size) — top words per topic for human interpretation.

Inference: Gibbs sampling and variational Bayes

Exact inference in LDA is intractable. Two practical approximations dominate:

Collapsed Gibbs sampling

Iteratively reassigns each word's topic conditioned on all other assignments. Simple to implement, high quality on small corpora, but slow on millions of documents because each word is updated serially. Good for research prototypes and teaching.
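For teaching purposes, a collapsed Gibbs sampler fits in well under a hundred lines. The sketch below assumes documents are already encoded as lists of integer word ids and uses symmetric priors; the function names and the toy corpus are hypothetical. Each token's topic is resampled with probability proportional to (n_dk + α)(n_kw + β) / (n_k + Vβ), where the counts exclude the token being updated.

```python
import random

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """docs: list of lists of integer word ids. Returns (doc_topic_counts, topic_word_counts)."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # tokens in doc d assigned to topic k
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # times word w is assigned to topic k
    nk = [0] * n_topics                                # total tokens assigned to topic k
    z = []                                             # current topic of every token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(n_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional distribution over topics given all other assignments.
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return ndk, nkw

def estimate_theta(ndk, alpha):
    """Smoothed document-topic proportions from the final counts."""
    n_topics = len(ndk[0])
    return [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row] for row in ndk]

# Toy corpus: word ids 0-2 and 3-5 form two obvious themes.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
ndk, nkw = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
theta = estimate_theta(ndk, alpha=0.1)
print(theta)
```

The serial inner loop over every token is what makes this approach slow on large corpora.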

Variational inference (online VB)

Optimizes a factorized variational distribution to approximate the true posterior. sklearn's LatentDirichletAllocation implements variational Bayes in two modes: learning_method='batch' (the default) makes full passes over the data and suits smaller sets where that is affordable; learning_method='online' uses scalable minibatch updates suited to large corpora.
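As a rough illustration of minibatch updates in sklearn, the sketch below feeds a count matrix to partial_fit two documents at a time. The toy corpus and the batch size are assumptions for illustration; in practice the batches would stream from disk or a queue.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, hypothetical.
docs = [
    "password reset login locked account",
    "login password account reset email",
    "webhook retry endpoint payload timeout",
    "webhook endpoint retry payload",
    "invoice refund billing chargeback",
    "billing invoice refund proration",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=3,
    learning_method="online",
    total_samples=X.shape[0],  # corpus size, used to scale partial_fit updates
    random_state=0,
)

# Stream the corpus in minibatches of two documents.
batch_size = 2
for start in range(0, X.shape[0], batch_size):
    lda.partial_fit(X[start:start + batch_size])

theta = lda.transform(X)  # document-topic weights, one row per document
print(theta.shape)
```

Calling fit with learning_method='online' and a batch_size argument achieves a similar effect when the whole matrix fits in memory.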

Modern alternatives like BERTopic embed documents with transformer encoders, cluster in embedding space, then extract c-TF-IDF keywords per cluster. They often produce more semantically coherent topics on short social text but cost more compute and lose the clean generative interpretability of LDA.

sklearn workflow

A minimal pipeline chains preprocessing, vectorization, and LDA:

Clean text — lowercase, remove HTML, normalize unicode, optionally strip stop words. Keep domain terms (“2FA”, “webhook”) that generic stop lists would delete.
Vectorize — CountVectorizer with max_df (drop terms in >95% of docs) and min_df (drop ultra-rare noise). LDA expects non-negative counts; do not feed L2-normalized TF-IDF unless using a variant designed for it.
Fit LDA — LatentDirichletAllocation(n_components=K, max_iter=20, random_state=0). Inspect components_ for topic-word weights and transform() for document-topic weights.
Label topics manually — print top-10 words per topic; assign human-readable names for dashboards.

Key hyperparameters: n_components (K), doc_topic_prior (α), topic_word_prior (β), max_iter, learning_decay. Increase max_iter if log-likelihood has not plateaued; watch for overfitting on tiny corpora.

Choosing the number of topics K

There is no single correct K. Start with domain intuition (how many themes would a human analyst name?), then validate:

Topic coherence — NPMI or C_v scores measure whether top words in a topic co-occur in real documents. Higher is better. Gensim provides CoherenceModel for this; sweep K from 5 to 50 and plot coherence vs K. Look for an elbow rather than the absolute maximum, since coherence can be inflated at small K.
Human audit — have a domain expert score 20 random topics for interpretability on a 1–5 scale. Uninterpretable topics signal wrong K or bad preprocessing.
Downstream utility — if topics feed routing rules, measure precision@k on a small hand-labeled validation set rather than optimizing coherence alone.
Perplexity — derived from held-out log-likelihood; lower is better in theory, but it often keeps falling as K grows, encouraging too many topics. Use as a secondary signal, not the primary metric.
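To show what a coherence score measures, here is a simplified NPMI computation using document-level co-occurrence. This is a sketch, not Gensim's implementation: library versions typically use sliding windows and other refinements, and the handling of the two edge cases below is a convention chosen for this example. The function name and toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of top words, using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_ij = prob(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p_ij == 1.0:
            scores.append(0.0)   # pair appears in every document; treated as uninformative here
        else:
            pmi = math.log(p_ij / (prob(wi) * prob(wj)))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)

# Toy tokenized documents, hypothetical.
documents = [
    ["password", "reset", "login"],
    ["password", "reset", "locked"],
    ["invoice", "refund", "billing"],
    ["invoice", "refund", "chargeback"],
]

print(npmi_coherence(["password", "reset"], documents))    # words that always co-occur
print(npmi_coherence(["password", "invoice"], documents))  # words that never co-occur
```

In a K sweep, the same function would be applied to the top words of every topic and averaged per model.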

Worked example: Harbor Support KB discovery

Harbor Support had 14,200 internal KB articles (median 420 words). The routing team wanted proactive alerts when a new deploy increased volume in any theme.

Preprocessing — stripped code blocks and nav boilerplate; kept product names; CountVectorizer(max_df=0.90, min_df=5, ngram_range=(1,2)) yielded 18,400 features.
K sweep — trained LDA for K ∈ {8, 12, 16, 20, 24}; C_v coherence peaked at K=16 with acceptable interpretability scores from two support leads.
Topic labels — examples: Topic 3 {password, reset, login, locked, account}; Topic 11 {webhook, retry, 429, endpoint, payload}; Topic 14 {invoice, refund, chargeback, billing, proration}.
Production use — nightly job assigns each new ticket to argmax topic; Grafana panel tracks topic share vs 28-day baseline; spikes >2σ page on-call. Topics are refreshed quarterly with full refit; incremental assignment uses frozen model between refits.
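The argmax assignment and the 2σ alert rule can be sketched with the standard library. The function names and the sample numbers are hypothetical; the source does not specify how Harbor Support computed its baseline, so this assumes a plain mean and sample standard deviation over the 28 daily topic shares.

```python
from statistics import mean, stdev

def dominant_topic(theta_row):
    """Index of the topic with the largest weight for one document."""
    return max(range(len(theta_row)), key=lambda k: theta_row[k])

def is_spike(baseline_shares, today_share, n_sigma=2.0):
    """Flag today's topic share if it exceeds the baseline mean by more than n_sigma standard deviations."""
    mu = mean(baseline_shares)
    sigma = stdev(baseline_shares)
    if sigma == 0:
        return today_share > mu
    return (today_share - mu) / sigma > n_sigma

# Hypothetical 28-day baseline of one topic's daily share of tickets.
baseline = [0.10, 0.12] * 14

print(dominant_topic([0.2, 0.7, 0.1]))
print(is_spike(baseline, 0.20))
print(is_spike(baseline, 0.115))
```

A one-sided test is used because only increases in a theme's share are meant to page on-call.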

The LDA pass did not replace the supervised ticket classifier for auto-routing high-confidence cases. It provided discovery and monitoring for themes nobody had labeled yet.

Decision table: which topic method when

Situation	Prefer	Why
Large unlabeled corpus, need soft multi-topic assignments	LDA (variational)	Scalable, probabilistic, well-understood priors
Short tweets or headlines, semantic nuance matters	BERTopic or embedding clustering	Context beyond bag-of-words; higher compute cost
Need a fast baseline on TF-IDF features	NMF	Simpler optimization; topics as weighted word sets
Known fixed categories with labeled training data	Supervised classifier	Higher accuracy when labels exist and stay stable
Need one cluster per document, hard partitions	k-means on TF-IDF or embeddings	Fast; loses mixed-membership nuance
Streaming corpus, vocabulary drifts monthly	Periodic full refit + versioned topic IDs	Topic indices are not stable across refits without alignment

Common pitfalls

Stop-word stripping too aggressive — removes discriminative short tokens in technical corpora.
TF-IDF fed to standard LDA — TF-IDF weights are fractional and reweighted, not word counts, which violates LDA's multinomial count assumptions; use raw counts or a model designed for weighted input.
Too few documents — LDA on 200 tweets with K=30 yields nonsense topics; reduce K or pool more text.
Ignoring document length — long documents dominate word counts; consider trimming boilerplate or using asymmetric priors.
Assuming stable topic IDs across retrains — topic index 7 after a refit is generally not the same theme as topic 7 before; align topics by keyword overlap or optimal transport if dashboards depend on IDs.
Equating topics with actionable categories — some topics are junk (“thanks”, “hello”, “regards”); filter by coherence or manual review before automating.
Skipping multilingual handling — mixing languages in one model creates polyglot garbage topics; segment by language first.
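For the topic ID pitfall, keyword-overlap alignment can be sketched as a greedy match on Jaccard similarity between top-word lists. This is one simple option under assumptions: the function names, the 0.3 overlap threshold, and the sample topics are hypothetical, and a greedy match is not guaranteed to be globally optimal.

```python
def jaccard(a, b):
    """Jaccard similarity between two word lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align_topics(old_topics, new_topics, min_overlap=0.3):
    """Greedily map each new topic index to the most similar unused old topic index."""
    pairs = sorted(
        (
            (jaccard(old, new), old_idx, new_idx)
            for old_idx, old in enumerate(old_topics)
            for new_idx, new in enumerate(new_topics)
        ),
        reverse=True,
    )
    mapping, used_old = {}, set()
    for score, old_idx, new_idx in pairs:
        if score < min_overlap:
            break  # remaining pairs are too dissimilar to count as the same theme
        if new_idx in mapping or old_idx in used_old:
            continue
        mapping[new_idx] = old_idx
        used_old.add(old_idx)
    return mapping

# Top words per topic before and after a refit (hypothetical).
old_topics = [["password", "reset", "login"], ["invoice", "refund", "billing"]]
new_topics = [["invoice", "refund", "chargeback"], ["password", "reset", "locked"]]

print(align_topics(old_topics, new_topics))
```

New topics missing from the returned mapping have no sufficiently similar predecessor and are candidates for a fresh ID and a manual label.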

Production checklist

Define preprocessing rules and version them with the model artifact.
Sweep K with coherence and human interpretability scores on a sample.
Document top words and human labels for every production topic.
Hold out a validation slice for downstream routing or alert precision.
Schedule periodic refits; plan topic ID migration for dashboards.
Monitor for topic drift when product vocabulary changes (new feature names).
Pair unsupervised topics with a small labeled set when automating decisions.
Store count matrices or vectorizer vocab alongside LDA weights for reproducibility.

Key takeaways

LDA discovers soft thematic structure — documents mix topics; topics mix words.
Bag-of-words is a deliberate trade-off — fast and interpretable; loses word order.
Choosing K is part science, part domain judgment — coherence plus human audit beats perplexity alone.
Topic modeling complements labeled classifiers — discovery and monitoring, not always routing.
Refit policies matter — topic indices change across refits and vocabulary drifts over time, so version models and plan for ID alignment.