
TF-IDF explained

Harbor Support's routing team had 38,000 historical tickets labeled into four queues: billing, technical, account access, and general inquiry. A product manager wanted a transformer fine-tune “because that's what modern NLP uses,” but the baseline had to ship in a week on a CPU-only cluster. Engineers started with the oldest trick in the book: count how often each word appears, then down-weight words that appear everywhere.

That is TF-IDF (term frequency–inverse document frequency), a sparse vector representation that turns documents into weighted word-count profiles. Paired with a logistic regression classifier in scikit-learn, the pipeline hit 91.2% macro-F1 on held-out tickets, trained in under four minutes, and made misroutes explainable through the highest-weighted terms per class. TF-IDF does not capture negation or long-range context the way transformers do, but for short support messages, product reviews, and spam filters it remains one of the fastest paths from raw text to a production baseline.

This guide covers the math intuition, bag-of-words and n-gram extensions, sklearn TfidfVectorizer knobs, a Harbor Support router worked example, a representation decision table, common pitfalls, and a deployment checklist. For the broader NLP pipeline, see NLP fundamentals explained.

From raw text to weighted word counts

Machine learning models need numbers, not strings. The simplest representation is bag-of-words (BoW): discard word order, build a vocabulary of unique tokens across the corpus, and represent each document as a vector of counts (or binary presence flags). “Refund my invoice” might map to [refund: 1, invoice: 1, my: 1] in a vocabulary of tens of thousands of terms.
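As a rough illustration of the bag-of-words step, the sketch below builds a vocabulary and count vectors in plain Python. The three toy documents, the whitespace tokenizer, and the helper names tokenize and bow_vector are assumptions made for this example, not part of the Harbor Support pipeline.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; a real tokenizer would also handle punctuation.
    return text.lower().split()

docs = ["Refund my invoice", "Invoice number missing", "Reset my password"]

# Vocabulary: sorted unique tokens across the corpus, each mapped to a column index.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in tokenize(d)}))}

def bow_vector(text):
    # Count tokens, then place each count in the column assigned by the vocabulary.
    counts = Counter(tokenize(text))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens outside the vocabulary are dropped
            vec[vocab[tok]] = n
    return vec

first = bow_vector(docs[0])
```

Word order is gone by construction: "invoice my refund" would produce the same vector as the first document.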

Raw counts have a problem: frequent words dominate. The word “the” appears in almost every English document; “chargeback” appears almost only in billing disputes. TF-IDF rescales each term so rare, discriminative words score higher and corpus-wide filler words score close to the minimum.

Term frequency (TF)

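Term frequency measures how often term t appears in document d. The raw count works, and it is what sklearn's TfidfVectorizer uses, leaving length effects to the L2 normalization step described later. Many textbooks instead use a length-normalized variant so longer documents do not automatically inflate every TF: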

tf(t, d) = count(t, d) / total_terms(d)

Sublinear TF scaling (sublinear_tf=True in sklearn) applies 1 + log(tf) when tf > 0, dampening the impact of a word repeated twenty times versus twice — often helpful on noisy web text and ticket dumps where users paste the same error line repeatedly.
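To make the two variants concrete, here is a small sketch that computes the length-normalized TF and the sublinear form on a made-up token list. The function names and the toy ticket are assumptions for illustration; the sublinear form is applied to raw counts, which is how sublinear_tf is described above.

```python
import math
from collections import Counter

def normalized_tf(tokens):
    # count(t, d) / total_terms(d)
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def sublinear_tf(tokens):
    # 1 + ln(count) for every term that occurs at least once
    return {t: 1.0 + math.log(c) for t, c in Counter(tokens).items()}

# A ticket where the user pasted the same error line many times.
tokens = ["error"] * 20 + ["timeout"] * 2 + ["login"]

raw_ratio = 20 / 2
damped = sublinear_tf(tokens)
damped_ratio = damped["error"] / damped["timeout"]
```

With raw counts, "error" outweighs "timeout" ten to one; after sublinear scaling the gap shrinks to somewhere between two and three to one.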

Inverse document frequency (IDF)

Inverse document frequency penalizes terms that appear in many documents. If every ticket mentions “help,” it carries little routing signal. The classic smoothed formula (what sklearn uses by default) is:

idf(t) = log((1 + N) / (1 + df(t))) + 1

where N is the total number of documents, df(t) is the document frequency (how many documents contain term t at least once), and log is the natural logarithm. Rare terms get large IDF weights; ubiquitous terms approach the minimum, which is 1 under this formula.
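The smoothed formula can be checked by hand on a tiny corpus. The sketch below is a plain-Python version under the assumption that documents arrive already tokenized; the four toy tickets and the name smoothed_idf are invented for the example.

```python
import math

def smoothed_idf(docs_tokens):
    # idf(t) = ln((1 + N) / (1 + df(t))) + 1
    n_docs = len(docs_tokens)
    df = {}
    for tokens in docs_tokens:
        for t in set(tokens):  # set(): count each term once per document
            df[t] = df.get(t, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

corpus = [
    ["help", "refund", "invoice"],
    ["help", "password", "reset"],
    ["help", "chargeback"],
    ["help", "invoice", "error"],
]
idf = smoothed_idf(corpus)
```

Here "help" appears in every document and lands on the floor value, while "chargeback" (one document) outranks "invoice" (two documents).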

The TF-IDF score

Multiply the two components:

tfidf(t, d) = tf(t, d) × idf(t)

The result is a sparse vector: most entries are zero because any single document uses only a fraction of the vocabulary. Linear models (SVM, logistic regression, naive Bayes) thrive on this sparsity — training and inference stay fast even with 50,000+ features.
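One way to sanity-check the product is to compare a hand computation against TfidfVectorizer with normalization switched off. This is a sketch on four invented documents; it assumes scikit-learn's defaults (raw-count TF, smooth_idf=True) and uses norm=None so the unnormalized tf × idf value is visible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "help refund invoice",
    "help password reset",
    "help chargeback chargeback",
    "help invoice error",
]

# norm=None turns off L2 normalization so the raw tf * idf products are visible.
vec = TfidfVectorizer(norm=None)
X = vec.fit_transform(docs)  # scipy sparse matrix, one row per document

# Hand computation for "chargeback": count 2 in the third document, df = 1, N = 4.
col = vec.vocabulary_["chargeback"]
n_docs, df = len(docs), 1
expected = 2 * (np.log((1 + n_docs) / (1 + df)) + 1)
actual = X[2, col]

# Fraction of matrix cells that are zero.
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
```

Even on this toy corpus more than half the cells are zero; on a real vocabulary of tens of thousands of terms the fraction is far higher.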

N-grams, character features, and vocabulary control

Unigram BoW loses word order, which hurts phrases like “not working” versus “working fine.” N-grams extend the vocabulary to contiguous token sequences:

  • Unigrams (single tokens): refund, invoice
  • Bigrams (pairs): refund request, invoice number
  • Trigrams (triples): reset my password

sklearn's TfidfVectorizer(ngram_range=(1, 2)) is a common production setting for short text: unigrams capture topic words; bigrams capture short phrases without exploding dimensionality the way trigrams on long documents can. Character n-grams (3–5 characters) help with typos, product SKUs, and multilingual code-switching where word tokenizers break.
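The n-gram expansion itself is a sliding window. The sketch below shows word and character n-grams in plain Python; the helper names are hypothetical, and joining word n-grams with a space mirrors how scikit-learn names its n-gram features.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    # Slide a window of each size n over the token list.
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def char_ngrams(text, n_min=3, n_max=5):
    # Same idea at character level; no word tokenizer is involved.
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]

grams = word_ngrams("reset my password".split())
code_grams = char_ngrams("ERR-4021", 3, 3)
```

A three-token message yields three unigrams and two bigrams, which gives a feel for how quickly the feature count grows as n_max increases.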

Vocabulary limits and pruning

Large corpora can produce millions of unique tokens. Practical pipelines cap the vocabulary with max_features (keep the top N terms by corpus term frequency) or with min_df / max_df thresholds (for example, min_df=2 and max_df=0.95 drop terms appearing in fewer than 2 documents or in more than 95% of documents). stop_words removes language-specific function words; for text classification on short messages, removing “the” and “is” often helps TF-IDF, but for sentiment or negation-heavy tasks keep stop words and rely on n-grams instead.
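The document-frequency thresholds amount to a simple filter. Below is a plain-Python sketch of that filter, assuming min_df is an absolute count and max_df is a proportion (scikit-learn accepts either form for both); the function name and toy corpus are made up for the example.

```python
def prune_vocabulary(docs_tokens, min_df=2, max_df=0.95):
    # Keep terms whose document frequency lies between the two thresholds.
    n_docs = len(docs_tokens)
    df = {}
    for tokens in docs_tokens:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return sorted(t for t, d in df.items() if d >= min_df and d <= max_df * n_docs)

corpus = [
    ["help", "refund", "invoice"],
    ["help", "password", "reset"],
    ["help", "chargeback"],
    ["help", "invoice", "error"],
]
kept = prune_vocabulary(corpus)
everything = prune_vocabulary(corpus, min_df=1, max_df=1.0)
```

On a corpus this small the defaults are brutal: "help" is too common, most other terms are too rare, and only "invoice" survives. That is a reminder to check thresholds against actual corpus size.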

L2 normalization

sklearn applies L2 normalization to each document vector by default (norm='l2'), scaling every document to unit length. That makes TF-IDF scores comparable across short and long documents and aligns the geometry with cosine similarity — the default metric for nearest-neighbor retrieval on sparse vectors.
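The link to cosine similarity follows from the arithmetic: once both vectors have unit length, their dot product is the cosine of the angle between them. A minimal sketch with dict-based sparse vectors and invented weights:

```python
import math

def l2_normalize(vec):
    # Divide every weight by the vector's Euclidean length.
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)

def dot(a, b):
    # Sparse dot product: only terms present in both vectors contribute.
    return sum(v * b.get(t, 0.0) for t, v in a.items())

short_doc = {"refund": 1.9, "invoice": 1.5}
long_doc = {"refund": 19.0, "invoice": 15.0}  # same proportions, 10x the magnitude
unrelated = {"password": 2.0}

a, b, c = l2_normalize(short_doc), l2_normalize(long_doc), l2_normalize(unrelated)
similarity_same_topic = dot(a, b)
similarity_unrelated = dot(a, c)
```

The short and long documents come out with similarity of about 1 despite the tenfold difference in raw weights, and the document with no shared terms scores 0.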

sklearn TfidfVectorizer in practice

The TfidfVectorizer class in scikit-learn wraps tokenization, vocabulary building, and matrix construction into one estimator that fits into Pipeline objects:

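from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Vectorizer and classifier live in one estimator, so fit() and predict() accept raw strings.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),       # unigrams and bigrams
        min_df=2,                 # drop terms seen in fewer than 2 documents
        max_df=0.95,              # drop terms seen in more than 95% of documents
        sublinear_tf=True,        # replace tf with 1 + log(tf)
        strip_accents="unicode",  # fold accented characters to their base form
    )),
    # class_weight="balanced" reweights classes inversely to their frequency.
    ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
])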

Key parameters teams tune in production:

  • analyzer: 'word' (default), 'char', or 'char_wb' for character n-grams bounded to word edges.
  • token_pattern: regex for word tokens; widen it to keep hyphenated product codes.
  • dtype: float32 halves memory on large corpora with negligible accuracy loss.
  • vocabulary: pass a fixed dict at inference so training and serving use identical feature indices.

The output is a scipy sparse matrix (CSR format) — store the fitted vectorizer alongside the classifier with joblib and version both artifacts. Never refit IDF statistics on live traffic without a controlled retrain; drifting vocabulary shifts score distributions silently.
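A minimal sketch of the persistence step, assuming the vectorizer and classifier are saved together as one Pipeline. The toy training data and the artifact file name router-v1.joblib are hypothetical; a real deployment would write to versioned storage rather than a temp directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["refund my invoice", "invoice charge dispute",
         "reset my password", "cannot login password"]
labels = ["billing", "billing", "account", "account"]

clf = Pipeline([("tfidf", TfidfVectorizer()), ("lr", LogisticRegression(max_iter=1000))])
clf.fit(texts, labels)

# Saving the whole pipeline keeps vocabulary, IDF weights, and coefficients in sync.
path = os.path.join(tempfile.mkdtemp(), "router-v1.joblib")  # hypothetical artifact name
joblib.dump(clf, path)
restored = joblib.load(path)

sample = ["please refund this invoice"]
same_output = np.allclose(clf.predict_proba(sample), restored.predict_proba(sample))
same_vocab = restored.named_steps["tfidf"].vocabulary_ == clf.named_steps["tfidf"].vocabulary_
```

Because the restored pipeline only calls transform at inference, IDF statistics stay frozen until the next deliberate retrain.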

Worked example: Harbor Support ticket router

Harbor Support exported 38,000 labeled tickets (median length 42 words). The team's goal: auto-route 70% of incoming volume with 95%+ precision per queue, escalating uncertain tickets to humans.

Preprocessing choices

  • Lowercased text; kept punctuation in error codes like ERR-4021.
  • ngram_range=(1, 2), min_df=3, max_df=0.92 — dropped terms in fewer than 3 tickets or in more than 92% of tickets.
  • English stop_words removed; bigrams like two factor still captured 2FA intent.
  • class_weight='balanced' on logistic regression because the general-inquiry class was 2.3x larger than billing.

Results and interpretability

Held-out macro-F1 was 91.2%. A DistilBERT fine-tune reached 93.1% but needed GPU training and ran inference 12x slower. Top weighted features per class were inspectable: billing leaned on invoice, charge, refund; technical on error, crash, api timeout; account access on password, login, two factor. Misroutes clustered on multi-intent messages (“can't log in to pay my invoice”). The team addressed these with a secondary rule: if the top two class probabilities are within 0.15 of each other, escalate to human review regardless of argmax.
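The escalation rule is a margin check on the two largest probabilities. The sketch below is one possible reading of it: the function name route, the "human_review" label, and the choice to treat a gap of exactly 0.15 as "within" are assumptions, and the probabilities are invented.

```python
def route(probabilities, margin=0.15):
    # Rank classes by probability, highest first.
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_p), (_, second_p) = ranked[0], ranked[1]
    # Escalate when the top two classes are too close to call.
    if top_p - second_p <= margin:
        return "human_review"
    return top_label

multi_intent = {"billing": 0.46, "account_access": 0.41, "technical": 0.08, "general": 0.05}
clear_case = {"billing": 0.80, "account_access": 0.10, "technical": 0.06, "general": 0.04}

decision_multi = route(multi_intent)
decision_clear = route(clear_case)
```

The multi-intent message is escalated even though billing is the argmax, while the clear case is routed automatically.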

Deployment pattern

The vectorizer vocabulary was frozen at 24,800 features. Inference transforms incoming ticket text into a sparse row, multiplies it by the coefficient matrix, and applies softmax to get probabilities. p99 latency stayed under 8 ms on a single core, which suits synchronous webhook routing at 400 tickets per minute.
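The inference arithmetic is small enough to write out. This sketch assumes a sparse row stored as {feature_index: weight} and adds a per-class intercept term, as a fitted logistic regression would have; all numbers are toy values, not Harbor Support's coefficients.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(sparse_row, coef, intercept):
    # One score per class: intercept plus the dot product over non-zero features only.
    scores = [
        b + sum(w[j] * x for j, x in sparse_row.items())
        for w, b in zip(coef, intercept)
    ]
    return softmax(scores)

coef = [[2.0, 0.0, -1.0],   # class 0 weights
        [-1.0, 1.5, 0.5]]   # class 1 weights
intercept = [0.0, 0.0]
row = {0: 0.8, 2: 0.6}      # only two of three features are non-zero

probs = predict_proba(row, coef, intercept)
```

The cost per ticket scales with the number of non-zero features times the number of classes, not with the full vocabulary size, which is why latency stays low.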

TF-IDF vs dense embeddings

Scenario | TF-IDF + linear model | Better alternative
Short labeled text (support, spam, tags) | Fast baseline, interpretable weights | Fine-tuned transformer if budget allows
Semantic similarity (“car” vs “automobile”) | No shared dimensions; treats as unrelated | Word embeddings or LLM embeddings
Negation and long syntax | Weak unless bigrams capture phrases | Transformer encoders
Keyword-heavy retrieval on millions of docs | BM25 / TF-IDF with inverted indexes | Semantic search for paraphrase
Tiny labeled set (< 200 per class) | Overfits rare n-grams | Zero-shot LLM or transfer learning
Multilingual without translation | Separate vocab per language | Multilingual sentence encoders

Many production systems use hybrid retrieval: TF-IDF or BM25 for exact keyword hits, dense vectors for paraphrase — then merge ranked lists. Do not retire TF-IDF because transformers exist; benchmark both on your corpus and latency budget.
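The text does not say how the ranked lists are merged. One widely used option is reciprocal rank fusion, sketched below as an assumption rather than as the method any particular system uses; the constant k=60 is a conventional default, and the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each list contributes 1 / (k + rank) for every document it contains.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse_hits = ["doc3", "doc1", "doc7"]  # e.g. from TF-IDF or BM25
dense_hits = ["doc1", "doc9", "doc3"]   # e.g. from an embedding index

merged = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Documents that appear in both lists rise to the top, and the fusion needs only ranks, so the sparse and dense scores never have to be put on a common scale.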

Common pitfalls

  • Data leakage in IDF — fitting TfidfVectorizer on the full dataset including test rows inflates scores. Fit only on training folds inside cross-validation pipelines.
  • Vocabulary drift — new product names unseen at train time get silently dropped at inference. Monitor out-of-vocabulary rate and schedule retrains.
  • Trigram explosion: ngram_range=(1, 3) on long documents creates millions of features and overfits. Cap with max_features.
  • Ignoring class imbalance — raw accuracy hides majority-class bias. Use class_weight, macro-F1, or stratified sampling.
  • Comparing TF-IDF to transformers unfairly — tune the linear baseline (n-grams, min_df, regularization) before declaring transformers necessary.
  • Storing dense matrices — calling .toarray() on a million-row corpus blows RAM. Keep CSR sparse format end-to-end.
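The IDF leakage pitfall is avoided by cross-validating the whole pipeline rather than a pre-computed matrix, so the vectorizer is refit on each training fold. A sketch on a made-up twelve-ticket dataset, using macro-F1 to match the metric used elsewhere in this guide:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

texts = [
    "refund my invoice", "invoice charge wrong", "charge refund please",
    "billing invoice late", "refund the charge", "invoice number missing",
    "reset my password", "cannot login account", "password login failed",
    "account locked login", "two factor login", "password reset link",
]
labels = ["billing"] * 6 + ["account"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# cross_val_score clones and refits the full pipeline per fold, so IDF
# statistics are computed from training rows only.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
```

The leaky alternative would be to call fit_transform on all twelve texts first and then cross-validate only the classifier on the resulting matrix.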

Production checklist

  • Establish a TF-IDF + linear baseline before GPU fine-tuning.
  • Fit vectorizer inside Pipeline with cross-validation to prevent leakage.
  • Log vocabulary size, sparsity, and top features per class for auditability.
  • Freeze vocabulary and IDF weights; version artifacts with training date.
  • Set confidence thresholds for human escalation on low-margin predictions.
  • Track macro-F1 and per-class recall on a weekly labeled sample.
  • Plan hybrid upgrade path to embeddings when paraphrase errors dominate error analysis.
  • Benchmark p95 inference latency under expected peak QPS.

Key takeaways

  • TF-IDF weights word importance — frequent-in-document, rare-in-corpus terms score highest.
  • Sparsity is a feature — linear models on TF-IDF are fast, cheap, and interpretable.
  • N-grams recover short phrases — bigrams are the usual sweet spot for tickets and reviews.
  • It is a baseline, not a ceiling — beat it fairly before jumping to transformers.
  • Hybrid stacks win retrieval — combine sparse keyword signals with dense semantic search.
