TF-IDF explained
Harbor Support's routing team had 38,000 historical tickets labeled into four
queues — billing, technical, account access, and general inquiry. A product
manager wanted a transformer fine-tune “because that's what modern NLP
uses,” but the baseline had to ship in a week on a CPU-only cluster.
Engineers started with the oldest trick in the book: count how often each word
appears, then down-weight words that appear everywhere. That is
TF-IDF (term frequency–inverse document frequency), a sparse vector
representation that turns documents into weighted word-count profiles. Paired
with a linear logistic regression classifier in scikit-learn, the pipeline hit
91.2% macro-F1 on held-out tickets, trained in under four minutes, and explained
misroutes via the highest-weighted terms per class. TF-IDF does not understand
negation or long-range context the way transformers do, but for short support
messages, product reviews, and spam filters it remains the fastest path from raw
text to a production baseline. This guide covers the math intuition,
bag-of-words and n-gram extensions, sklearn TfidfVectorizer knobs, a Harbor
Support router worked example, a representation decision table, common pitfalls,
and a deployment checklist. For the broader NLP pipeline, see NLP fundamentals
explained.
From raw text to weighted word counts
Machine learning models need numbers, not strings. The simplest representation
is bag-of-words (BoW): discard word order, build a vocabulary
of unique tokens across the corpus, and represent each document as a vector of
counts (or binary presence flags). “Refund my invoice” might map to
[refund: 1, invoice: 1, my: 1] in a vocabulary of tens of
thousands of terms.
Raw counts have a problem: frequent words dominate. The word “the” appears in almost every English document; “chargeback” appears almost only in billing disputes. TF-IDF rescales each term so rare, discriminative words score higher and corpus-wide filler words score near zero.
Term frequency (TF)
Term frequency measures how often term t appears in document d. The raw count works, and it is what sklearn uses (document length is handled later by L2 normalization). Many textbooks instead divide by document length so longer documents do not automatically inflate every TF:
tf(t, d) = count(t, d) / total_terms(d)
Sublinear TF scaling (sublinear_tf=True in
sklearn) applies 1 + log(tf) when tf > 0, dampening the
impact of a word repeated twenty times versus twice — often helpful on
noisy web text and ticket dumps where users paste the same error line repeatedly.
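The two TF variants can be compared in a few lines of plain Python. This is a minimal sketch with hypothetical helper names (`term_frequencies`, `sublinear_counts`) and an invented token list; the sublinear version is applied to raw counts, which is how scikit-learn applies `sublinear_tf`.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Length-normalized TF: count(t, d) / total_terms(d)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

def sublinear_counts(tokens):
    """Sublinear scaling on raw counts: 1 + ln(count) for every term present."""
    return {term: 1.0 + math.log(n) for term, n in Counter(tokens).items()}

# A pasted error line repeated twenty times next to a word used twice.
tokens = ["timeout"] * 20 + ["refund"] * 2
tf = term_frequencies(tokens)
damped = sublinear_counts(tokens)

print(tf["timeout"] / tf["refund"])          # ratio of normalized TFs
print(damped["timeout"] / damped["refund"])  # ratio after sublinear scaling
```

The repeated token still scores higher after damping, but by a much smaller factor than the raw tenfold gap.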
Inverse document frequency (IDF)
Inverse document frequency penalizes terms that appear in many documents. If every ticket mentions “help,” it carries little routing signal. The classic smoothed formula (what sklearn uses by default) is:
idf(t) = log((1 + N) / (1 + df(t))) + 1
where N is the total number of documents, df(t) is the document frequency (how many documents contain term t at least once), and log is the natural logarithm. Rare terms get large IDF weights; a term that appears in every document gets the minimum weight of 1.
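To make the formula concrete, here is a small pure-Python sketch over four invented, pre-tokenized tickets. The helper name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(docs):
    """idf(t) = ln((1 + N) / (1 + df(t))) + 1 over tokenized documents."""
    n_docs = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):   # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

docs = [
    ["help", "refund", "invoice"],
    ["help", "reset", "password"],
    ["help", "invoice", "overdue"],
    ["help", "chargeback"],
]
idf = smoothed_idf(docs)
print(idf["help"], idf["invoice"], idf["chargeback"])
```

"help" appears in all four tickets and lands on the floor value, while "chargeback" appears once and gets the largest weight.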
The TF-IDF score
Multiply the two components:
tfidf(t, d) = tf(t, d) × idf(t)
The result is a sparse vector: most entries are zero because any single document uses only a fraction of the vocabulary. Linear models (SVM, logistic regression, naive Bayes) thrive on this sparsity — training and inference stay fast even with 50,000+ features.
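The product can be checked against scikit-learn directly. The sketch below assumes the default `smooth_idf=True` and switches off L2 normalization (`norm=None`) so the unnormalized tf × idf value is visible; the four tickets are invented.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "help refund invoice",
    "help reset password",
    "help invoice overdue",
    "help chargeback",
]

# norm=None turns off the default L2 normalization so the raw product shows.
vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(docs)

col = vectorizer.vocabulary_["chargeback"]
expected = 1 * (math.log((1 + 4) / (1 + 1)) + 1)   # count 1, N = 4, df = 1
print(X[3, col], expected)

# Sparsity: stored non-zeros versus total cells in the matrix.
density = X.nnz / (X.shape[0] * X.shape[1])
print(density)
```

Even on this toy corpus most cells are empty; on a real vocabulary of tens of thousands of terms the density drops far lower.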
N-grams, character features, and vocabulary control
Unigram BoW loses word order, which hurts phrases like “not working” versus “working fine.” N-grams extend the vocabulary to contiguous token sequences:
- Unigrams (single tokens): refund, invoice
- Bigrams (pairs): refund_request, invoice_number
- Trigrams (triples): reset_my_password
sklearn's TfidfVectorizer(ngram_range=(1, 2)) is the most
common production setting for short text: unigrams capture topic words; bigrams
capture short phrases without exploding dimensionality the way trigrams on long
documents can. Character n-grams (3–5 characters) help with typos, product
SKUs, and multilingual code-switching where word tokenizers break.
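A quick way to see which features an n-gram setting produces is to call the vectorizer's analyzer on a single string. The sketch below uses CountVectorizer, which shares its analyzer logic with TfidfVectorizer. Note that scikit-learn joins the tokens of a word n-gram with a space rather than an underscore.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + n-gram function the vectorizer uses.
word_ngrams = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
print(word_ngrams("not working"))

# Character trigrams bounded to word edges (each word is padded with spaces).
char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).build_analyzer()
print(char_ngrams("refund"))
```

With bigrams enabled, "not working" becomes its own feature, so a linear model can weight it differently from "working" alone.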
Vocabulary limits and pruning
Real corpora produce millions of unique tokens. Practical pipelines cap the
vocabulary with max_features (keep the top N terms by corpus term frequency)
or with min_df / max_df thresholds (for example, min_df=2 and max_df=0.95 drop
terms appearing in fewer than 2 documents or in more than 95% of documents).
stop_words removes language-specific function words; for text classification
on short messages, removing “the” and “is” often helps TF-IDF, but for
sentiment or negation-heavy tasks keep stop words and rely on n-grams instead.
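The effect of the document-frequency thresholds is easy to see on a toy corpus. In this sketch (invented tickets), "help" appears in every document and "invoice" in two, while every other term appears once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "help refund invoice",
    "help reset password",
    "help invoice overdue",
    "help app crash",
]

unpruned = TfidfVectorizer().fit(docs)
pruned = TfidfVectorizer(min_df=2, max_df=0.95).fit(docs)

print(len(unpruned.vocabulary_))   # every distinct token
print(sorted(pruned.vocabulary_))  # only terms that pass both thresholds
```

On four documents the thresholds are far too aggressive; the same settings on tens of thousands of tickets mostly trim typos and boilerplate.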
L2 normalization
sklearn applies L2 normalization to each document vector by default
(norm='l2'), scaling every document to unit length. That makes TF-IDF scores
comparable across short and long documents and aligns the geometry with cosine
similarity, the usual metric for nearest-neighbor retrieval on sparse vectors:
for unit-length vectors, the dot product is the cosine similarity.
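A short check of both claims, using invented tickets of very different lengths: every row should come out with unit length, and the plain dot product between rows should match scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "refund my invoice",
    "refund my invoice please, the invoice total is wrong and I was charged twice",
    "reset my password",
]

X = TfidfVectorizer().fit_transform(docs)   # norm='l2' is the default

# Row norms: square every entry, sum per row, take the square root.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
# For unit vectors the dot product already equals cosine similarity.
dot_products = (X @ X.T).toarray()

print(row_norms)
print(dot_products.round(3))
```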
sklearn TfidfVectorizer in practice
The TfidfVectorizer class in scikit-learn wraps tokenization,
vocabulary building, and matrix construction into one estimator that fits into
Pipeline objects:
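    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 2),       # unigrams and bigrams
            min_df=2,                 # drop terms seen in fewer than 2 documents
            max_df=0.95,              # drop terms seen in more than 95% of documents
            sublinear_tf=True,        # use 1 + log(count) instead of the raw count
            strip_accents="unicode",  # fold accented characters to their base form
        )),
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])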
Key parameters teams tune in production:
- analyzer: 'word' (default), 'char', or 'char_wb' for character n-grams bounded to word edges.
- token_pattern: regex for word tokens; widen it to keep hyphenated product codes.
- dtype: float32 halves memory on large corpora (relative to the float64 default) with negligible accuracy loss.
- vocabulary: pass a fixed dict to pin feature indices. IDF weights are still learned in fit, so serving should load the fitted vectorizer rather than build a new one.
The output is a scipy sparse matrix (CSR format) — store
the fitted vectorizer alongside the classifier with joblib and
version both artifacts. Never refit IDF statistics on live traffic without a
controlled retrain; drifting vocabulary shifts score distributions silently.
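One way to keep the vectorizer and classifier in lockstep is to persist the whole fitted Pipeline as a single artifact. The sketch below is an assumption about how that might look, not a prescribed layout: the training texts are invented, and the file name `router-v1.joblib` is a hypothetical versioning scheme.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, invented for illustration.
texts = ["refund my invoice", "invoice charged twice",
         "reset my password", "cannot login password"]
labels = ["billing", "billing", "account", "account"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "router-v1.joblib"   # hypothetical versioned name
    joblib.dump(pipeline, artifact)             # vocabulary, IDF, and weights together
    restored = joblib.load(artifact)

new_ticket = ["password reset link expired"]
same_features = np.array_equal(
    pipeline["tfidf"].transform(new_ticket).toarray(),
    restored["tfidf"].transform(new_ticket).toarray(),
)
print(same_features, restored.predict(new_ticket))
```

Because the restored object carries the original vocabulary and IDF vector, serving never calls `fit` and feature indices cannot drift from the trained coefficients.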
Worked example: Harbor Support ticket router
Harbor Support exported 38,000 labeled tickets (median length 42 words). The team's goal: auto-route 70% of incoming volume with 95%+ precision per queue, escalating uncertain tickets to humans.
Preprocessing choices
- Lowercased text; kept punctuation in error codes like ERR-4021.
- ngram_range=(1, 2), min_df=3, max_df=0.92: dropped terms in fewer than 3 tickets or in more than 92% of tickets.
- English stop_words removed; bigrams like two factor still captured 2FA intent.
- class_weight='balanced' on logistic regression because the general-inquiry class was 2.3x larger than billing.
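These choices might translate into a vectorizer configuration along the following lines. Two parts are assumptions rather than details from the Harbor Support setup: the `token_pattern` regex (one way to keep hyphens inside a token) and the short custom stop list. The custom list is used because scikit-learn's built-in `stop_words="english"` list contains "two", which would remove the word before the "two factor" bigram could form.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stop list; the built-in "english" list would also drop "two".
STOP_WORDS = ["the", "is", "my", "with", "to"]

vectorizer = TfidfVectorizer(
    lowercase=True,
    # Assumed pattern: like the default, but allows hyphens inside a token.
    token_pattern=r"(?u)\b\w[\w-]*\w\b",
    ngram_range=(1, 2),
    min_df=3,
    max_df=0.92,
    stop_words=STOP_WORDS,
)

# The analyzer can be inspected before fitting to confirm tokenization.
analyze = vectorizer.build_analyzer()
features = analyze("Two factor code fails with ERR-4021")
print(features)
```

Inspecting the analyzer output on a handful of real tickets is a cheap check that error codes and key phrases survive preprocessing.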
Results and interpretability
Held-out macro-F1: 91.2% (a DistilBERT fine-tune reached 93.1% but needed GPU
training and ran inference 12x slower). Top weighted features per class were
inspectable: billing leaned on invoice, charge, refund; technical on error,
crash, api timeout; account access on password, login, two factor. Misroutes
clustered on multi-intent messages (“can't log in to pay my invoice”). The fix
was a secondary rule: if the top two class probabilities are within 0.15,
escalate to human review regardless of argmax.
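The escalation rule is small enough to sketch in full. The function name, the queue labels, and the `"human_review"` sentinel are hypothetical; the 0.15 margin is the one quoted above, treated here as inclusive.

```python
import numpy as np

def route_or_escalate(probabilities, class_names, margin=0.15):
    """Return the argmax class, or 'human_review' when the top two are too close."""
    probabilities = np.asarray(probabilities, dtype=float)
    top_two = np.sort(probabilities)[-2:]          # [second best, best]
    if top_two[1] - top_two[0] <= margin:
        return "human_review"
    return class_names[int(np.argmax(probabilities))]

queues = ["billing", "technical", "account_access", "general"]

close_call = route_or_escalate([0.48, 0.05, 0.42, 0.05], queues)
clear_win = route_or_escalate([0.85, 0.05, 0.05, 0.05], queues)
print(close_call, clear_win)
```

A multi-intent ticket that splits probability between billing and account access falls inside the margin and goes to a person instead of the argmax queue.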
Deployment pattern
The vectorizer vocabulary froze at 24,800 features. Inference: transform incoming ticket text to a sparse row, multiply by the coefficient matrix, add the intercepts, and apply softmax for probabilities. p99 latency stayed under 8 ms on a single core, suitable for synchronous webhook routing at 400 tickets per minute.
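The inference arithmetic can be reproduced by hand and compared with `predict_proba`. This sketch uses invented tickets and assumes three or more classes with a recent scikit-learn, where LogisticRegression fits a multinomial (softmax) model by default; a binary model would use a sigmoid instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "refund my invoice", "charged twice on invoice",
    "app crash error", "api timeout error",
    "reset my password", "login password rejected",
]
labels = ["billing", "billing", "technical", "technical", "account", "account"]

vectorizer = TfidfVectorizer().fit(texts)
clf = LogisticRegression(max_iter=1000).fit(vectorizer.transform(texts), labels)

row = vectorizer.transform(["password reset error"])        # 1 x n_features sparse row
scores = np.asarray(row @ clf.coef_.T + clf.intercept_)     # 1 x n_classes logits
exp = np.exp(scores - scores.max(axis=1, keepdims=True))    # numerically stable softmax
manual_proba = exp / exp.sum(axis=1, keepdims=True)

print(dict(zip(clf.classes_, manual_proba[0].round(3))))
```

The only expensive step is the sparse-by-dense product, and its cost scales with the number of non-zero features in the ticket rather than with vocabulary size.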
TF-IDF vs dense embeddings
| Scenario | TF-IDF + linear model | Better alternative |
|---|---|---|
| Short labeled text (support, spam, tags) | Fast baseline, interpretable weights | Fine-tuned transformer if budget allows |
| Semantic similarity (“car” vs “automobile”) | No shared dimensions — treats as unrelated | Word embeddings or LLM embeddings |
| Negation and long syntax | Weak unless bigrams capture phrases | Transformer encoders |
| Keyword-heavy retrieval on millions of docs | BM25 / TF-IDF with inverted indexes | Semantic search for paraphrase |
| Tiny labeled set (< 200 per class) | Overfits rare n-grams | Zero-shot LLM or transfer learning |
| Multilingual without translation | Separate vocab per language | Multilingual sentence encoders |
Many production systems use hybrid retrieval: TF-IDF or BM25 for exact keyword hits, dense vectors for paraphrase — then merge ranked lists. Do not retire TF-IDF because transformers exist; benchmark both on your corpus and latency budget.
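The text does not say how the ranked lists are merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an illustration rather than as the method any particular system uses. The document ids are invented, and `k=60` is a conventional smoothing constant.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = ["doc_7", "doc_2", "doc_9"]   # e.g. from TF-IDF or BM25
dense_hits = ["doc_2", "doc_4", "doc_7"]    # e.g. from an embedding index
merged = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(merged)
```

RRF only needs ranks, so the sparse and dense scores never have to be put on a common scale.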
Common pitfalls
- Data leakage in IDF: fitting TfidfVectorizer on the full dataset including test rows inflates scores. Fit only on training folds inside cross-validation pipelines.
- Vocabulary drift: new product names unseen at train time get silently dropped at inference. Monitor out-of-vocabulary rate and schedule retrains.
- Trigram explosion: ngram_range=(1, 3) on long documents creates millions of features and overfits. Cap with max_features.
- Ignoring class imbalance: raw accuracy hides majority-class bias. Use class_weight, macro-F1, or stratified sampling.
- Comparing TF-IDF to transformers unfairly: tune the linear baseline (n-grams, min_df, regularization) before declaring transformers necessary.
- Storing dense matrices: calling .toarray() on a million-row corpus blows RAM. Keep CSR sparse format end-to-end.
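Vocabulary drift can be tracked with a simple out-of-vocabulary rate. The helper below is a hypothetical sketch: it runs the fitted vectorizer's own analyzer over incoming text and counts the tokens the frozen vocabulary does not contain. The product name "harborpay" is invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def oov_rate(vectorizer, texts):
    """Fraction of analyzed tokens in `texts` missing from the fitted vocabulary."""
    analyze = vectorizer.build_analyzer()
    total = missing = 0
    for text in texts:
        for token in analyze(text):
            total += 1
            missing += token not in vectorizer.vocabulary_
    return missing / total if total else 0.0

train = ["refund my invoice", "reset my password", "app crash error"]
vectorizer = TfidfVectorizer().fit(train)

live = ["refund my harborpay invoice"]   # 'harborpay' is a made-up new product name
rate = oov_rate(vectorizer, live)
print(rate)
```

Logging this rate per day gives an early signal that a retrain is due, before routing quality visibly degrades.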
Production checklist
- Establish a TF-IDF + linear baseline before GPU fine-tuning.
- Fit vectorizer inside Pipeline with cross-validation to prevent leakage.
- Log vocabulary size, sparsity, and top features per class for auditability.
- Freeze vocabulary and IDF weights; version artifacts with training date.
- Set confidence thresholds for human escalation on low-margin predictions.
- Track macro-F1 and per-class recall on a weekly labeled sample.
- Plan hybrid upgrade path to embeddings when paraphrase errors dominate error analysis.
- Benchmark p95 inference latency under expected peak QPS.
Key takeaways
- TF-IDF weights word importance — frequent-in-document, rare-in-corpus terms score highest.
- Sparsity is a feature — linear models on TF-IDF are fast, cheap, and interpretable.
- N-grams recover short phrases — bigrams are the usual sweet spot for tickets and reviews.
- It is a baseline, not a ceiling — beat it fairly before jumping to transformers.
- Hybrid stacks win retrieval — combine sparse keyword signals with dense semantic search.
Related reading
- Text classification explained — single-label vs multi-label pipelines, evaluation, and when classical models still win
- NLP fundamentals explained — tokenization, parsing, and the full text-to-numbers pipeline
- Word embeddings explained — dense vectors, Word2Vec, and the path beyond sparse counts
- Cosine similarity explained — comparing sparse and dense vectors for search and recommendations