Guide
Text classification explained
Every inbox, moderation queue, and analytics dashboard eventually asks the same question: what kind of text is this? Text classification assigns one or more predefined labels to a document — spam vs ham, billing vs technical support, news topic, legal clause type, or toxicity tier. It is the workhorse of natural language processing: simpler than generation, cheaper than human review at scale, and the foundation for routing, search filters, and downstream models. Sentiment analysis is a specialized classification task (polarity labels); named entity recognition is sequence labeling, not whole-document categories. This guide covers label design, feature and model choices from rules to transformers, handling class imbalance, evaluation that matches business cost, a Harbor support routing worked example, an approach decision table, pitfalls, and a production checklist.
Problem shapes and label design
Before choosing an algorithm, nail the taxonomy. Bad labels produce models that look accurate on paper and fail in the ticket queue.
Single-label vs multi-label
Single-label (multiclass) assigns exactly one category per document: "refund request" OR "account lockout" OR "feature question." Use softmax output with mutually exclusive classes. Multi-label allows several tags simultaneously: a review can be both "shipping delay" and "damaged item." Train independent sigmoid heads per label or use binary relevance (one classifier per label).
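The difference between the two output layers can be sketched in a few lines. This is a minimal illustration, assuming hypothetical label names and hand-picked logits rather than a trained model: softmax forces classes to compete for one probability budget, while per-label sigmoids score each tag independently.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (single-label)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for one label (multi-label)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical labels and logits, for illustration only.
single_labels = ["refund request", "account lockout", "feature question"]
single_logits = [2.0, 0.5, -1.0]
probs = softmax(single_logits)
single_prediction = single_labels[probs.index(max(probs))]  # exactly one winner

multi_labels = ["shipping delay", "damaged item", "wrong size"]
multi_logits = [1.2, 0.8, -2.0]
# Each label is kept or dropped on its own; 0.5 is a placeholder threshold.
multi_prediction = [
    label for label, z in zip(multi_labels, multi_logits) if sigmoid(z) >= 0.5
]
```

In the multi-label case, zero, one, or several labels can pass the threshold, which is why thresholds are usually tuned per label.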
Hierarchical and flat taxonomies
Flat taxonomies list peer categories ("Sports", "Politics", "Technology"). Hierarchical trees route coarse-to-fine: L1 "Billing" then L2 "Invoice dispute" vs "Payment failed." Hierarchies reduce confusion when classes overlap semantically but require more labeled data at each level. Start flat with 5–15 well-separated buckets; split only when error analysis shows systematic confusion between siblings.
Label quality beats model complexity
Write a rubric with positive and negative examples per class. Measure inter-annotator agreement (Cohen's κ or Krippendorff's α). If humans disagree 30% of the time, accuracy measured against those labels will be hard to push reliably past roughly 70%, so fix definitions first. Reserve a locked gold test set that never touches hyperparameter tuning; otherwise you tune to that set's quirks and annotation noise, and reported scores overstate real performance.
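As a rough sketch of the agreement check, Cohen's κ for two annotators can be computed directly from their label lists. The annotations below are made up for illustration; κ corrects raw agreement for the agreement expected by chance given each annotator's label frequencies.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability both annotators pick the same class if they labeled independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical pilot annotations from two annotators.
annotator_a = ["billing", "billing", "shipping", "shipping", "returns",
               "returns", "billing", "shipping", "returns", "billing"]
annotator_b = ["billing", "returns", "shipping", "shipping", "returns",
               "billing", "billing", "shipping", "returns", "billing"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here raw agreement is 80%, but κ is lower because some of that agreement would occur by chance.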
Approaches from rules to large language models
Rule-based and keyword systems
Regular expressions, keyword lists, and boolean logic ship in hours. A billing classifier might flag messages containing "invoice", "charge", or "refund" unless negated by "not about billing." Rules are explainable, deterministic, and free at inference — ideal for regulated domains and high-precision gates ("if contains SSN pattern, route to secure queue"). They degrade on paraphrase and multilingual input. Hybrid pattern: rules handle obvious cases; ML catches the long tail.
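A minimal sketch of such a rule layer, using the keywords from the paragraph above. The function name, label strings, and the simplified SSN pattern are assumptions for illustration; production SSN detection needs stricter validation.

```python
import re

BILLING_TERMS = re.compile(r"\b(invoice|charge|refund)\b", re.IGNORECASE)
BILLING_NEGATION = re.compile(r"\bnot about billing\b", re.IGNORECASE)
# Simplified SSN shape (ddd-dd-dddd); an assumption, not a complete validator.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def rule_label(text):
    """Return a label when a rule fires, else None so an ML model can decide."""
    if SSN_PATTERN.search(text):
        return "secure_queue"  # high-precision gate is checked first
    if BILLING_TERMS.search(text) and not BILLING_NEGATION.search(text):
        return "billing"
    return None
```

Returning `None` instead of a default label is what makes the hybrid pattern possible: anything the rules do not claim falls through to the model.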
Classical machine learning
Tokenize text, build sparse features (bag-of-words, character n-grams, TF-IDF), and train linear models such as logistic regression, linear SVM, or naive Bayes. These pipelines train in seconds on CPU, can be interpreted via top-weighted terms, and often reach 85–92% accuracy on medium-sized corpora with clean labels. They struggle with word order and negation ("not good" vs "good"), long documents where a few decisive phrases are diluted by boilerplate, and cross-language generalization without translation.
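To make the feature step concrete, here is a small pure-Python sketch of TF-IDF weighting. It assumes one common smoothed IDF variant, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, and a toy three-document corpus; a real pipeline would normally use a library vectorizer and feed the vectors to a linear model.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_idf(documents):
    """Smoothed inverse document frequency per term seen in training."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse, L2-normalized TF-IDF vector; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy corpus, for illustration only.
corpus = ["refund for my invoice", "where is my package", "my package arrived damaged"]
idf = fit_idf(corpus)
vec = tfidf_vector("refund my invoice", idf)
```

The word "my" appears in every document, so it receives a lower weight than "refund" or "invoice", which is the behavior that lets a linear model focus on discriminative terms.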
Neural encoders and transformers
Fine-tune pretrained encoders (BERT, RoBERTa, DistilBERT) by adding a classification head on the [CLS] token or mean-pooled token embeddings. Transformers capture context, handle negation, and transfer across domains with hundreds to a few thousand labeled examples per class. Cost: GPU training, slower inference, larger artifacts. Distilled models (TinyBERT, MobileBERT) trade a few points of F1 for 5–10x speedup, which is often the right production compromise.
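The two pooling options can be illustrated without any deep learning library. The sketch below assumes the encoder output is already available as a list of per-token vectors plus an attention mask (1 for real tokens, 0 for padding); the numbers are invented, and a real classification head would apply a learned linear layer to the pooled vector.

```python
def cls_pool(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the document vector."""
    return list(token_embeddings[0])

def mean_pool(token_embeddings, attention_mask):
    """Average the vectors of real tokens, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical 2-dimensional encoder outputs; the last position is padding.
embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
cls_vector = cls_pool(embeddings)
mean_vector = mean_pool(embeddings, mask)
```

Ignoring the mask would let padding vectors leak into the average, which is a common source of batch-size-dependent predictions.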
LLM zero-shot and few-shot classification
Large language models classify via prompting: "Choose one label from [A, B, C] for this message." Zero-shot needs no training data; quality depends on prompt clarity and model capability. Few-shot examples in the prompt improve edge cases. LLMs excel when taxonomy changes weekly or labeled data is scarce. Downsides: latency, token cost, inconsistent JSON formatting, and harder audit trails. For high-volume routing at cents per thousand messages, fine-tuned small models usually win on unit economics.
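One way to contain the formatting inconsistency is to keep prompt construction and reply parsing as plain, testable functions. The sketch below makes no model call; the taxonomy, prompt wording, and the `manual_review` fallback are assumptions for illustration, and the raw reply string would come from whichever LLM client you use.

```python
import json

LABELS = ["billing", "shipping", "returns", "product"]  # hypothetical taxonomy

def build_prompt(message, examples=()):
    """Zero-shot prompt by default; pass (text, label) pairs for few-shot."""
    lines = [
        "Classify the customer message into exactly one label.",
        "Allowed labels: " + ", ".join(LABELS),
        'Reply with JSON only, for example {"label": "billing"}.',
    ]
    for text, label in examples:
        lines.append(f"Message: {text}")
        lines.append(json.dumps({"label": label}))
    lines.append(f"Message: {message}")
    return "\n".join(lines)

def parse_label(raw_reply, fallback="manual_review"):
    """Extract a valid label from a possibly chatty reply, else fall back."""
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        payload = json.loads(raw_reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    label = payload.get("label") if isinstance(payload, dict) else None
    if not isinstance(label, str):
        return fallback
    label = label.strip().lower()
    # Reject labels outside the taxonomy instead of trusting the model.
    return label if label in LABELS else fallback
```

Logging both the raw reply and the parsed label helps with the audit-trail concern, since the fallback path otherwise hides what the model actually said.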
Features, preprocessing, and representation
Text is messy raw input. Standard preprocessing steps:
- Normalization — lowercase (language-dependent), Unicode NFC normalization, strip control characters, expand contractions if your corpus uses them consistently.
- Tokenization — word splits for classical ML; subword (WordPiece, BPE) for transformers. Do not stem aggressively for neural models; subwords already handle morphology.
- Stop words — removing "the" and "is" helps TF-IDF on short news headlines but hurts transformer fine-tuning, because pretrained encoders rely on function words for context.
- Document length — truncate or chunk long PDFs. For transformers, head+tail concatenation (for example, roughly the first 256 and last 256 tokens, less a few positions for special tokens under a 512-token limit) preserves intro and conclusion signals on 10-page contracts.
- Metadata features — sender domain, channel (email vs chat), attachment flag, user tenure. Concatenate as numeric features to linear models or as extra embeddings in neural pipelines.
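The head+tail truncation step from the list above can be sketched as a small helper. It operates on an already tokenized sequence; the default sizes mirror the example in the text, and callers are assumed to reserve room for special tokens themselves.

```python
def head_tail_truncate(tokens, head=256, tail=256):
    """Keep the first `head` and last `tail` tokens of a long sequence.

    Sequences that already fit are returned unchanged. Special tokens such as
    [CLS] and [SEP] are not accounted for here; reduce head/tail accordingly.
    """
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    if len(tokens) <= head + tail:
        return list(tokens)
    tail_part = list(tokens[-tail:]) if tail else []  # tokens[-0:] would keep everything
    return list(tokens[:head]) + tail_part

# Integers stand in for token IDs.
long_doc = list(range(1000))
truncated = head_tail_truncate(long_doc, head=3, tail=2)
```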
For imbalanced routing, consider class-weighted loss or oversampling minority classes — see our class imbalance guide for when SMOTE helps vs hurts text data.
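For class-weighted loss, a widely used heuristic sets each class weight to n_samples / (n_classes × class_count), so rare classes contribute more per example. The sketch below computes those weights for a made-up 95/5 split; how the weights are passed to a model depends on the library.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical imbalanced training labels.
train_labels = ["general inquiry"] * 95 + ["security incident"] * 5
weights = balanced_class_weights(train_labels)
```

With this split, each "security incident" example counts about 19 times as much as a "general inquiry" example in the loss.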
Evaluation metrics that match the job
Accuracy misleads when 95% of tickets are "general inquiry" and 5% are "security incident." Pick metrics aligned with mistake cost:
- Precision — of predicted positives, how many are correct? High precision matters when false alarms waste senior engineer time.
- Recall — of true positives, how many did we catch? High recall matters for abuse, fraud, and safety categories you cannot miss.
- F1 — harmonic mean of precision and recall, for when you need balance; macro-F1 averages per-class F1 equally (fair to rare classes); micro-F1 pools true positives, false positives, and false negatives across classes (dominated by frequent classes).
- ROC-AUC and PR-AUC — threshold-independent ranking quality; PR-AUC is more informative under heavy imbalance.
- Calibration — predicted probabilities should match observed frequencies if you route by confidence tiers ("auto-close if p > 0.95").
Report per-class confusion matrices, not just aggregate scores. Error analysis — reading 50 misclassified examples — beats tweaking hyperparameters blindly.
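The gap between accuracy, micro-F1, and macro-F1 is easy to see on the 95/5 example. The sketch below is a minimal single-label implementation with invented predictions from a majority-class model.

```python
def per_class_counts(y_true, y_pred):
    """True positives, false positives, and false negatives for every class."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts):
    """Unweighted mean of per-class F1."""
    return sum(f1(**c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """F1 over counts pooled across all classes."""
    return f1(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )

# A classifier that always predicts the majority class.
y_true = ["general inquiry"] * 95 + ["security incident"] * 5
y_pred = ["general inquiry"] * 100
counts = per_class_counts(y_true, y_pred)
macro, micro = macro_f1(counts), micro_f1(counts)
```

Micro-F1 matches the 95% accuracy, while macro-F1 falls below 0.5 because the model never finds a single security incident.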
Worked example: Harbor support ticket routing
Harbor Commerce runs a shared inbox for 40,000 monthly messages across billing, shipping, returns, and product questions. Human triage averaged 4.2 hours to first response. Goal: auto-route 70% of tickets with 92%+ precision so specialists see only ambiguous or high-risk cases.
Phase 1 — baseline
The team exported 18 months of labeled tickets (42,000 rows, four classes). A TF-IDF + logistic regression model trained in scikit-learn reached 88.4% accuracy and 0.86 macro-F1 in five minutes on a laptop. Billing and shipping were easy; "returns" was confused with "billing" when customers mentioned "refund" for defective goods.
Phase 2 — disambiguation rules + transformer
The product team added 12 hand-written rules for order-ID patterns and "where is my package" phrases, catching 22% of volume at 99.1% precision. Remaining traffic went to a fine-tuned DistilBERT (six epochs, 8k labeled examples). Combined pipeline: rules first, model second; predictions below a 0.82 confidence threshold go to a "manual review" bucket.
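The combined pipeline can be sketched as a single routing function. Only the 0.82 threshold comes from the example; the single rule shown, the `model_predict` callable, and the stub model are hypothetical stand-ins for Harbor's 12 rules and fine-tuned DistilBERT.

```python
import re

CONFIDENCE_THRESHOLD = 0.82  # value from the worked example
# One illustrative rule; the real rule set is not shown in this guide.
ORDER_STATUS = re.compile(r"where is my (package|order)", re.IGNORECASE)

def apply_rules(text):
    """Return a label if a high-precision rule matches, else None."""
    if ORDER_STATUS.search(text):
        return "shipping"
    return None

def route(text, model_predict):
    """Rules first, model second, manual review when confidence is low.

    `model_predict` is a hypothetical callable returning (label, confidence).
    """
    rule_hit = apply_rules(text)
    if rule_hit is not None:
        return rule_hit, "rule"
    label, confidence = model_predict(text)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "model"
    return "manual_review", "fallback"

def stub_model(text):
    """Stand-in for the fine-tuned classifier, for demonstration only."""
    if "invoice" in text.lower():
        return "billing", 0.91
    return "returns", 0.55
```

Returning the decision source alongside the label makes it possible to report precision separately for rule-routed and model-routed traffic.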
Results and monitoring
Live routing handled 71% of tickets automatically; precision on auto-routed traffic was 93.6%. Weekly drift checks compared predicted label distribution to historical baselines — a spike in "billing" often preceded a payment processor outage. Misroutes fed an active learning queue for re-labeling and quarterly retraining.
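A weekly drift check of this kind can be as simple as comparing two label distributions. The sketch below uses total variation distance as the comparison, which is one reasonable choice rather than the method Harbor is stated to use; the baseline shares, the weekly predictions, and the 0.10 alert threshold are invented.

```python
from collections import Counter

def label_distribution(labels):
    """Share of predictions per label."""
    counts = Counter(labels)
    total = len(labels)
    return {c: k / total for c, k in counts.items()}

def total_variation_distance(p, q):
    """Half the L1 distance between two distributions (0 = same, 1 = disjoint)."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)

# Hypothetical historical baseline and this week's predicted labels.
baseline = {"billing": 0.25, "shipping": 0.35, "returns": 0.20, "product": 0.20}
this_week = label_distribution(
    ["billing"] * 50 + ["shipping"] * 25 + ["returns"] * 15 + ["product"] * 10
)
drift = total_variation_distance(baseline, this_week)
ALERT_THRESHOLD = 0.10  # assumption; tune on historical week-to-week variation
should_alert = drift > ALERT_THRESHOLD
```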
Approach decision table
| Your situation | Reasonable starting approach | Watch for |
|---|---|---|
| < 500 labeled examples total, taxonomy stable | Keyword rules + LLM few-shot with human review | Prompt drift when model version changes |
| 5k–50k labeled examples, latency < 50 ms, CPU only | TF-IDF + linear SVM or logistic regression | Long documents; add chunking or summaries |
| 1k+ labeled examples per class, accuracy is bottleneck | Fine-tuned DistilBERT or similar small transformer | GPU cost; distill or quantize for edge deploy |
| Multi-label, 20+ tags, sparse positives | Binary relevance per label + per-label thresholds | Macro-F1 collapse; tune thresholds per class |
| Taxonomy changes every sprint | LLM zero-shot with structured JSON output | Token spend; cache predictions if texts repeat |
| Regulated audit trail required | Rules + interpretable linear model with logged features | Neural "black box" may fail compliance review |
Production concerns
A notebook F1 score is not a product. Before launch:
- Version everything — training data snapshot, tokenizer, model weights, label schema version, and inference code hash.
- Shadow mode — run the new model alongside humans without acting on predictions until precision stabilizes for two weeks.
- Fallback path — low-confidence predictions route to human review, not the wrong team.
- Latency SLOs — batch overnight jobs tolerate seconds; chat widgets need sub-200 ms p99.
- PII handling — redact emails and card numbers before logging misclassified examples for retraining.
- Multilingual — train per language, use multilingual encoders (XLM-R, mBERT), or detect language upstream and branch pipelines.
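The PII handling item above can be sketched with two regular expressions. The patterns are rough assumptions (a permissive email shape and any 13 to 19 digit run with optional separators) and will both miss and over-match some cases; treat this as a starting point, not a compliance control.

```python
import re

# Permissive email shape; an illustrative pattern, not a full address grammar.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def redact(text):
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)

sample = "Contact jane.doe@example.com about card 4111 1111 1111 1111 please"
redacted = redact(sample)
```

Redacting before the example is written to the retraining log means raw identifiers never reach the annotation tooling.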
Track concept drift when product launches new features that change how customers write ("staking" meant nothing until you shipped a wallet).
Common pitfalls
- Leaking labels through metadata — if every "urgent" ticket came from one form ID, the model learns form ID, not text. Hold out entire forms or time periods in validation.
- Training on auto-labeled data — using old rule outputs as ground truth teaches the model to copy buggy rules.
- Optimizing accuracy on imbalanced data — a majority-class classifier looks great while missing every fraud case.
- Ignoring label hierarchy collisions — "cancel order" spans billing and shipping; ambiguous examples need multi-label or a dedicated "cross-team" class.
- Deploying without confidence thresholds — forcing a guess on every input creates silent misroutes worse than admitting uncertainty.
- Skipping adversarial review — users game classifiers ("this is not spam" in subject lines); red-team before abuse-heavy launches.
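For the metadata leakage pitfall, holding out entire groups is straightforward to implement. The sketch below assumes each row is a dict with a hypothetical `form_id` field; the same function works for a user ID or a time bucket.

```python
def group_holdout_split(rows, group_key, holdout_groups):
    """Put every row from the held-out groups in validation, the rest in training."""
    holdout = set(holdout_groups)
    train = [r for r in rows if r[group_key] not in holdout]
    valid = [r for r in rows if r[group_key] in holdout]
    return train, valid

# Hypothetical tickets; form C is held out entirely.
rows = [
    {"form_id": "A", "text": "card declined", "label": "billing"},
    {"form_id": "A", "text": "double charge", "label": "billing"},
    {"form_id": "B", "text": "package lost", "label": "shipping"},
    {"form_id": "C", "text": "site is down", "label": "urgent"},
    {"form_id": "C", "text": "cannot log in", "label": "urgent"},
]
train_rows, valid_rows = group_holdout_split(rows, "form_id", ["C"])
```

If validation scores drop sharply under this split compared with a random split, the model was probably leaning on the group identifier rather than the text.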
Practitioner checklist
- Define 5–15 labels that annotators and stakeholders all understand, with a written rubric and examples.
- Measure human agreement on a 200-example pilot before scaling annotation.
- Split train/validation/test by time or user, not random rows, when drift exists.
- Ship TF-IDF + linear baseline before investing in GPU fine-tuning.
- Report per-class precision, recall, and confusion matrix on locked test set.
- Set per-class confidence thresholds from validation, not default 0.5.
- Log model version, input hash, prediction, and score on every inference.
- Run shadow deployment; compare auto-route rate and human override rate weekly.
- Queue low-confidence and human-corrected examples for active learning.
- Re-evaluate when taxonomy, product surface, or upstream LLM version changes.
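Setting a confidence threshold from validation data, as the checklist suggests, can be sketched as a search for the lowest score at which auto-routed predictions still meet a target precision. The scores and correctness flags below are invented; in practice you would run this per class on a held-out validation set.

```python
def pick_threshold(scores, is_correct, target_precision):
    """Lowest confidence threshold whose accepted predictions meet the target.

    Returns None if no threshold reaches the target precision.
    """
    ranked = sorted(zip(scores, is_correct), key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (score, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        # Tied scores are accepted together, so evaluate only after the last tie.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if correct / i >= target_precision:
            best = score  # accept everything with confidence >= score
    return best

# Hypothetical validation predictions for one class.
val_scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60]
val_correct = [True, True, True, False, True, False, False]
threshold_90 = pick_threshold(val_scores, val_correct, 0.90)
threshold_80 = pick_threshold(val_scores, val_correct, 0.80)
```

A lower precision target admits more traffic: here the 0.80 target auto-routes five of seven examples, while the 0.90 target auto-routes three.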
Key takeaways
- Text classification maps documents to predefined labels — the backbone of routing, moderation, and analytics in NLP pipelines.
- Label design and rubric quality matter more than model architecture up to a point.
- Classical TF-IDF + linear models remain strong baselines; transformers win on nuance; LLMs win on flexibility.
- Metrics must reflect business cost of false positives vs false negatives — rarely is accuracy enough.
- Production requires thresholds, drift monitoring, shadow launches, and a human fallback path.
Related reading
- Natural language processing fundamentals explained — preprocessing, tasks, and pipeline architecture
- Sentiment analysis explained — polarity classification with lexicons and transformers
- Named entity recognition explained — span-level labeling vs document categories
- Few-shot learning explained — classify with minimal labeled examples per class