Guide
Sentiment analysis explained
A product manager opens the reviews dashboard and sees that overall star ratings climbed this quarter — but support tickets about "battery life" spiked negative. Sentiment analysis is the branch of natural language processing that assigns an opinion label or score to text: positive, negative, neutral, or finer-grained scales, often per aspect ("shipping fast" vs "packaging flimsy"). It powers review aggregators, brand monitoring, ticket routing, trading signals, and moderation queues. Unlike named entity recognition, which finds what is mentioned, sentiment analysis answers how the author feels about it. This guide covers lexicon baselines, classical and transformer classifiers, aspect-based sentiment (ABSA), LLM prompting vs fine-tuning, evaluation with precision and recall, a product-review pipeline worked example, an approach decision table, common pitfalls, and a production checklist.
What sentiment analysis measures
At its simplest, sentiment analysis maps a text unit — a sentence, review, tweet, or document — to a polarity: positive, negative, or neutral. Many production systems add a continuous score from −1.0 (strongly negative) to +1.0 (strongly positive), which is easier to aggregate over time ("average sentiment this week") than hard labels.
Three design choices shape every project:
- Granularity: document-level ("this review is positive"), sentence-level (mixed reviews), or aspect-level (positive about price, negative about durability).
- Label schema: binary, ternary (+ neutral), five-star ordinal, or emotion categories (joy, anger, fear).
- Domain: general social text vs vertical jargon — "sick drop" is praise in gaming slang and concern in healthcare.
Sentiment is subjective. Without a rubric, two human annotators typically agree on only ~70–85% of fine-grained labels. Treat gold labels as noisy; measure inter-annotator agreement (Cohen's κ) before trusting a single F1 number.
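To make the agreement check concrete, here is a minimal sketch of Cohen's κ for two annotators, written with the standard library only. The annotator lists are invented for illustration; in practice a library implementation would likely be the better choice.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labels match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_expected) / (1.0 - p_expected)

# Invented example: 10 reviews, 7 raw agreements.
annotator_1 = ["pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg", "pos"]
annotator_2 = ["pos", "neu", "neg", "neu", "pos", "neg", "pos", "pos", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here raw agreement is 70%, but κ comes out noticeably lower because part of that agreement is expected by chance.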
Lexicon and rule baselines
The fastest baseline sums sentiment weights from a curated lexicon. Tools like VADER (Valence Aware Dictionary and sEntiment Reasoner) handle negation ("not good"), intensifiers ("very"), and punctuation ("!!!") with hand-tuned rules. AFINN and domain word lists (finance: "beat/miss", gaming: "nerf/buff") extend the idea.
Lexicons excel when you need explainability, zero training data, and sub-millisecond latency on millions of short posts. They fail on sarcasm, context-dependent polarity ("long battery life" vs "long wait time"), multilingual code-switching, and product-specific slang you have not encoded.
Practical pattern: ship a lexicon score as a sentiment_lexicon feature alongside learned models; it often boosts robustness on out-of-vocabulary intensifiers.
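The idea behind a rule-based scorer can be sketched in a few lines. This is not VADER's algorithm or lexicon: the word weights, the two-token lookback window, and the normalization are all assumptions chosen for illustration.

```python
import re

# Toy weights for illustration only; not taken from VADER or AFINN.
LEXICON = {"good": 2.0, "great": 3.0, "love": 3.0,
           "bad": -2.0, "terrible": -3.0, "flimsy": -2.0}
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "really": 1.3, "slightly": 0.5}

def lexicon_score(text):
    """Return a score in [-1, 1]; 0.0 when no lexicon word is present."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total, hits = 0.0, 0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        weight = LEXICON[token]
        # Look back up to two tokens for intensifiers and negators.
        window = tokens[max(0, i - 2):i]
        for prev in window:
            weight *= INTENSIFIERS.get(prev, 1.0)
        if any(prev in NEGATORS for prev in window):
            weight = -weight
        total += weight
        hits += 1
    if hits == 0:
        return 0.0
    # Scale by the largest base weight (3.0) and clip to [-1, 1].
    return max(-1.0, min(1.0, (total / hits) / 3.0))
```

A score like this can be stored next to the learned model's output, which is what makes the feature pattern described above cheap to adopt.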
Machine learning classifiers
Bag-of-words and linear models
Classical pipelines tokenize text, build TF-IDF or n-gram features, and train logistic regression or linear SVMs. On 10k–100k labeled reviews they remain strong baselines: fast to train, cheap to serve, and interpretable via top weighted n-grams.
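As a rough illustration of the classical approach, the sketch below trains a binary logistic regression on unigram and bigram counts using only the standard library. It uses raw counts rather than TF-IDF and six invented training reviews, so treat it as a toy; a real pipeline would more likely use an established ML library and far more data.

```python
import math
from collections import Counter

def featurize(text):
    """Unigram and bigram counts over lowercased whitespace tokens."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def predict_proba(weights, bias, text):
    """Probability that the text is positive under the linear model."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in featurize(text).items())
    return 1.0 / (1.0 + math.exp(-z))

def train(texts, labels, epochs=200, lr=0.5):
    """Plain SGD on log loss; labels are 1 (positive) or 0 (negative)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, y in zip(texts, labels):
            error = y - predict_proba(weights, bias, text)
            bias += lr * error
            for f, v in featurize(text).items():
                weights[f] = weights.get(f, 0.0) + lr * error * v
    return weights, bias

# Invented toy data for illustration.
TRAIN_TEXTS = ["great phone love it", "battery life is great", "terrible battery",
               "screen broke bad quality", "love the camera", "bad support terrible wait"]
TRAIN_LABELS = [1, 1, 0, 0, 1, 0]
weights, bias = train(TRAIN_TEXTS, TRAIN_LABELS)

# Interpretability: the n-grams with the largest positive weights.
top_positive = sorted(weights, key=weights.get, reverse=True)[:3]
```

Sorting the weight dictionary is the "top weighted n-grams" inspection mentioned above; the same trick works on the coefficients of any linear model.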
Neural and transformer models
Fine-tuning DistilBERT, RoBERTa, or domain models (FinBERT for finance) on your labels captures word order, negation scope, and comparative constructions ("better than the old model"). Expect higher accuracy at the cost of GPU inference and slower cold starts. For latency-sensitive paths, distill the transformer into a smaller student or export to ONNX.
Large language models
Prompting an LLM with few-shot examples ("Classify sentiment as positive, negative, or neutral. Text: …") works for exploratory dashboards and low-volume queues. Pair outputs with structured JSON schemas to avoid free-text parsing. For high-volume, cost-sensitive pipelines, fine-tuning a small classifier or using a dedicated sentiment API usually beats per-row GPT calls. Reserve LLMs for ABSA extraction where you need open-vocabulary aspect names without training data.
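Whatever model produces the output, the structured-JSON advice only helps if the response is validated before use. The sketch below assumes a hypothetical response shape with a "label" and a "score" field; the field names and the score range are illustrative choices, not part of any particular API.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}

def parse_sentiment_response(raw):
    """Return (label, score) if raw is valid JSON in the assumed shape, else None."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # free text or a non-string input
    if not isinstance(payload, dict):
        return None
    label, score = payload.get("label"), payload.get("score")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not -1.0 <= score <= 1.0:
        return None
    return label, float(score)
```

Returning None instead of raising lets the caller decide whether to retry the prompt, fall back to a cheaper classifier, or queue the row for review.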
Aspect-based sentiment analysis (ABSA)
Document-level polarity hides mixed opinions. ABSA extracts (aspect, sentiment) pairs: from "Great camera but the UI is confusing," you want (camera, positive) and (UI, negative).
Common architectures:
- Pipeline: run NER or keyword rules for aspects, then classify sentiment on each span — simple but error cascades.
- Joint tagging: BIO tags like B-ASP.pos mark aspect spans with polarity in one pass.
- LLM extraction: prompt for JSON list of aspects and scores; validate against schema; cache aspect taxonomy for consistency.
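For the joint tagging option, the tagger's output still has to be decoded into (aspect, sentiment) pairs. The sketch below assumes tags of the form B-ASP.pos, I-ASP.pos, and O, extrapolating the I- and O tags from the usual BIO convention; the exact tag set of a real model may differ.

```python
def decode_aspects(tokens, tags):
    """Turn parallel token and tag lists into (aspect_text, polarity) pairs."""
    pairs = []
    current_tokens, current_polarity = [], None

    def flush():
        nonlocal current_tokens, current_polarity
        if current_tokens:
            pairs.append((" ".join(current_tokens), current_polarity))
        current_tokens, current_polarity = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-ASP."):
            flush()  # a new span closes any open one
            current_tokens = [token]
            current_polarity = tag.split(".", 1)[1]
        elif (tag.startswith("I-ASP.") and current_tokens
              and tag.split(".", 1)[1] == current_polarity):
            current_tokens.append(token)
        else:
            flush()  # "O" or an inconsistent continuation ends the span
    flush()
    return pairs

tokens = ["Great", "camera", "but", "the", "UI", "is", "confusing"]
tags = ["O", "B-ASP.pos", "O", "O", "B-ASP.neg", "O", "O"]
pairs = decode_aspects(tokens, tags)
```

An I- tag with no open span, or with a different polarity than its span, is dropped here rather than repaired; that is a design choice, and a stricter pipeline might log such cases.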
Define a controlled aspect ontology (price, quality, support, shipping) for dashboards; allow an "other" bucket for discovery. Reconcile new aspects monthly rather than letting labels drift unbounded.
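One way to picture the ontology with an "other" bucket is a synonym lookup plus a counter of unmapped phrases for the periodic review. The synonym sets below are hypothetical; only the four canonical aspect names come from the text above.

```python
from collections import Counter

# Canonical aspects from the guide; the synonym sets are invented examples.
ONTOLOGY = {
    "price": {"price", "cost", "value"},
    "quality": {"quality", "build", "durability"},
    "support": {"support", "customer service", "helpdesk"},
    "shipping": {"shipping", "delivery", "packaging"},
}

def normalize_aspect(phrase):
    """Map a raw aspect phrase to a canonical aspect, or "other"."""
    key = " ".join(phrase.lower().split())
    for canonical, synonyms in ONTOLOGY.items():
        if key in synonyms:
            return canonical
    return "other"

def unmapped_counts(phrases):
    """Count phrases that fell into "other", as input for curator review."""
    return Counter(" ".join(p.lower().split()) for p in phrases
                   if normalize_aspect(p) == "other")
```

Reviewing the most common entries from unmapped_counts is one way to decide which new aspects deserve a place in the ontology.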
Evaluation metrics that match the product
Accuracy misleads on imbalanced corpora (90% positive reviews → a dummy "always positive" model scores 90%). Use per-class precision, recall, and F1; for ranking use cases, macro-F1 weights rare negative alerts equally with common neutrals.
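The imbalance problem is easy to reproduce. The sketch below computes per-class precision, recall, and F1 plus macro-F1 for an "always positive" dummy model on an invented 90/10 split; undefined ratios are set to 0.0 here, which is a convention rather than the only option.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall, and F1 for every label seen in either list."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

def macro_f1(report):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(m["f1"] for m in report.values()) / len(report)

# Invented corpus: 90 positive and 10 negative reviews, dummy model says "positive".
y_true = ["positive"] * 90 + ["negative"] * 10
y_pred = ["positive"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_metrics(y_true, y_pred)
macro = macro_f1(report)
```

Accuracy is 0.9 while recall on the negative class is 0.0, and macro-F1 drops to roughly 0.47, which reflects the dummy model's failure far better.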
For ordinal star ratings, quadratic weighted kappa penalizes confusing 1-star with 5-star more than 4 with 5. For continuous scores, report Pearson/Spearman correlation with human means and calibration plots (predicted score vs observed satisfaction).
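A small standard-library sketch of quadratic weighted kappa shows the ordinal penalty at work. The ratings are invented, and the 1 to 5 range is an assumption matching star ratings.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=5):
    """Quadratic weighted kappa for integer ratings in [min_rating, max_rating]."""
    k = max_rating - min_rating + 1
    n = len(y_true)
    # Confusion matrix: rows are true ratings, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    numerator = denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # grows with rating distance
            expected = hist_true[i] * hist_pred[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator if denominator else 1.0

truth = [5, 5, 5, 5, 1, 1, 1, 1]
near_miss = [4, 5, 5, 5, 1, 1, 1, 1]  # one 5-star predicted as 4
far_miss = [1, 5, 5, 5, 1, 1, 1, 1]   # one 5-star predicted as 1
qwk_near = quadratic_weighted_kappa(truth, near_miss)
qwk_far = quadratic_weighted_kappa(truth, far_miss)
```

Both prediction sets contain exactly one error, so plain accuracy is identical, yet the far miss scores clearly lower under the quadratic weighting.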
Slice metrics by language, channel (Twitter vs email), and text length. A model trained on formal reviews often collapses on emoji-heavy chat unless you augment the training set with domain-appropriate examples.
Worked example: e-commerce review pipeline
A marketplace ingests ~50k new product reviews per day across electronics and apparel.
- Ingest & dedupe: hash review body + SKU; drop duplicates and bot patterns (repeated template, burst from one IP).
- Language detect: route EN/ES/DE to language-specific models; fallback to multilingual XLM-RoBERTa for long tail.
- Document sentiment: DistilBERT fine-tuned on 30k labeled reviews → ternary label + score; latency budget 40 ms on CPU batch-32.
- ABSA for electronics: weekly LLM job on a 2% sample extracts new aspect phrases; human curator merges into ontology; joint tagger fine-tuned on 5k span labels serves online.
- Aggregation: nightly rollups per SKU: mean score, aspect heatmap, alert if "defect" aspect negative rate > 2× trailing 30-day baseline.
- Human loop: route 1-star + high model uncertainty to moderators; corrections feed next month's retrain.
Lexicon VADER runs in parallel as a drift check — if VADER and BERT disagree on >25% of a slice, trigger data quality review (often template change or new meme slang).
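Three of the rules in this pipeline are simple enough to sketch directly: the dedupe key, the defect alert, and the lexicon-versus-model drift check. The function names are hypothetical, and normalizing case and whitespace before hashing is an added assumption; the thresholds (2× baseline, 25% disagreement) come from the example above.

```python
import hashlib

def dedupe_key(sku, body):
    """Hash of SKU plus review body; assumes case and whitespace are not significant."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(f"{sku}\x1f{normalized}".encode("utf-8")).hexdigest()

def defect_alert(todays_negative_rate, baseline_negative_rate, multiplier=2.0):
    """True when the "defect" negative rate exceeds 2x the trailing baseline."""
    return todays_negative_rate > multiplier * baseline_negative_rate

def disagreement_rate(lexicon_labels, model_labels):
    """Fraction of items where the lexicon and the model disagree."""
    pairs = list(zip(lexicon_labels, model_labels))
    return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0

def needs_quality_review(lexicon_labels, model_labels, threshold=0.25):
    """Drift check for one slice: flag when disagreement exceeds the threshold."""
    return disagreement_rate(lexicon_labels, model_labels) > threshold

# Invented slice: the two systems disagree on 3 of 10 reviews.
vader_labels = ["pos"] * 6 + ["neg"] * 4
bert_labels = ["pos"] * 5 + ["neg"] * 3 + ["pos"] * 2
flagged = needs_quality_review(vader_labels, bert_labels)
```

Keeping these rules as small pure functions makes the thresholds easy to test and to tune per slice.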
Approach decision table
| Scenario | Recommended approach | Avoid |
|---|---|---|
| No labels, need same-day dashboard | VADER or domain lexicon + manual spot checks | Claiming 95% accuracy without evaluation |
| 10k+ labels, batch analytics | TF-IDF + logistic regression or fine-tuned DistilBERT | Full GPT-4 per row at scale |
| Mixed opinions per review | Sentence split + ABSA or joint tagging | Document-level label only |
| Real-time moderation (<50 ms) | Linear model or tiny ONNX transformer on GPU | Multi-hop LLM chains |
| New product line, zero history | Few-shot LLM + active learning to label disagreements | Transfer model from unrelated domain without slice eval |
Common pitfalls
- Ignoring neutral class: forcing binary labels inflates apparent accuracy; many support tickets are factual, not emotional.
- Train/test leakage: duplicate reviews across splits or including metadata (star rating) the model will not see at inference.
- Star rating as free label: 3-star reviews are often mixed sentiment; align annotation rubric with stars or drop ambiguous rows.
- Sarcasm overfitting: small sarcasm sets do not generalize; flag low-confidence instead of guessing.
- Concept drift: product launches and memes shift language; monitor score distributions and retrain on a schedule.
- Demographic bias: dialect and cultural expression styles scored as "negative tone" — audit slices and adjust training data.
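For the train/test leakage pitfall, one possible guard is to assign each review to a split by hashing its normalized text, so exact duplicates always land on the same side. This is a sketch under the assumption that duplicates differ only in case and whitespace; near-duplicates would need fuzzier matching.

```python
import hashlib

def split_bucket(review_body, test_fraction=0.2):
    """Deterministically assign a review to "train" or "test" by content hash."""
    normalized = " ".join(review_body.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a position in [0, 1].
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if position < test_fraction else "train"
```

Because the assignment depends only on the text, re-running the split after new data arrives never moves an existing review across the boundary.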
Practitioner checklist
- Define granularity (document, sentence, aspect) and label rubric with examples.
- Measure inter-annotator agreement before scaling labeling.
- Ship a lexicon baseline for explainability and drift comparison.
- Evaluate with macro-F1 or per-class metrics, not accuracy alone.
- Slice test sets by language, channel, and product category.
- Version models and training corpora; log model ID on every prediction.
- Expose confidence or entropy; route uncertain cases to humans.
- For ABSA, maintain a curated aspect ontology with an "other" escape hatch.
- Monitor input length and OOV rate; cap tokens and truncate consistently.
- Revisit labels quarterly — sentiment language evolves faster than entities.
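Two checklist items, logging the model ID and routing uncertain cases, can be combined in one prediction record. The sketch below uses normalized entropy over class probabilities; the 0.9 threshold and the model ID string are hypothetical values for illustration.

```python
import math

def route_prediction(class_probs, model_id, entropy_threshold=0.9):
    """Build a loggable record and flag high-entropy predictions for human review."""
    label = max(class_probs, key=class_probs.get)
    entropy = -sum(p * math.log(p) for p in class_probs.values() if p > 0)
    # Divide by log(K) so the threshold is comparable across label schemas.
    k = len(class_probs)
    normalized = entropy / math.log(k) if k > 1 else 0.0
    return {
        "model_id": model_id,
        "label": label,
        "entropy": normalized,
        "needs_human": normalized > entropy_threshold,
    }

confident = route_prediction(
    {"positive": 0.96, "neutral": 0.03, "negative": 0.01}, "distilbert-reviews-v3")
uncertain = route_prediction(
    {"positive": 0.40, "neutral": 0.35, "negative": 0.25}, "distilbert-reviews-v3")
```

The first record stays automated while the second is flagged, and both carry the model ID so later audits can tie a prediction to the exact model version.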
Key takeaways
- Sentiment analysis assigns opinion polarity or scores to text — distinct from entity extraction but often combined in analytics pipelines.
- Lexicons are fast baselines; fine-tuned transformers win on nuanced phrasing when you have labeled data.
- ABSA surfaces actionable drivers (which feature angered users) that document-level scores hide.
- Choose metrics and slices that match downstream actions — moderation cares about recall on toxic negatives; dashboards care about calibrated trends.
- LLMs help bootstrap aspects and rubrics; specialized classifiers sustain cost and latency at scale.
Related reading
- Natural language processing fundamentals — preprocessing, embeddings, and task taxonomy
- Named entity recognition explained — extracting structured spans for aspect pipelines
- Precision, recall and F1 explained — choosing metrics for imbalanced sentiment classes
- LLM fine-tuning explained — when to specialize a model vs prompt a general one