
Text summarization explained

Text summarization compresses a long document into a shorter version that preserves the most important information. News digests, earnings-call briefs, support-ticket triage, and pre-processing for RAG pipelines all depend on reliable summarization. The field splits into two families: extractive methods that select existing sentences from the source, and abstractive methods that generate new wording — including modern transformer models and frontier LLMs. This guide covers both paradigms, evaluation with ROUGE, handling documents longer than a model’s context window, LLM prompting patterns, faithfulness and hallucination risks, a worked news-article example, a method decision table, common pitfalls, and a production checklist.

Extractive vs abstractive summarization

Extractive summarization treats the problem as selection: score each sentence (or passage) in the source document, pick the top k units, and concatenate them in original order. Nothing is paraphrased, so the output cannot introduce facts that are absent from the source, although a sentence lifted out of its context can still mislead. Extractive summaries can also read choppily and miss ideas spread across multiple sentences.

Abstractive summarization generates new text, like a human editor rewriting a wire story into a headline and two paragraphs. Seq2seq models and LLMs excel here: they produce fluent prose and can synthesize information from distant parts of a document. The trade-off is hallucination — invented facts, wrong numbers, or attributions that never appeared in the source.

Hybrid pipelines are common in production: extractive pre-filtering shrinks a 50-page PDF to the ten most relevant paragraphs, then an abstractive model writes a three-sentence brief. This bounds cost and reduces the surface area for fabrication.

Extractive methods and baselines

Before reaching for a billion-parameter model, establish baselines — they are fast, cheap, and surprisingly competitive on news and corporate filings.

Lead baseline

For inverted-pyramid news articles, the first n sentences often are the summary. A lead-3 baseline (first three sentences) is the benchmark every serious system must beat on datasets like CNN/DailyMail.
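A lead-k baseline takes only a few lines. The sketch below assumes a naive regex sentence splitter (abbreviations such as "U.S." will trip it); the function names are illustrative rather than taken from any library.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def lead_k(text: str, k: int = 3) -> str:
    """Return the first k sentences of the text, joined by single spaces."""
    return " ".join(split_sentences(text)[:k])

article = (
    "Acme beat estimates. Guidance was raised. A buyback was announced. "
    "The CFO will depart in July."
)
print(lead_k(article, k=3))
# prints: Acme beat estimates. Guidance was raised. A buyback was announced.
```

In practice a tokenizer-backed sentence splitter is likely a better choice than the regex used here.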

Sentence scoring

Score sentences by TF-IDF centrality, BM25 relevance to the document title, or position weighting (earlier sentences score higher). Combine scores with redundancy penalties so selected sentences do not repeat the same fact.
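One way to combine these signals is sketched below. It assumes IDF is computed over sentences rather than over a corpus, a simple 1 / (1 + decay * position) weighting, and a multiplicative Jaccard redundancy penalty; all three are illustrative choices, not a standard recipe.

```python
import math
import re
from collections import Counter

def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def select_sentences(sentences: list[str], k: int = 2,
                     penalty: float = 0.7, decay: float = 0.1) -> list[str]:
    tokens = [tokenize(s) for s in sentences]
    token_sets = [set(t) for t in tokens]
    n = len(sentences)
    tf = Counter(t for toks in tokens for t in toks)   # document-level term frequency
    df = Counter(t for ts in token_sets for t in ts)   # number of sentences containing each term
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    # Base score: mean TF-IDF weight of the sentence's terms, decayed by position.
    base = []
    for i, ts in enumerate(token_sets):
        tfidf = sum(tf[t] * idf[t] for t in ts) / max(len(ts), 1)
        base.append(tfidf / (1.0 + decay * i))

    chosen: list[int] = []
    candidates = set(range(n))
    while candidates and len(chosen) < k:
        def adjusted(i: int) -> float:
            # Penalize overlap with the most similar sentence already selected.
            overlap = max((jaccard(token_sets[i], token_sets[j]) for j in chosen),
                          default=0.0)
            return base[i] * (1.0 - penalty * overlap)
        best = max(sorted(candidates), key=adjusted)
        chosen.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(chosen)]  # restore original order

doc = [
    "Acme beat earnings estimates.",
    "Acme beat earnings estimates again.",
    "The CFO will depart in July.",
]
picked = select_sentences(doc, k=2)
```

Here the near-duplicate second sentence is penalized after the first is chosen, so the selection falls through to the CFO sentence instead.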

Graph-based: TextRank

TextRank (an adaptation of PageRank) builds a graph where sentences are nodes and edges represent lexical similarity. High-centrality sentences are selected. TextRank needs no training data and works across domains, though it struggles with highly technical prose where similarity metrics miss semantic overlap.
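The graph construction and ranking can be sketched in plain Python. This version assumes Jaccard token overlap as the edge weight (a simplification; other similarity functions are equally possible) and a fixed number of power-iteration steps instead of a convergence test.

```python
import re

def tokenize(sentence: str) -> set[str]:
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def textrank_scores(sentences: list[str], damping: float = 0.85,
                    iterations: int = 50) -> list[float]:
    sets = [tokenize(s) for s in sentences]
    n = len(sets)
    if n == 0:
        return []
    # Symmetric edge weights: Jaccard similarity between sentence token sets.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = sets[i] | sets[j]
            sim = len(sets[i] & sets[j]) / len(union) if union else 0.0
            weights[i][j] = weights[j][i] = sim
    totals = [sum(row) for row in weights]  # total edge weight leaving each node
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Weighted PageRank update; nodes with no edges pass on no score.
        scores = [
            (1.0 - damping) / n
            + damping * sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(n) if totals[j] > 0
            )
            for i in range(n)
        ]
    return scores

def top_k(sentences: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]

doc = [
    "Acme beat earnings estimates this quarter.",
    "Acme raised guidance after the earnings beat.",
    "Analysts had lower earnings estimates for Acme.",
    "Rain fell in July.",
]
scores = textrank_scores(doc)
```

The last sentence shares no tokens with the others, so it receives only the baseline (1 - damping) / n score and ranks last.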

Neural extractive models

BERT-style encoders score each sentence in the context of the full document (or a sliding window for long inputs). Fine-tuned models like BERTSUM outperform TextRank on standard benchmarks but require labeled summary data and GPU inference.

Abstractive methods: from seq2seq to LLMs

Early abstractive systems used LSTM encoder-decoder networks with attention. The transformer architecture replaced recurrence with self-attention, enabling parallel training on large summary corpora. Models like BART, PEGASUS, and T5 (fine-tuned on CNN/DailyMail or XSum) remain strong choices when you need on-prem inference at moderate cost.

LLM prompting

Frontier models summarize via natural-language instructions: “Summarize the following article in three bullet points. Include only facts stated in the text. Quote any numbers exactly.” Techniques that improve quality:

Chain-of-thought staging — first list key entities and claims, then write the summary from that scratchpad (reduces skipped facts).
Length and format constraints — specify word count, bullet vs paragraph, and audience (“for a busy portfolio manager”).
Map-reduce for long docs — summarize each chunk independently, then summarize the chunk summaries (watch for detail loss at the merge step).
Refine loops — ask the model to check its summary against the source and remove unsupported claims.
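The map-reduce pattern from the list above can be sketched independently of any vendor API. The llm argument below is a hypothetical callable (prompt in, completion out) that you would wrap around whichever client you use; the prompt wording is adapted from the instruction quoted earlier.

```python
from typing import Callable

PROMPT = (
    "Summarize the following text in at most {max_words} words. "
    "Include only facts stated in the text. Quote any numbers exactly.\n\n"
    "TEXT:\n{text}"
)

def map_reduce_summarize(chunks: list[str], llm: Callable[[str], str],
                         max_words: int = 60) -> str:
    """Summarize each chunk, then summarize the concatenated chunk summaries."""
    if not chunks:
        return ""
    # Map step: one independent call per chunk.
    partials = [llm(PROMPT.format(max_words=max_words, text=c)) for c in chunks]
    if len(partials) == 1:
        return partials[0]
    # Reduce step: details dropped by the map step cannot be recovered here.
    merged = "\n".join(partials)
    return llm(PROMPT.format(max_words=max_words, text=merged))
```

Because the reduce step sees only the partial summaries, any figure that must survive should be requested explicitly in the map prompt or verified against the source afterwards.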

Fine-tuning vs prompting

Fine-tune smaller models when you have thousands of domain-specific (source, summary) pairs — legal briefs, clinical notes, internal wiki pages. Prompt frontier LLMs when data is scarce, domains shift frequently, or you need zero-shot generalization across document types.

Evaluation: ROUGE, faithfulness, and human judgment

Automatic metrics compare system output to human-written reference summaries. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap:

ROUGE-1 — unigram overlap (word-level recall).
ROUGE-2 — bigram overlap (phrase-level fluency proxy).
ROUGE-L — longest common subsequence (captures sentence structure).

High ROUGE correlates with quality on news benchmarks but rewards copying and penalizes valid paraphrases. It says nothing about faithfulness — a summary can score well while inventing a revenue figure.
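To make the overlap counts concrete, here is a minimal sketch of ROUGE-N and ROUGE-L. It assumes lowercased whitespace tokenization with no stemming and a single reference, so its numbers will not exactly match an established ROUGE package; use one of those for reported results.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prf(overlap: int, cand_total: int, ref_total: int) -> dict[str, float]:
    """Precision, recall and F1 from an overlap count."""
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict[str, float]:
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return prf(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate: str, reference: str) -> dict[str, float]:
    cand, ref = candidate.lower().split(), reference.lower().split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n(candidate, reference, 1)   # recall 5/6
r2 = rouge_n(candidate, reference, 2)   # recall 3/5
rl = rouge_l(candidate, reference)      # recall 5/6
```

Changing one word costs a single unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 under paraphrase.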

Complement ROUGE with:

Entity overlap — do names, dates, and numbers in the summary appear in the source?
NLI-based faithfulness scores — does the source entail each summary sentence?
Human rubrics — coherence, relevance, and factual consistency rated on a 1–5 scale (gold standard for high-stakes domains).
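The numeric part of an entity-overlap check can be approximated with a regular expression. The sketch below assumes exact string matching of figures (so "$2B" and "$2 billion" would not match each other) and covers numbers only; names and dates generally need a named-entity recognizer.

```python
import re

# Digits with optional leading "$", internal "." or "," groups,
# and an optional "%" or K/M/B suffix.
NUMBER = re.compile(r"\$?\d+(?:[.,]\d+)*(?:%|[KMB]\b)?")

def unsupported_numbers(summary: str, source: str) -> list[str]:
    """Return figures that appear in the summary but not verbatim in the source."""
    source_numbers = set(NUMBER.findall(source))
    return [m for m in NUMBER.findall(summary) if m not in source_numbers]

source = (
    "Acme reported EPS of $1.42 versus the $1.31 consensus "
    "and announced a $2B buyback."
)
good = "EPS was $1.42, above the $1.31 consensus; the buyback totals $2B."
bad = "EPS was about $1.40 and the buyback totals $2B."

print(unsupported_numbers(good, source))  # prints: []
print(unsupported_numbers(bad, source))   # prints: ['$1.40']
```

A non-empty result is a signal to flag the summary for review rather than proof of an error, since a summary may legitimately reformat a figure.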

See LLM evaluation and benchmarking for broader metric suites and LLM hallucinations for mitigation strategies.

Long documents and context limits

A 10-K filing or clinical trial report can exceed a model’s context window. Strategies:

Hierarchical summarization — section summaries roll up to a document summary; preserve section headers as structure hints.
Retrieval-augmented summarization — embed chunks, retrieve the most relevant passages for the user’s query, summarize only those (pairs naturally with semantic search).
Sliding windows with deduplication — overlap windows to avoid cutting mid-sentence, merge partial summaries, deduplicate repeated facts.
Long-context models — 100k+ token windows reduce chunking artifacts but increase cost and latency; still verify faithfulness on content near the end of the document, which models often under-weight.
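The sliding-window strategy above can be sketched as follows. It assumes the document is already split into sentences, measures window size and overlap in sentences rather than tokens, and deduplicates by exact match after whitespace and case normalization; a real pipeline would probably budget by tokens and use fuzzier matching.

```python
def sliding_windows(sentences: list[str], size: int = 4,
                    overlap: int = 1) -> list[list[str]]:
    """Split sentences into windows of `size` that share `overlap` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last window already reaches the end of the document
    return windows

def dedupe(sentences: list[str]) -> list[str]:
    """Drop repeated sentences, keeping the first occurrence and original order."""
    seen: set[str] = set()
    unique = []
    for s in sentences:
        key = " ".join(s.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

sentences = ["s1", "s2", "s3", "s4", "s5", "s6"]
windows = sliding_windows(sentences, size=4, overlap=1)
merged = dedupe([s for w in windows for s in w])
```

Windows are built on sentence boundaries, so no window starts or ends mid-sentence; the overlap sentence appears twice before dedupe removes the repeat.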

Chunking strategy directly affects summary quality — see RAG chunking strategies for overlap, boundary, and metadata patterns that transfer to summarization pipelines.

Worked example: earnings news article

Source: A 600-word news article reporting that Acme Corp beat Q1 earnings estimates ($1.42 EPS vs $1.31 consensus), raised full-year guidance to 8–10% revenue growth, announced a $2B buyback, and noted that CFO Jane Smith will depart effective July 1.

Pipeline:

Split into 12 sentences and run TextRank. The top sentences cover the EPS beat, guidance raise, and buyback; the CFO departure ranks lower because it shares little vocabulary with the rest of the article.
Pass top-5 sentences plus full text to an LLM with instruction: “Write a 60-word summary for investors. Include exact EPS and guidance figures. Mention CFO departure.”
Post-check: verify $1.42, $1.31, 8–10%, and $2B appear in source; flag if any number in summary lacks a source match.

Output: “Acme Corp reported Q1 EPS of $1.42, beating the $1.31 consensus. Management raised full-year revenue growth guidance to 8–10% and authorized a $2B share repurchase. CFO Jane Smith will depart effective July 1.”

Extractive-only output would have missed the CFO line; abstractive-only without entity checks might round $1.42 to “about $1.40” or invent a replacement name.

Method decision table

Approach	Best for	Faithfulness	Fluency	Cost / latency
Lead baseline / TextRank	News, quick prototypes	High	Low–medium	Very low
Fine-tuned BERTSUM / BART	High-volume, fixed domain	Medium–high	Medium	Low (GPU)
LLM zero-shot prompt	Mixed document types, fast iteration	Medium (verify!)	High	Medium–high
Hybrid extract + LLM	Long docs, compliance-sensitive	High with checks	High	Medium
Map-reduce LLM	Books, filings, transcripts	Medium (merge loss)	High	High

Common pitfalls

Optimizing ROUGE alone — models learn to copy the longest source sentences; add faithfulness metrics and human review for production.
Ignoring negation and qualifiers — “did not beat estimates” summarized as “beat estimates” is a catastrophic failure mode; NLI checks help.
Single-shot LLM on 100-page PDFs — tail content gets dropped; use hierarchical or map-reduce pipelines.
Wrong summary length — a three-sentence cap on a complex merger agreement omits material risks; tune length to use case.
Training/test leakage — near-duplicate news articles that land in both training and test splits inflate ROUGE; deduplicate first and use document-level splits.
Skipping PII redaction — summarizing support tickets into a digest can leak customer data into a shared channel; redact before summarization.
No citation anchors — for regulated domains, link each summary claim to a source paragraph or page number.
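For the PII pitfall in the list above, a pattern-based pass before summarization might look like the sketch below. It assumes only email addresses and phone-like digit runs need masking; names, addresses, and account identifiers are not covered and would need a dedicated PII detection step.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: a digit, then at least 7 digits or separators, then a digit.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

ticket = "Contact jane@example.com or +1 (555) 010-9999 about ticket 42."
print(redact(ticket))
# prints: Contact [EMAIL] or [PHONE] about ticket 42.
```

The phone pattern is deliberately loose and can mask other long digit sequences such as order numbers, which is usually the safer failure direction for a shared digest.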

Production checklist

Define summary format (bullets, paragraph, word cap) and audience per use case.
Establish baselines: lead-k and TextRank before deploying neural models.
Measure ROUGE-1/2/L and entity-level faithfulness on a held-out set.
For LLM summaries, use structured prompts with explicit “source only” constraints.
Post-process: verify numbers, dates, and named entities against the source text.
Log inputs, outputs, and model version for audit trails in regulated industries.
Rate-limit requests and cache summaries of identical documents to control API cost.
Human-review a random sample weekly; track coherence and factual error rate over time.
For multilingual sources, confirm the summarization model supports the document’s language.
Document when not to summarize — raw text may be safer for legal discovery.
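The caching and audit-log items in the checklist can share one small wrapper. The sketch below assumes an in-memory dictionary, a hypothetical summarize callable, and a model_version string that you maintain yourself; a production system would more likely use a persistent store.

```python
import hashlib
from typing import Callable

class SummaryCache:
    """Cache summaries by a hash of (model version, document) and log each miss."""

    def __init__(self, summarize: Callable[[str], str], model_version: str):
        self._summarize = summarize
        self.model_version = model_version
        self._store: dict[str, str] = {}
        self.audit_log: list[dict[str, str]] = []

    def _key(self, document: str) -> str:
        # Including the model version avoids serving summaries from an older model.
        payload = f"{self.model_version}\n{document}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, document: str) -> str:
        key = self._key(document)
        if key not in self._store:
            summary = self._summarize(document)
            self._store[key] = summary
            self.audit_log.append(
                {"key": key, "model_version": self.model_version, "summary": summary}
            )
        return self._store[key]
```

Whether the audit log should hold the raw input or only its hash depends on retention and privacy requirements; the sketch stores the hash only.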

Key takeaways

Extractive methods select existing sentences — faithful but choppy; abstractive methods generate new prose — fluent but hallucination-prone.
ROUGE measures overlap with reference summaries; pair it with faithfulness checks for production.
Long documents need hierarchical, map-reduce, or retrieval-augmented pipelines — not a single prompt.
Hybrid extract-then-abstract balances cost, fluency, and factual grounding.
LLM summarization shines with clear format constraints and automated entity verification.