Guide

LLM chain-of-density summarization explained

Harbor Media's automated earnings briefs launched with a single abstractive pass: “Summarize this 10-K filing in 80 words.” Copy editors rated fluency 4.2/5 but entity recall 2.1/5: CEO names, segment revenue figures, and guidance ranges kept vanishing behind generic phrases like “strong performance in key markets.” Raising the word limit helped only marginally, and asking the model to “include all numbers” produced unreadable lists. The newsroom nearly killed the pipeline after a wire picked up a brief that omitted a $400M impairment charge buried on page 47.

The rebuild adopted chain-of-density (CoD) prompting: five iterative rewrites at a fixed word count, each pass instructed to add 1–3 missing entities without growing length. Editors still review, but first-draft entity recall rose from 41% to 87% on a 200-filing eval set while word count stayed locked at 80. This guide covers what CoD is, how it differs from one-shot and map-reduce summarization, the sparse-to-dense iteration loop, entity scoring and stopping rules, integration with context compression pipelines, the Harbor Media refactor, a technique decision table versus standard abstractive summaries, pitfalls, and a production checklist.

What chain-of-density summarization is

Chain-of-density (CoD) is a prompting pattern where an LLM produces a sequence of summaries of the same source text at an identical length budget. Each iteration must become denser: more named entities, numbers, dates, and causal links per word, without increasing the word count.

Adams, Fabbri, and colleagues introduced the method in 2023. In their human evaluation, raters preferred CoD summaries that were denser than vanilla abstractive outputs, although the authors also report a trade-off between informativeness and readability at the highest density levels. The insight: models default to vague abstractions when asked once; forcing repeated compression with explicit “what is still missing?” steps surfaces facts the first pass skipped.

CoD is not a training technique. Weights stay frozen. You change the inference recipe — typically 3–5 chained calls or a single multi-turn thread where each assistant message becomes the prior summary input.

The sparse-to-dense iteration loop

A production CoD loop has four mechanical steps per density level:

Identify missing entities — prompt the model to list 1–3 salient entities (people, orgs, metrics, events) present in the source but absent from the current summary.
Fuse entities — rewrite the summary at the same word count, weaving in the missing items while dropping the least informative phrases from the prior draft.
Verify length — reject outputs that exceed the budget; trim or re-prompt with “shorter by N words.”
Score faithfulness — optional automated check that each new entity appears in source spans (string match, NER alignment, or entailment classifier).
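
The four steps above can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not Harbor Media's implementation: `identify` and `rewrite` are hypothetical callables standing in for whatever LLM calls a team uses, "word" means a whitespace-separated token, and the string-match filter is the simplest of the faithfulness options listed in step four (applied here before fusing rather than after).

```python
from typing import Callable, List


def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed definition of 'word')."""
    return len(text.split())


def densify(
    source: str,
    first_summary: str,
    identify: Callable[[str, str], List[str]],
    rewrite: Callable[[str, str, List[str], int], str],
    budget: int,
    levels: int = 5,
    max_retries: int = 2,
) -> List[str]:
    """Return one summary per density level, sparsest first.

    identify(source, summary) -> candidate missing entities (hypothetical LLM call)
    rewrite(source, summary, entities, budget) -> denser summary (hypothetical LLM call)
    """
    summaries = [first_summary]
    for _ in range(levels - 1):
        current = summaries[-1]
        # Step 1: ask for entities present in the source but absent from the summary.
        candidates = identify(source, current)
        # Step 4 (applied early): keep only entities that string-match the source.
        supported = [e for e in candidates if e.lower() in source.lower()][:3]
        if not supported:
            break  # nothing verifiable left to add
        # Step 2: fuse at the same budget; the source is passed on every call.
        candidate = rewrite(source, current, supported, budget)
        # Step 3: re-prompt on overflow, then fall back to a hard trim.
        retries = 0
        while word_count(candidate) > budget and retries < max_retries:
            candidate = rewrite(source, candidate, supported, budget)
            retries += 1
        if word_count(candidate) > budget:
            candidate = " ".join(candidate.split()[:budget])
        summaries.append(candidate)
    return summaries
```

Passing the model calls in as parameters keeps the loop testable with stubs and leaves room to use different models for the identify and rewrite steps.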

Level 1 is intentionally sparse: a readable sketch. Level 5 is dense, in wire-service style. Teams often ship level 3 or 4 to human readers and reserve level 5 for push notifications, where every word must carry a fact.

Prompt skeleton

Article: {{source_text}}

Summary ({{word_count}} words): {{current_summary}}

Missing entities (1-3, from article only): ...

Denser summary (exactly {{word_count}} words, include missing entities,
drop vague phrases):

Keep the source article in context for every iteration. CoD quality collapses when later passes see only the previous summary and not the original: with nothing to draw on, the model invents entities to satisfy the “add missing” instruction.
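
One way to make the "source in every iteration" rule hard to break is to render the skeleton through a function that refuses to run without the article. The sketch below assumes a two-call design in which the missing entities were already identified and are filled into the prompt; the function name `render_cod_prompt` and the Python `str.format` placeholders are illustrative choices, not part of the original skeleton.

```python
from typing import List

# The skeleton from this section, with single-brace placeholders for str.format.
PROMPT_TEMPLATE = """Article: {source_text}

Summary ({word_count} words): {current_summary}

Missing entities (1-3, from article only): {missing_entities}

Denser summary (exactly {word_count} words, include missing entities,
drop vague phrases):"""


def render_cod_prompt(
    source_text: str,
    current_summary: str,
    missing_entities: List[str],
    word_count: int,
) -> str:
    """Fill the CoD skeleton; fail loudly if the source article is absent."""
    if not source_text.strip():
        raise ValueError("source_text must be supplied on every iteration")
    return PROMPT_TEMPLATE.format(
        source_text=source_text,
        current_summary=current_summary,
        missing_entities=", ".join(missing_entities),
        word_count=word_count,
    )
```

Because the article is a required argument, a later pass cannot silently fall back to summarizing the summary.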

CoD vs other summarization patterns

Pattern	Mechanism	Strength	Weakness
One-shot abstractive	Single “summarize in N words” call	Cheapest, lowest latency	Vague, drops rare entities
Extractive (TextRank, lead-3)	Select sentences verbatim	Faithful wording	Choppy, no synthesis across sections
Map-reduce	Summarize chunks, merge summaries	Book-length corpora	Merge step still loses cross-chunk entities
Chain-of-density	Fixed-length iterative densification	Entity-rich briefs at tight word caps	3–5× token cost vs one-shot
Refine / rolling summary	Each chunk updates running summary	Streaming long docs	Early chunks over-weighted unless re-balanced

CoD shines when the output format is rigid (push alert, SEO dek, earnings bullet) and missing a proper noun is worse than slightly awkward prose. For holistic narrative summaries at 500+ words, map-reduce or hierarchical abstractive pipelines often fit better.

Entity scoring and stopping rules

Blindly running five CoD levels wastes tokens once entity recall plateaus. Instrument each iteration:

Entity recall@source — fraction of gold entities (human-labeled or regex-extracted tickers, $ amounts, executive titles) present in the summary.
Novel entity rate — entities added at level k that were not in level k−1. Stop when this falls below a threshold (e.g. <0.5 new entities per pass).
Hallucination rate — entities in summary with no support span in source. If this rises, reduce iterations or add retrieval grounding.
Readability — Flesch or editor rubric; density should not collapse into noun stacks.
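
The first three metrics can be approximated with simple set arithmetic. The sketch below assumes entities arrive as plain strings from some upstream extractor (not shown), compares them case-insensitively, and treats "support" as a substring match in the source; a production system might use NER alignment or an entailment model instead. The per-pass novel count would be averaged over a batch to get a rate.

```python
from typing import Iterable, Set


def _norm(entities: Iterable[str]) -> Set[str]:
    """Lowercase and strip entities so comparisons ignore case and padding."""
    return {e.strip().lower() for e in entities}


def entity_recall(gold: Iterable[str], summary_entities: Iterable[str]) -> float:
    """Fraction of gold entities that appear among the summary's entities."""
    gold_n = _norm(gold)
    if not gold_n:
        return 1.0  # assumption: an empty gold set counts as fully recalled
    return len(gold_n & _norm(summary_entities)) / len(gold_n)


def novel_entity_count(prev_entities: Iterable[str], curr_entities: Iterable[str]) -> int:
    """Entities present at level k that were absent at level k-1."""
    return len(_norm(curr_entities) - _norm(prev_entities))


def hallucination_rate(summary_entities: Iterable[str], source_text: str) -> float:
    """Share of summary entities with no substring match in the source."""
    ents = _norm(summary_entities)
    if not ents:
        return 0.0
    src = source_text.lower()
    unsupported = [e for e in ents if e not in src]
    return len(unsupported) / len(ents)
```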

Harbor Media stops iterating as soon as recall@source reaches 85% or the novel entity rate falls below 0.3, whichever comes first, and by default never goes past level 4. Level 5 runs only for filings flagged “high risk” by a separate classifier (restructuring, impairment, guidance cuts).
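
The stopping rule above fits in a few lines. This is one reading of it, with the thresholds taken from the text and two assumptions labeled in the code: that level 4 is a hard cap for ordinary filings, and that high-risk filings always run through level 5 regardless of the thresholds. The function name and parameters are illustrative.

```python
def should_stop(
    level: int,
    recall: float,
    novel_rate: float,
    high_risk: bool = False,
    recall_target: float = 0.85,
    novel_floor: float = 0.3,
    default_cap: int = 4,
    hard_cap: int = 5,
) -> bool:
    """Decide whether to stop after finishing the given density level."""
    if high_risk:
        # Assumption: flagged filings ignore the thresholds and run to level 5.
        return level >= hard_cap
    if level >= default_cap:
        # Assumption: ordinary filings never go past level 4.
        return True
    # Stop early once recall is high enough or new entities have dried up.
    return recall >= recall_target or novel_rate < novel_floor
```

For example, `should_stop(2, recall=0.9, novel_rate=1.0)` stops at level 2 because recall has already cleared the target, while the same metrics with `high_risk=True` keep iterating.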

Harbor Media earnings brief refactor

Before CoD, the pipeline chunked 10-K HTML into 2K-token segments, map-summarized each, then asked for one 80-word abstract. Impairments in footnotes often lived in low-salience chunks and never reached the merge prompt.

After refactor:

Structured pre-pass — extract tables (revenue by segment, guidance ranges, impairment lines) into JSON sidecars passed alongside prose.
CoD on full text + JSON — five levels at 80 words; JSON entities listed as “candidate missing entities” when the model skips them.
Faithfulness gate — every $ figure in the summary must fuzzy-match a figure in source or JSON; otherwise roll back one level and re-prompt with the orphan highlighted.
Editor UI — shows levels 1–5 side by side; default publish level 4; diff highlights entities added per level.
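
The faithfulness gate on dollar figures might look something like the sketch below. It is an illustration only: the regular expression, the unit suffixes it recognizes, and the 1% relative tolerance used for "fuzzy-match" are all assumptions, and `sidecar_values` stands in for numbers pulled from the JSON sidecars. A caller would roll back one level and re-prompt whenever the returned list is non-empty.

```python
import math
import re
from typing import Iterable, List

# Matches "$400M", "$1,250.5 million", "$3.2B", "$75" and similar (assumed formats).
_MONEY = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)\s?(billion|million|thousand|[BMK])?\b",
    re.IGNORECASE,
)
_SCALE = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
}


def dollar_figures(text: str) -> List[float]:
    """Extract dollar amounts from text, normalized to plain dollars."""
    values = []
    for number, unit in _MONEY.findall(text):
        scale = _SCALE.get(unit.lower(), 1.0)  # no suffix means plain dollars
        values.append(float(number.replace(",", "")) * scale)
    return values


def orphan_figures(
    summary: str,
    source: str,
    sidecar_values: Iterable[float] = (),
    rel_tol: float = 0.01,
) -> List[float]:
    """Return summary dollar figures with no close match in source or sidecar."""
    allowed = dollar_figures(source) + list(sidecar_values)
    return [
        value
        for value in dollar_figures(summary)
        if not any(math.isclose(value, a, rel_tol=rel_tol) for a in allowed)
    ]
```

The tolerance lets a summary round "$1,250.5 million" to "$1.25B" without tripping the gate; a stricter newsroom could set `rel_tol` to zero and require exact figures.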

Median API cost per brief rose from $0.04 to $0.19 (five calls on a mid-tier model). Editor touch time fell from 6.2 minutes to 2.1 minutes, and the share of briefs that were wire-ready without a rewrite rose from 31% to 78%. Errors in the impairment-omission class dropped to zero over eight weeks of monitored publishes.
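
As a quick sanity check on those figures, the arithmetic below uses only the numbers reported above; it shows that the cost multiple falls inside the 3–5× range quoted for CoD in the comparison table.

```python
# Figures reported for the Harbor Media refactor.
cost_before, cost_after = 0.04, 0.19      # median API cost per brief, USD
touch_before, touch_after = 6.2, 2.1      # editor touch time, minutes

cost_multiple = cost_after / cost_before          # about 4.75x
extra_cost = cost_after - cost_before             # about $0.15 per brief
minutes_saved = touch_before - touch_after        # about 4.1 minutes per brief
touch_reduction = 1 - touch_after / touch_before  # about 0.66, roughly two thirds

print(f"cost multiple: {cost_multiple:.2f}x")
print(f"extra cost per brief: ${extra_cost:.2f}")
print(f"minutes saved per brief: {minutes_saved:.1f}")
print(f"touch-time reduction: {touch_reduction:.0%}")
```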

Combining CoD with compression and RAG

CoD does not replace retrieval or context window management — it refines the output once relevant text is in context. Typical stack:

Retrieve top-k sections (earnings release, risk factors, MD&A).
Optionally compress boilerplate tables into JSON stubs.
Run CoD on the bundled context with source pinned in every iteration.
Place final summary near the end of any downstream Q&A prompt to combat lost-in-the-middle effects when the summary is reused as context.

Cache the final CoD outputs keyed by document hash, so a filing is re-summarized only when its version changes.
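
A content hash makes the invalidation rule automatic: a changed filing produces a new key, so the old entry is simply never hit. The sketch below is an in-memory illustration with a hypothetical `SummaryCache` class; it assumes SHA-256 over the UTF-8 text as the document hash, and a real deployment would more likely back it with a shared store.

```python
import hashlib
from typing import Callable, Dict


class SummaryCache:
    """In-memory cache of summaries keyed by a hash of the document text."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def key(document_text: str) -> str:
        """SHA-256 hex digest of the document (assumed hashing scheme)."""
        return hashlib.sha256(document_text.encode("utf-8")).hexdigest()

    def get_or_compute(self, document_text: str, compute: Callable[[str], str]) -> str:
        """Return the cached summary, running compute only on a cache miss."""
        k = self.key(document_text)
        if k not in self._store:
            self._store[k] = compute(document_text)
        return self._store[k]
```

Here `compute` would be the full CoD run; it is passed in as a callable so the cache stays independent of any particular model client.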

Technique decision table

Approach	Best when	Skip when
One-shot abstractive	Internal notes, low-stakes digests	Wire copy, compliance-facing briefs
Extractive selection	Legal quotes, code snippets	Multi-section synthesis required
Map-reduce	100+ page reports, books	Sub-5-page docs with tight word caps
Chain-of-density	Fixed-length entity-rich briefs (80–120 words)	Long-form narrative summaries
Human-only	Material disclosures, regulated advice	High-volume low-risk content

Common pitfalls

Iterations without source — later passes hallucinate entities; always attach the original article.
Soft word limits — “about 80 words” drifts to 110; enforce with tokenizer counts and hard rejection.
Unbounded levels — level 6+ often degrades into keyword soup; cap at 5 and use stopping rules.
Ignoring merge-stage loss — CoD on chunk summaries still misses cross-chunk facts; run CoD on retrieved full sections or JSON sidecars.
Same model for identify and write — a smaller model can list missing entities; reserve the large model for the rewrite step to save cost.
No hallucination check — density pressure invents plausible tickers; require source span alignment for numbers and names.
Skipping human review on material news — CoD improves recall, not legal liability; keep editors in the loop for market-moving filings.

Production checklist

Define fixed word or token budget per output channel (push, web dek, email).
Pin full source text (and structured extracts) in every CoD iteration.
Prompt for 1–3 missing entities explicitly before each denser rewrite.
Enforce exact length with programmatic trim and re-prompt fallback.
Log entity recall, novel-entity rate, and hallucination rate per level.
Stop early when recall plateaus or novel entities drop below threshold.
Run faithfulness gates on dollar amounts, dates, and executive names.
Cache final summaries by document hash; invalidate on source update.
Expose level 1–5 drafts to editors with per-level entity diffs.
Pair CoD with retrieval for long filings; do not run CoD over an entire 10-K in one window.
A/B test publish level (3 vs 4 vs 5) on editor time and correction rate.
Document which density level ships to each downstream surface.

Key takeaways

Chain-of-density fixes the vagueness of one-shot summaries by iterating at a fixed length, adding entities each pass.
Keep the source document in context for every iteration — summarizing the summary alone invites hallucination.
Harbor Media raised entity recall from 41% to 87% at 80 words while cutting editor touch time by two thirds.
Use stopping rules and faithfulness gates; density without verification is worse than a sparse accurate brief.
CoD complements retrieval and compression — it polishes the output layer, not the ingest layer.