LLM PDF document parsing explained
Harbor Legal indexed 12,000 merger PDFs for a counsel-assist RAG bot by calling
pdftotext and embedding the resulting string. On a regression set of
200 questions over earn-out schedules and indemnity exhibits, recall@10 was 74% —
acceptable until attorneys noticed systematic failures on two-column contracts and
financial tables. A plain-text strip had merged FY2023 and FY2024 columns, so the
bot cited a $4.2M earn-out cap from the wrong fiscal year in three
live deals. Rebuilding ingest with layout-aware PDF parsing —
reading-order reconstruction, table cells as structured blocks, and OCR only where
text layers were missing — lifted table QA recall from 61% to 94% and overall
counsel checklist accuracy from 79% to 92% without changing the embedding model.
PDF is a presentation format, not a semantic document model. Pages store positioned glyphs, vector paths, and embedded images; there is no guaranteed paragraph hierarchy in the file bytes. Teams that paste “extracted text” into RAG or map-reduce pipelines inherit scrambled reading order, destroyed tables, and header/footer noise.
This guide covers born-digital vs scanned PDF paths, layout analysis and reading order, table and figure handling, OCR integration, chunking for vector indexes, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist.
Why naive PDF text extraction fails
Most PDF libraries expose a get_text() or equivalent that concatenates
text objects in file order, not human reading order. Common failure modes on
pages that look fine on screen but break LLM pipelines:
- Multi-column layouts — legal briefs and SEC filings interleave left and right columns line-by-line, producing gibberish sentences when flattened.
- Tables as text soup — cell boundaries disappear; FY2023 revenue may appear adjacent to FY2024 EBITDA on the same line.
- Headers, footers, and page numbers — repeated every page, polluting embeddings and dominating retrieval for rare clause queries.
- Footnotes and margin notes — extracted mid-paragraph without anchors, confusing core vs supplemental obligations.
- Rotated and watermark text — included in the stream even when visually background-only.
- Scanned pages without a text layer — return empty strings unless OCR runs.
The fix is not a bigger embedding model; it is a parse stage that outputs structured blocks (title, paragraph, list item, table, figure caption) with bounding boxes, page numbers, and stable block IDs before chunking.
Born-digital vs scanned PDF pipelines
Route each uploaded PDF through a lightweight classifier before heavy processing:
Born-digital (selectable text)
Text objects exist in the PDF content stream. Use layout parsers (Unstructured, Docling, pdfplumber with heuristics, or commercial APIs) to recover blocks and reading order. Avoid rasterizing entire pages — OCR on clean digital PDFs adds latency and introduces character errors on small footnotes.
Scanned or hybrid (image-only pages)
Run page-level detection: if extracted character count per page falls below a threshold (e.g. <40 chars on a full letter page), treat as scan. Pipeline: 300 DPI render → deskew/denoise → OCR (Tesseract, PaddleOCR, or cloud Document AI) → layout model on OCR boxes. Hybrid SEC filings often mix digital body text with scanned signature pages; per-page routing is mandatory.
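The per-page routing rule above can be sketched in a few lines. This is a minimal illustration, not a production classifier: `PageText`, `route_page`, and `route_document` are hypothetical names, and the 40-character threshold is the example value from the text, which would need tuning on a real corpus.

```python
from dataclasses import dataclass

# Example threshold from the text: fewer than 40 extracted characters
# on a full page suggests an image-only scan. Tune on your own corpus.
SCAN_CHAR_THRESHOLD = 40


@dataclass
class PageText:
    page: int
    text: str  # whatever the text-layer extractor returned for this page


def route_page(p: PageText, threshold: int = SCAN_CHAR_THRESHOLD) -> str:
    """Return 'ocr' for likely scans and 'digital' otherwise."""
    visible = "".join(p.text.split())  # ignore whitespace-only output
    return "ocr" if len(visible) < threshold else "digital"


def route_document(pages: list[PageText]) -> dict[str, list[int]]:
    """Group page numbers by the pipeline each page should go through."""
    routes: dict[str, list[int]] = {"digital": [], "ocr": []}
    for p in pages:
        routes[route_page(p)].append(p.page)
    return routes


pages = [
    PageText(1, "This Agreement is entered into as of the Closing Date "
                "by and between the parties."),
    PageText(2, ""),          # scanned signature page, no text layer
    PageText(3, "  \n 12 "),  # only a stray page number survived
]
routes = route_document(pages)
print(routes)
```

Routing per page rather than per file is what lets a hybrid filing send its scanned signature pages to OCR while the digital body text skips it.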
Password-protected and redacted files
Decrypt server-side with user-supplied passwords in ephemeral memory; never log
credentials. Properly applied redactions remove the underlying text objects and
leave gaps; emit an explicit [REDACTED] token for each redacted region so the
LLM does not invent values to fill them.
Layout analysis and reading order
Modern parsers combine geometric heuristics with lightweight vision models:
- Extract text spans with font, size, and (x, y, width, height).
- Cluster into lines by vertical overlap tolerance.
- Detect columns via x-axis histograms or learned layout models.
- Sort blocks top-to-bottom within each column, then column left-to-right for Western documents.
- Classify block type — heading vs body via font size ratios and numbering patterns (1., 1.1, Article IV).
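The column-detection and sorting steps above can be sketched with a simple x-axis coverage histogram. This is an illustrative heuristic under stated assumptions: spans are already clustered into line-level boxes, coordinates use a top-left origin with y growing downward, and `Span`, `find_column_split`, and `reading_order` are hypothetical names rather than any library's API. It handles at most two columns and does not cope with a full-width heading above a two-column body, which would need the check to run per page region.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x0: float
    y0: float  # top edge; y is assumed to grow downward
    x1: float
    y1: float


def find_column_split(spans, page_width, min_gutter=10):
    """Return the x of a vertical gutter no span crosses, or None."""
    if not spans:
        return None
    # Histogram over x: mark every unit of width covered by some span.
    covered = [False] * (int(page_width) + 1)
    for s in spans:
        for x in range(max(0, int(s.x0)), min(int(s.x1), int(page_width)) + 1):
            covered[x] = True
    left = min(i for i, c in enumerate(covered) if c)
    right = max(i for i, c in enumerate(covered) if c)
    # Find the widest uncovered run inside the text area.
    best_start, best_len, run_start = None, 0, None
    for x in range(left, right + 1):
        if covered[x]:
            run_start = None
            continue
        if run_start is None:
            run_start = x
        run_len = x - run_start + 1
        if run_len > best_len:
            best_start, best_len = run_start, run_len
    if best_len < min_gutter:
        return None
    return best_start + best_len / 2


def reading_order(spans, page_width):
    """Sort top-to-bottom within each column, then columns left-to-right."""
    split = find_column_split(spans, page_width)
    if split is None:
        columns = [spans]
    else:
        columns = [[s for s in spans if s.x1 <= split],
                   [s for s in spans if s.x0 >= split]]
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda s: (s.y0, s.x0)))
    return ordered


# File order interleaves the two columns line by line.
spans = [
    Span("The Seller shall", 50, 100, 280, 112),
    Span("7.2 Indemnity.", 320, 100, 550, 112),
    Span("deliver the Shares.", 50, 116, 280, 128),
    Span("Buyer is indemnified.", 320, 116, 550, 128),
]
ordered_texts = [s.text for s in reading_order(spans, page_width=612)]
print(ordered_texts)
```

A learned layout model would replace `find_column_split`; the sort that follows stays the same.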
Emit a JSON document tree: { "page": 12, "type": "paragraph", "text": "...",
"bbox": [...], "section_path": ["Article 7", "7.2 Indemnity"] }.
Section paths can be inferred from heading hierarchy or from bookmark outlines
when authors maintained PDF table-of-contents entries. Store the tree in object
storage; downstream chunkers read blocks, not raw PDF bytes.
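As a rough sketch of how section paths might be propagated from headings to child blocks, the snippet below uses two regular expressions as a stand-in for the font-size and numbering heuristics described above. The patterns, the `attach_section_paths` helper, and the sample blocks are all hypothetical; a real parser would also consult font metadata and PDF outline bookmarks.

```python
import json
import re

# Hypothetical heading patterns: "Article 7" is level 1,
# "7.2 Indemnity" is level 2. Real documents need more patterns.
ARTICLE_RE = re.compile(r"^Article\s+[IVXLC\d]+\.?$")
SECTION_RE = re.compile(r"^\d+\.\d+\s+\S.{0,80}$")


def attach_section_paths(blocks):
    """Label headings and copy the current heading path onto every block."""
    path = []
    out = []
    for b in blocks:
        text = b["text"].strip()
        if ARTICLE_RE.match(text):
            path = [text]
            kind = "heading"
        elif SECTION_RE.match(text):
            path = path[:1] + [text]
            kind = "heading"
        else:
            kind = b.get("type", "paragraph")
        out.append({**b, "type": kind, "section_path": list(path)})
    return out


blocks = [
    {"page": 12, "text": "Article 7", "bbox": [72, 90, 200, 104]},
    {"page": 12, "text": "7.2 Indemnity", "bbox": [72, 120, 240, 132]},
    {"page": 12, "text": "Seller shall indemnify Buyer for Losses.",
     "bbox": [72, 140, 540, 152]},
]
tree = attach_section_paths(blocks)
print(json.dumps(tree[-1], indent=2))
```

Because each block carries its own `section_path`, a chunker can build a contextual prefix without re-reading the headings.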
For multi-page paragraphs split across page breaks, merge blocks when the bottom line lacks terminal punctuation and the next block continues mid-sentence with lowercase. This prevents retrieval from returning half-clauses.
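The merge rule can be expressed as a small predicate plus a single pass over the ordered blocks. This is a sketch assuming headers and footers have already been removed, so the last block on one page and the first block on the next are adjacent in the list. `should_merge` and `merge_cross_page` are hypothetical names, and hyphenated word breaks are left to the normalization stage.

```python
TERMINAL = ".!?:;"


def should_merge(prev_text, next_text):
    """True when prev ends without terminal punctuation and next starts lowercase."""
    prev_text = prev_text.rstrip()
    next_text = next_text.lstrip()
    if not prev_text or not next_text:
        return False
    return prev_text[-1] not in TERMINAL and next_text[0].islower()


def merge_cross_page(blocks):
    """Join paragraph blocks that were split by a page break."""
    merged = []
    for b in blocks:
        prev = merged[-1] if merged else None
        if (prev
                and prev["type"] == b["type"] == "paragraph"
                and b["page"] == prev["pages"][-1] + 1  # first block of next page
                and should_merge(prev["text"], b["text"])):
            prev["text"] = prev["text"].rstrip() + " " + b["text"].lstrip()
            prev["pages"].append(b["page"])
        else:
            merged.append({"type": b["type"], "text": b["text"],
                           "pages": [b["page"]]})
    return merged


blocks = [
    {"type": "paragraph", "page": 4,
     "text": "Seller shall indemnify Buyer against all Losses arising"},
    {"type": "paragraph", "page": 5,
     "text": "out of any breach of Section 7.2."},
    {"type": "paragraph", "page": 5,
     "text": "Buyer shall give prompt notice."},
]
merged = merge_cross_page(blocks)
for m in merged:
    print(m["pages"], m["text"])
```

Keeping a `pages` list on the merged block preserves both page numbers for citations.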
Table and figure extraction
Tables deserve first-class treatment — never flatten to prose during ingest.
Detection
Rule-based: ruled lines and aligned numeric columns. Model-based: TableTransformer,
layout parsers that label Table regions. Commercial extractors
(Azure Document Intelligence, AWS Textract, Google Document AI) return cell grids
with row/column indices.
Serialization for LLMs
Common formats, pick one per pipeline and stay consistent:
- Markdown tables: readable in prompts; fragile on merged cells.
- HTML <table>: preserves colspan; good for financial exhibits.
- JSON row arrays: best for programmatic QA and structured output validation against parsed cells.
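To make the trade-off concrete, here is a minimal sketch that serializes one parsed cell grid into each of the three formats. The grid values are made up for illustration, the helper names are hypothetical, and merged cells (colspan or rowspan) are deliberately not handled.

```python
import json
from html import escape

# Illustrative grid only; the first row is treated as the header.
grid = [
    ["Metric", "FY2023", "FY2024"],
    ["Revenue", "10", "12"],
]


def to_markdown(grid):
    def row(cells):
        # Escape pipes so cell text cannot break the table syntax.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"
    header, *body = grid
    lines = [row(header), "|" + "|".join(["---"] * len(header)) + "|"]
    lines += [row(r) for r in body]
    return "\n".join(lines)


def to_html(grid):
    header, *body = grid
    head = "<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>"
    rows = ["<tr>" + "".join(f"<td>{escape(c)}</td>" for c in r) + "</tr>"
            for r in body]
    return "<table>" + head + "".join(rows) + "</table>"


def to_json_rows(grid):
    # Each row becomes a dict keyed by column header, so the fiscal year
    # stays attached to its value.
    header, *body = grid
    return [dict(zip(header, r)) for r in body]


print(to_markdown(grid))
print(to_html(grid))
print(json.dumps(to_json_rows(grid)))
```

The JSON form is the one that keeps each value explicitly bound to its column label, which is the property that guards against fiscal-year mix-ups.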
Index tables as dedicated chunks with metadata content_type=table, table_id, and caption text. Queries like “FY2024 earn-out cap” should retrieve the table chunk, not a neighboring indemnity paragraph. Figures and charts: store OCR’d caption plus optional vision-model description; keep image bytes for UI citation thumbnails, not for embedding raw pixels unless running a multimodal RAG path.
Tooling landscape
No single library wins every PDF. Typical production stacks mix:
- PyMuPDF (fitz) — fast page render and text span extraction; pair with custom reading-order heuristics.
- pdfplumber — strong table detection on ruled financial PDFs; weaker on complex legal multi-column without tuning.
- Unstructured.io — unified partition API, hi-res layout model option, outputs elements ready for chunking.
- Docling (IBM) — open layout model tuned for scientific and enterprise documents; exports structured JSON and Markdown.
- Cloud document APIs — highest accuracy on scans and handwriting; per-page cost matters at million-page scale.
Benchmark on your corpus: legal, medical, and engineering PDFs fail differently. Maintain a 50-document gold set with human-labeled reading order and table cell values; track character error rate (CER) on OCR pages and table cell exact-match rate before swapping libraries.
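Both gold-set metrics are simple to compute. The sketch below assumes CER is defined as Levenshtein edit distance divided by reference length, and that table accuracy is the share of gold cells matched exactly after trimming whitespace. Function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of hypothesis against a human-labeled reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def cell_exact_match(gold, parsed):
    """Share of gold cells whose parsed value matches exactly."""
    total = sum(len(row) for row in gold)
    hits = 0
    for i, row in enumerate(gold):
        for j, cell in enumerate(row):
            try:
                hits += parsed[i][j].strip() == cell.strip()
            except IndexError:
                pass  # missing row or cell counts as a miss
    return hits / total if total else 1.0


print(cer("Section 7.2", "Section 7,2"))
print(cell_exact_match([["a", "b"], ["c", "d"]], [["a", "b"], ["c", "x"]]))
```

Tracking the two numbers separately shows whether a library swap helped OCR quality, table structure, or neither.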
RAG ingest pipeline design
Recommended stages after parse:
- Normalize: Unicode NFC, de-hyphenate line breaks, strip repeated headers/footers matched by regex across pages.
- Enrich metadata: doc_id, filename, upload timestamp, page range, section_path, content_type.
- Chunk: respect block boundaries; keep tables atomic; prose chunks of 512–1024 tokens with 10–15% overlap, per chunking best practice.
- Optional contextual prefix: prepend section title to each chunk before embedding (Anthropic contextual retrieval pattern); see contextual retrieval.
- Embed and index: dense vectors plus BM25 on the same normalized text; hybrid search recovers exact clause numbers that OCR slightly garbled.
- Version: hash parsed tree + parser version; re-embed only when parse logic changes, not on every query.
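The normalize and chunk stages can be sketched as follows. This is an illustration under several assumptions: `count_tokens` is a whitespace word count standing in for a real tokenizer, overlap is approximated by carrying whole trailing blocks that fit in the overlap budget, and a single block larger than `max_tokens` is emitted whole rather than split. All names are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    text = unicodedata.normalize("NFC", text)
    # De-hyphenate words broken across lines. Caveat: this also joins
    # genuine hyphenated terms that happen to wrap, such as "earn-out".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"\s*\n\s*", " ", text)  # remaining soft line wraps
    return text.strip()


def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer


def chunk_blocks(blocks, max_tokens=768, overlap_ratio=0.12):
    chunks = []
    current = []  # prose blocks accumulating into the next chunk

    def emit(parts, content_type):
        chunks.append({"content_type": content_type,
                       "text": "\n\n".join(p["text"] for p in parts)})

    for b in blocks:
        if b["type"] == "table":
            if current:
                emit(current, "prose")
                current = []
            emit([b], "table")  # tables are never split or merged
            continue
        size = sum(count_tokens(p["text"]) for p in current)
        if current and size + count_tokens(b["text"]) > max_tokens:
            emit(current, "prose")
            # Overlap: carry trailing blocks that fit in the budget.
            budget, tail = int(max_tokens * overlap_ratio), []
            for p in reversed(current):
                if count_tokens(p["text"]) > budget:
                    break
                budget -= count_tokens(p["text"])
                tail.insert(0, p)
            current = tail
        current.append(b)
    if current:
        emit(current, "prose")
    return chunks


blocks = [
    {"type": "paragraph", "text": "one two three four five"},
    {"type": "paragraph", "text": "six seven eight nine ten"},
    {"type": "table", "text": "| Metric | FY2023 | FY2024 |"},
    {"type": "paragraph", "text": "eleven twelve thirteen fourteen fifteen"},
]
chunks = chunk_blocks(blocks, max_tokens=8)
print([c["content_type"] for c in chunks])
```

Chunk boundaries always fall between blocks, so no retrieved chunk starts or ends mid-paragraph or mid-table.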
For exhaustive diligence, where every page must be read, parsed blocks feed the map step of a map-reduce pipeline instead of relying on vector top-k alone. Parsing quality caps downstream recall regardless of map prompt quality.
Harbor Legal contract-ingest refactor
Harbor Legal replaced the pdftotext one-liner with a staged pipeline:
- Partition via Unstructured hi-res on born-digital deals; Textract fallback on scan-detected pages.
- Table pass: pdfplumber second opinion on exhibits labeled Schedule or Exhibit; merge cell grids when both parsers agree within edit distance 2.
- Section tagging: regex on Article / Section headings plus PDF outline bookmarks; propagate section_path to child blocks.
- Header/footer removal: drop text blocks whose normalized string matched on >80% of pages (firm name, “Confidential”).
- Chunk + contextual embed: 768-token chunks with section prefix; hybrid retrieval with SPLADE sparse layer.
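The header/footer step can be approximated with a frequency count over normalized block text. This sketch is not Harbor Legal's code: the function names are hypothetical, and collapsing digits during normalization (so that running page numbers compare equal) is an added assumption that can over-match on numbered headings.

```python
import re
from collections import Counter


def normalize_line(text):
    # Collapse digits so "Page 3 of 5" and "Page 4 of 5" compare equal.
    return re.sub(r"\d+", "#", text.strip().lower())


def strip_repeated_blocks(pages, min_share=0.8):
    """pages: one list of block strings per page. Drops blocks whose
    normalized text appears on more than min_share of all pages."""
    seen = Counter()
    for blocks in pages:
        seen.update({normalize_line(b) for b in blocks})  # once per page
    cutoff = min_share * len(pages)
    boilerplate = {key for key, n in seen.items() if n > cutoff}
    return [[b for b in blocks if normalize_line(b) not in boilerplate]
            for blocks in pages]


bodies = [
    "Seller shall indemnify Buyer.",
    "Buyer shall give prompt notice.",
    "The earn-out is capped.",
    "Governing law is Delaware.",
    "Notices must be in writing.",
]
pages = [["Confidential", body, f"Page {i} of 5"]
         for i, body in enumerate(bodies, 1)]
cleaned = strip_repeated_blocks(pages)
print(cleaned[0])
```

Counting each normalized string once per page keeps a phrase repeated several times on a single page from being mistaken for a running header.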
On the 200-question eval: overall recall@10 74% → 89%; table-specific QA 61% → 94%; earn-out fiscal-year confusion incidents dropped to zero. Parse latency added 8–22 seconds per 200-page agreement (parallel page workers) vs sub-second for plain text — acceptable at ingest time. Re-parse only on amended pages using content-hash diffing.
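Content-hash diffing for amended agreements might look like the sketch below. It assumes each page can be reduced to a stable raw-content string before parsing, and it mixes a parser version string into the hash so that a parser upgrade also invalidates cached pages. `PARSER_VERSION`, `page_hash`, and `pages_to_reparse` are hypothetical names.

```python
import hashlib

PARSER_VERSION = "layout-parser-1.4.0"  # hypothetical version string


def page_hash(page_text, parser_version=PARSER_VERSION):
    """Hash raw page content together with the parser version."""
    h = hashlib.sha256()
    h.update(parser_version.encode("utf-8"))
    h.update(b"\x00")  # separator so the two fields cannot run together
    h.update(page_text.encode("utf-8"))
    return h.hexdigest()


def pages_to_reparse(old_hashes, new_pages):
    """old_hashes: {page_no: hash}; new_pages: {page_no: raw page text}.
    Returns page numbers that are new or whose content changed."""
    return sorted(p for p, text in new_pages.items()
                  if old_hashes.get(p) != page_hash(text))


old = {1: page_hash("Article 1 text"), 2: page_hash("Article 2 text")}
new = {1: "Article 1 text",
       2: "Article 2 text, as amended",
       3: "New Schedule A"}
changed = pages_to_reparse(old, new)
print(changed)
```

Only the pages in `changed` need to go back through the parse stages; the cached trees for the rest can be reused.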
Technique decision table
| Document profile | Prefer | Avoid |
|---|---|---|
| Clean digital prose (reports, essays) | Layout parser + block chunking | Full-page OCR rasterization |
| Financial tables and schedules | Table-aware extraction; JSON or HTML cells; atomic table chunks | pdftotext flatten |
| Scanned contracts and filings | Per-page OCR + layout model on boxes | Assuming empty text layer means blank page |
| Two-column legal PDFs | Column detection before line merge | Single-column reading sort |
| Million-page archive, tight budget | Fast text-layer parse first; OCR queue for low-text pages only | Cloud OCR on every page |
| Charts and engineering drawings | Caption OCR + optional vision description chunk | Embedding raw image bytes in text index |
| Frequent document amendments | Block-level content hashing; re-parse changed pages only | Full re-ingest on any byte change |
Common pitfalls
- Treating parse as stateless — cache parsed trees; parsing dominates ingest cost on re-runs.
- Embedding before cleanup — headers duplicated 200 times become top hits for every query.
- Splitting tables across chunks — half a cap table retrieves without column headers; keep tables atomic.
- Ignoring parser version in evals — library upgrades shift bbox heuristics; re-benchmark gold sets.
- OCR confidence blindness — low-confidence cells should flag human review, not flow silently into answers.
- Missing language detection — OCR models and reading order differ for RTL scripts; route by locale.
- Skipping visual QA — spot-check random pages with bbox overlays; geometric errors are invisible in text-only logs.
- Conflating ingest with retrieval — fix parse before tuning embedding models or rerankers.
Production checklist
- Classify each page: digital text vs scan vs hybrid before choosing parser.
- Build a 50+ document gold set with labeled reading order and table cells.
- Emit structured blocks (type, bbox, page, section_path) not flat strings.
- Detect and remove repeating headers, footers, and page numbers.
- Extract tables to Markdown, HTML, or JSON; index as atomic chunks.
- Merge hyphenated line breaks and cross-page paragraph splits.
- Run OCR only on pages below text-density threshold; log CER on samples.
- Attach contextual section prefixes before embedding when headings are sparse.
- Version parse output with parser name + version hash; diff on re-ingest.
- Benchmark recall@k on table QA and clause QA separately after parse changes.
- Expose page/bbox citations in the UI so attorneys verify retrieved blocks.
- Monitor parse failures, OCR queue depth, and p95 ingest latency per doc type.
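For the benchmarking item, a minimal sketch of recall@k split by question type follows. It assumes recall@k means the share of a question's relevant chunk IDs found in the top k results, averaged per type; the eval described above may define it differently. The example records and field names are made up.

```python
from collections import defaultdict


def recall_at_k(retrieved, relevant, k=10):
    """Share of relevant IDs that appear in the top-k retrieved IDs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


def recall_by_type(examples, k=10):
    """Average recall@k separately for each question type."""
    per_type = defaultdict(list)
    for ex in examples:
        per_type[ex["qtype"]].append(
            recall_at_k(ex["retrieved"], ex["relevant"], k))
    return {q: sum(v) / len(v) for q, v in per_type.items()}


examples = [
    {"qtype": "table", "retrieved": ["t1", "p4"], "relevant": ["t1"]},
    {"qtype": "table", "retrieved": ["p2", "p3"], "relevant": ["t7"]},
    {"qtype": "clause", "retrieved": ["p9", "p1"], "relevant": ["p1"]},
]
scores = recall_by_type(examples, k=2)
print(scores)
```

Reporting the two types separately keeps a gain on clause questions from hiding a regression on table questions.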
Key takeaways
- PDF text extraction order is not reading order — layout-aware parsing is prerequisite for reliable RAG on real documents.
- Tables must stay structured through ingest; flattening columns is a common source of fiscal-year and numeric hallucinations.
- Route born-digital and scanned pages differently; running OCR on every page is slow and error-prone on clean PDFs.
- Harbor Legal lifted table QA recall from 61% to 94% by replacing pdftotext with block-level parse, table pass, and header stripping.
- Benchmark parsing on your corpus before tuning embeddings — garbage layout in guarantees garbage answers out.
Related reading
- RAG document ingestion explained — end-to-end ingest stages beyond PDF
- LLM map-reduce document processing explained — exhaustive multi-chunk review after parse
- Optical character recognition (OCR) explained — scan pipelines and CER metrics
- LLM contextual retrieval explained — section-aware embedding prefixes