Guide
LLM RAG table extraction explained
Harbor Finance’s earnings assistant indexed 2,400 quarterly filings by flattening every 10-K table into a single paragraph per page. When analysts asked “What was Q3 2024 operating margin for the Cloud segment?” the bot returned 18.2% — a figure from the consolidated income statement, not the segment breakdown three pages later. On 180 curated numeric QA pairs, wrong-line-item rate was 31% even though recall@5 on the right document was 88%. The tables were found; their structure was destroyed at ingest.
After rebuilding with dedicated table extraction — layout detection, markdown serialization with frozen headers, row-level chunks carrying column names, and a parallel structured store for exact lookups — wrong-line-item rate fell to 7% and numeric exact-match accuracy rose from 64% to 91%. This guide explains why tables fail in naive RAG, detection and serialization choices, chunking granularity, header propagation across splits, multi-page and nested tables, retrieval routing, the Harbor Finance refactor, a technique decision table, pitfalls, and a production checklist. It complements PDF parsing, document ingestion, and chunking strategies for teams whose corpora are dense with financial statements, pricing grids, and specification matrices.
Why tables break naive RAG pipelines
Tables encode meaning in two dimensions: row labels identify entities (product lines, fiscal quarters, geographic regions) and column headers identify measures (revenue, margin, YoY change). Flattening to left-to-right text destroys the join between label and value. Common failure modes:
- Column scramble — multi-column PDF layouts interleave FY2023 and FY2024 values; the model cannot tell which number belongs to which year.
- Lost headers — a 40-row table split across three chunks leaves later rows without column context; the LLM invents units or confuses “Actual” with “Guidance.”
- Subtotal blindness — indented rows (segment under division) read as independent lines when whitespace hierarchy is stripped.
- Unit drift — “Revenue (in millions)” in a caption is dropped; answers cite raw cell digits as full dollars.
- Spanning cells — merged header cells in HTML or PDF produce duplicate or empty column names in naive parsers.
Embedding models trained on prose treat “$4.2B 12.1% Cloud” as a bag of tokens. Questions like “compare segment margins Q2 vs Q3” need structured retrieval, not paraphrase similarity alone.
Detection: finding tables in documents
Before serialization, you must locate table boundaries. Approaches by source type:
Born-digital PDF and Office files
Layout parsers (pdfplumber, PyMuPDF with table mode, Unstructured.io, Docling) cluster text boxes by alignment rules: shared x-coordinates define columns, y-gaps define rows. Rule-based detectors work well on grid-line tables; borderless financial tables need tolerance tuning on column drift and right-aligned numerics.
Scanned PDFs and images
Pipeline: page image → OCR with bounding boxes → table structure model (Table Transformer, PaddleOCR structure, commercial APIs). OCR errors in decimal points are catastrophic for numeric QA; store confidence scores and flag low-confidence cells for human review.
HTML and spreadsheets
Native <table> markup and XLSX/CSV are the easy path —
parse DOM or sheet grids directly. Watch for layout tables used for page design;
filter by heuristics (data density, header row presence, numeric cell ratio).
Store provenance on every extracted table: doc_id, page number, bounding
box, table index, caption text, and extraction confidence. Citations must point users
to the visual source, not an anonymous text blob.
Serialization formats: how to represent cells as text
Once cells are identified, choose a serialization that preserves structure for both embedding and LLM synthesis:
| Format | Best for | Trade-offs |
|---|---|---|
| Markdown pipe tables | LLM context windows, human-readable chunks | Wide tables wrap poorly; merged cells need preprocessing |
HTML <table> |
Preserving colspan/rowspan, web-native corpora | Token-heavy; some models parse HTML inconsistently |
| Row-as-JSON | Machine lookup, metadata filters, exact key match | Poor embedding semantics unless you also emit NL summaries |
| Natural-language rows | Dense retrieval (“Cloud segment Q3 2024 operating margin was 18.2%”) | Verbose; must generate programmatically to avoid hallucinated labels |
| Tidy CSV / Parquet | Text-to-SQL sidecar, analytics queries | Requires separate query path; not a drop-in for vector-only RAG |
Production systems often use dual representation: markdown or NL rows for vector search, plus a normalized relational or columnar store for exact filters (“segment = Cloud AND quarter = Q3 2024”). The LLM receives whichever representation matches the query class.
Chunking granularity: whole table, row group, or cell
Chunk size for tables differs from prose chunking strategies:
- Whole-table chunk — works for small pricing matrices (< 15 rows). Embedding captures the full comparison grid; risk is token overflow and diluted vectors on wide tables.
- Row-level chunks — each data row becomes one chunk with headers prepended (“Segment | Quarter | Revenue | Margin” plus row values). Best default for financial statements; pairs well with parent-child indexes where the full table is the parent.
- Row-group chunks — batch 5–10 related rows (one business unit) when row-level creates thousands of near-duplicate vectors.
- Cell-level — rarely embedded; used in knowledge graphs or when each cell has rich metadata. High index cost, high precision for pinpoint QA.
Never split a table mid-row. If a table exceeds your chunk token budget, split on row boundaries and repeat headers on every chunk.
Header propagation and multi-page tables
Long tables repeat header rows only on the first page — or not at all when continued on page N+1. Ingest rules:
- Detect continuation tables via matching column count and x-alignment across pages.
- Carry forward the last seen header row onto every subsequent data row chunk.
- Prepend table caption and unit notes (“USD millions unless noted”) to every row chunk, not only the first.
- For stacked header rows (year over quarter), flatten to composite keys:
2024_Q3not bareQ3when multiple years appear. - Mark subtotal and total rows explicitly (
row_type: subtotal) so the LLM does not double-count when summing.
Nested tables (table inside a table cell) should be extracted as separate units with
a parent_row_id pointer. Flattening nested structures into one grid
misaligns columns unpredictably.
Retrieval routing: when to use vectors vs structured lookup
Not every table question should hit the vector index:
- Semantic / paraphrase questions (“how did cloud profitability trend?”) → embed NL row summaries or markdown chunks.
- Exact numeric lookups (“Q3 2024 cloud operating margin”) → metadata filter + structured store or text-to-SQL over normalized tables.
- Cross-table joins (“compare capex in 10-K to guidance in the earnings release”) → agentic multi-hop or hybrid search on table captions plus row chunks.
A lightweight intent classifier (or regex on quarter codes, ticker symbols, and unit patterns) routes queries before retrieval. Sending structured lookups through pure embedding search is the main cause of “right document, wrong cell” errors.
Harbor Finance refactor (worked example)
Harbor Finance replaced page-level text dumps with a table-first pipeline on 2,400 SEC filings:
- Detect — Docling layout model for tables; fallback to pdfplumber on low-confidence pages.
- Normalize — each table → tidy dataframe with snake_case column keys derived from flattened multi-row headers.
- Dual index — row-level NL summaries embedded in Pinecone; parquet files in DuckDB for filtered SQL.
- Route — queries containing fiscal period tokens or “what was / how much” patterns hit SQL first; narrative questions hit vectors.
- Synthesize — LLM receives top-k row chunks plus SQL result rows; prompt requires citing table caption and page.
Wrong-line-item rate dropped from 31% to 7%. Numeric exact-match rose from 64% to 91%. Index size grew 2.3× (row chunks vs page blobs) but query latency fell 18% because SQL short-circuited vector search on 44% of analyst questions.
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| Filings with 50+ financial tables per doc | Row chunks + SQL sidecar + query router | Whole-document prose embedding |
| Small spec sheets (< 20 rows) | Whole-table markdown chunk | Per-cell embedding explosion |
| Scanned legacy archives | OCR + structure model + confidence gating | Assuming born-digital text layers exist |
| Mostly narrative PDFs, few tables | Table detection as optional enrich step on detected pages | Building full SQL warehouse upfront |
| Real-time spreadsheet uploads | Direct CSV/Parquet ingest + column typing | Re-PDF-ing sheets before parse |
Common pitfalls
- Embedding tables without headers on every chunk — the dominant production bug; fix with mandatory header injection.
- Ignoring footnotes and captions — units and exclusions live outside the grid; attach caption text to table metadata.
- Treating percentages and basis points interchangeably —
normalize at ingest with explicit
unitfields. - Skipping validation on golden numeric QA sets — prose RAG metrics miss wrong-cell failures; maintain a table-specific eval set.
- Over-trusting LLM table regeneration — never ask the model to re-typeset a table from memory; always retrieve source cells.
- Duplicate tables across filings — restated prior-period columns create conflicting chunks; version by filing date and prefer latest restatement with deduplication.
Production checklist
- Run layout-aware table detection; log confidence per table.
- Serialize to markdown or NL rows with explicit column labels on every chunk.
- Carry caption, units, and page provenance in chunk metadata.
- Split only on row boundaries; repeat headers on continuation pages.
- Normalize numeric cells (type, scale, currency) at ingest.
- Build a structured sidecar (SQL/Parquet) for exact lookup queries.
- Route numeric vs narrative questions before retrieval.
- Maintain a golden set of table QA pairs separate from prose evals.
- Cite table caption + page in synthesis prompts.
- Version tables by source document date; dedupe restated figures.
Key takeaways
- Tables are two-dimensional data structures — flattening them to prose destroys the label-value relationships RAG depends on for numeric QA.
- Row-level chunks with repeated headers are the default sweet spot for large financial and specification tables.
- Dual representation — vectors for semantic questions, structured stores for exact lookups — cuts wrong-cell errors more than a better embedding model alone.
- Harbor Finance reduced wrong-line-item answers from 31% to 7% by detecting tables, normalizing rows, and routing structured queries to SQL before synthesis.
- Evaluate table RAG with numeric exact-match benchmarks, not only recall@k on document titles.
Related reading
- LLM PDF document parsing explained — layout analysis and reading order before table extraction
- RAG document ingestion explained — end-to-end ingest pipeline placement for table modules
- LLM text-to-SQL explained — querying normalized table stores in natural language
- RAG evaluation explained — building golden sets that include structured QA