RAG document ingestion explained
Harbor Archive's policy assistant indexed 4,200 internal PDFs with a naive
pdftotext pipeline. Two-column employee handbooks interleaved left and
right columns into nonsense paragraphs. Scanned vendor contracts returned empty strings.
Pricing tables collapsed into single lines where “$49” sat three tokens away
from “per seat.” Retrieval recall@10 on a 200-question golden set was 41%.
The refactor replaced raw extraction with a document ingestion pipeline:
layout-aware PDF parsing, conditional OCR on image-only pages, table serialization to
Markdown, heading breadcrumbs in metadata, and content-hash versioning before
chunking.
Recall@10 rose to 78% with the same embedding model and chunk size. Ingestion is the
unglamorous stage that decides whether downstream RAG ever sees real structure —
or scrambled text that no reranker can rescue.
Document ingestion is everything between a file landing in object storage and the first vector index write: format detection, text and layout extraction, cleaning, structural annotation, metadata attachment, deduplication, and handoff to chunkers. It is not embedding, retrieval, or generation, but many production RAG failures trace back to it.
This guide covers the ingestion pipeline stages, format-specific parsers (PDF, HTML, office documents, scans), table and image handling, metadata and provenance, the Harbor Archive refactor, a decision table comparing ingestion techniques from naive loaders to multimodal page images, common pitfalls, and a production checklist.
What document ingestion does in RAG
A RAG pipeline assumes you already have clean, linear text with trustworthy boundaries. Real corpora arrive as PDF exports, Confluence HTML, zipped email threads, photographed invoices, and Git repositories. Ingestion normalizes those sources into a canonical document object — typically a list of elements (title, heading, paragraph, list, table, code block) each with text, type, page number, and source URI.
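As a rough illustration of the canonical document object described above, here is a minimal sketch using dataclasses. The class names, the `add` helper, and the sample URI are assumptions made for this example; parsers such as Unstructured or Docling define their own element types.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Element kinds named in the text; a real parser may expose more.
ElementType = Literal["title", "heading", "paragraph", "list", "table", "code_block"]


@dataclass
class Element:
    text: str
    type: ElementType
    page_number: int
    source_uri: str


@dataclass
class Document:
    source_uri: str
    elements: List[Element] = field(default_factory=list)

    def add(self, text: str, type: ElementType, page_number: int) -> Element:
        """Append an element that inherits the document's source URI."""
        element = Element(text=text, type=type, page_number=page_number,
                          source_uri=self.source_uri)
        self.elements.append(element)
        return element


# Hypothetical URI and sample text, for illustration only.
doc = Document(source_uri="s3://example-bucket/handbook.pdf")
doc.add("Employee Handbook", "title", page_number=1)
doc.add("PTO accrues monthly.", "paragraph", page_number=2)
```

Keeping the element type and page number on every element is what later lets a chunker treat tables differently from paragraphs and lets the citation UI point at a page.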
The output feeds three downstream consumers:
- Chunkers — split elements into retrievable units without breaking tables or legal clauses.
- Embedders — map chunk text to vectors; garbage text produces garbage vectors.
- Citation UI — link answers back to page, section, and file version via metadata stamped during ingestion.
Skipping ingestion and calling file.read() on a PDF is the most common
reason teams blame the embedding model when retrieval fails.
Pipeline stages
Production ingestion is a directed acyclic graph, not a single function. Typical stages:
1. Intake and format detection
Accept uploads, S3 events, or webhook syncs. Sniff MIME type and magic bytes —
do not trust file extensions alone (policy.pdf may be a renamed DOCX).
Route to the correct parser. Queue large batches; rate-limit OCR-heavy jobs.
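Content-based detection can be sketched with the standard library alone. The function name and the returned labels below are assumptions for illustration; a production router would likely cover more formats and tolerate leading junk bytes before the PDF header.

```python
import io
import zipfile


def detect_format(data: bytes) -> str:
    """Return a coarse format label from file content, ignoring the filename."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"PK\x03\x04"):
        # DOCX, XLSX, and PPTX are ZIP archives; member names tell them apart.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"
        if "ppt/presentation.xml" in names:
            return "pptx"
        return "zip"
    return "unknown"


# Build a tiny DOCX-like archive in memory; imagine it was uploaded as policy.pdf.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
renamed_docx = buffer.getvalue()

print(detect_format(renamed_docx))
```

The label returned here would then key into a parser routing table, so a renamed DOCX reaches the DOCX parser instead of failing inside the PDF one.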
2. Extraction
Turn bytes into structured elements. PDFs need layout analysis; HTML needs DOM walking with script and nav elements stripped; DOCX needs its styles mapped to heading levels. Scanned pages route to OCR. Extraction is where most quality is won or lost.
3. Cleaning and normalization
Remove repeated headers and footers, hyphenation artifacts, soft line breaks, control characters, and boilerplate (“Page 3 of 47”, copyright blocks). Normalize Unicode (NFKC), collapse excessive whitespace, standardize bullet characters. Redact PII if required before indexing.
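A minimal sketch of this cleaning pass over per-page text follows. The function name, the 60% repetition threshold, and the sample pages are assumptions for illustration. The de-hyphenation rule is deliberately naive and will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re
import unicodedata
from collections import Counter
from typing import List

PAGE_NUMBER = re.compile(r"^\s*Page \d+ of \d+\s*$", re.IGNORECASE)


def clean_pages(pages: List[str], repeat_ratio: float = 0.6) -> List[str]:
    """Strip repeated header/footer lines and page numbers, then normalize text."""
    # Count on how many pages each distinct line appears.
    line_pages = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            line_pages[line] += 1
    threshold = max(2, int(len(pages) * repeat_ratio))
    boilerplate = {line for line, n in line_pages.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            ln.strip() for ln in page.splitlines()
            if ln.strip() and ln.strip() not in boilerplate
            and not PAGE_NUMBER.match(ln)
        ]
        text = unicodedata.normalize("NFKC", "\n".join(kept))
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # rejoin hyphenated breaks
        text = re.sub(r"\s*\n\s*", " ", text)           # soft line breaks to spaces
        text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)  # control chars
        cleaned.append(re.sub(r"[ \t]+", " ", text).strip())
    return cleaned


# Made-up sample pages with a repeated header and page-number footers.
pages = [
    "Harbor Archive Confidential\nThe vendor accepts indemni-\nfication duties.\nPage 1 of 3",
    "Harbor Archive Confidential\nUnused days roll over at \ufb01scal year end.\nPage 2 of 3",
    "Harbor Archive Confidential\nSee the   benefits portal.\nPage 3 of 3",
]
for page_text in clean_pages(pages):
    print(page_text)
```

Running this step before chunking keeps the repeated header out of every chunk, and NFKC folds the "fi" ligature back into plain letters so keyword search can match it.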
4. Structural enrichment
Assign heading levels, detect lists and tables, attach breadcrumb paths
(Handbook > Benefits > PTO), detect language per section. An optional
LLM pass can label document type (contract, FAQ, runbook) for routing filters.
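Breadcrumb paths can be derived with a simple heading stack. The sketch below assumes elements arrive as dictionaries with a `type`, a `text`, and a numeric `level` on headings; those field names are illustrative, not a specific parser's schema.

```python
from typing import Dict, List


def attach_breadcrumbs(elements: List[Dict]) -> List[Dict]:
    """Return copies of the elements with a 'breadcrumb' path added."""
    stack = []  # (level, heading text) for the currently open headings
    out = []
    for el in elements:
        if el["type"] == "heading":
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= el["level"]:
                stack.pop()
            stack.append((el["level"], el["text"]))
        out.append({**el, "breadcrumb": " > ".join(text for _, text in stack)})
    return out


# Made-up handbook outline for illustration.
elements = [
    {"type": "heading", "level": 1, "text": "Handbook"},
    {"type": "heading", "level": 2, "text": "Benefits"},
    {"type": "heading", "level": 3, "text": "PTO"},
    {"type": "paragraph", "text": "PTO accrues monthly."},
    {"type": "heading", "level": 2, "text": "Security"},
    {"type": "paragraph", "text": "Lock your screen."},
]
annotated = attach_breadcrumbs(elements)
print(annotated[3]["breadcrumb"])
print(annotated[5]["breadcrumb"])
```

Because the path is stored per element, a chunker can prepend it to chunk text or keep it in metadata without re-deriving the outline.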
5. Metadata and provenance
Stamp source_uri, content_hash, ingested_at,
page_range, acl_tenant, and version_id. Every
chunk inherits this metadata for filtered retrieval and citation links.
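One way to stamp these fields is to hash the raw bytes once per file and merge the result into every chunk. In this sketch the helper names are hypothetical, and deriving `version_id` from the hash prefix is an assumption; a real system might use an object-store version or a database sequence instead.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List


def build_provenance(raw: bytes, source_uri: str, acl_tenant: str) -> Dict[str, str]:
    """Compute file-level metadata once, from the raw bytes."""
    content_hash = hashlib.sha256(raw).hexdigest()
    return {
        "source_uri": source_uri,
        "content_hash": content_hash,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "acl_tenant": acl_tenant,
        "version_id": content_hash[:12],  # assumption: version derived from hash
    }


def stamp_chunks(chunks: List[Dict], provenance: Dict[str, str]) -> List[Dict]:
    """Merge file-level provenance into each chunk; chunk-level keys win."""
    return [{**provenance, **chunk} for chunk in chunks]


# Hypothetical file content, URI, and tenant.
raw = b"PTO policy v3"
provenance = build_provenance(raw, "s3://example-bucket/pto.pdf", "hr")
stamped = stamp_chunks(
    [{"text": "PTO accrues monthly.", "page_range": (2, 2)}], provenance
)
print(sorted(stamped[0].keys()))
```

Hashing the bytes rather than the extracted text means a parser upgrade does not look like a source change; whether that is desirable depends on how re-indexing is triggered.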
6. Handoff to chunking and indexing
Emit element streams to your chunking strategy, then embed and upsert into the vector store and optional BM25 index for hybrid search.
Format-specific extraction
PDF (digital text)
Never use plain pdftotext on multi-column or form PDFs. Use layout-aware
tools — Unstructured, Docling, PyMuPDF with block coordinates, or cloud APIs
(Adobe PDF Extract, Textract) — that read bounding boxes and reading order.
Emit headings as Title / Header elements rather than as body paragraphs, and do not
infer them from font size or bold weight alone.
PDF (scanned / image-only)
Detect pages with no extractable text layer. Route to OCR (Tesseract, PaddleOCR, cloud vision). Store confidence scores; flag low-confidence pages for human review or exclude from high-stakes retrieval. Deskew and denoise preprocessing materially improves recall on fax-quality scans.
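The routing and gating decisions can be sketched as two small pure functions, independent of any OCR engine. The 50-character default mirrors the threshold used in the Harbor Archive refactor later in this guide; the 0.80 confidence cutoff and the 0 to 1 confidence scale are assumptions (some engines, Tesseract among them, report 0 to 100).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageText:
    page_number: int
    text: str  # whatever the text-layer extractor returned for this page


def pages_needing_ocr(pages: List[PageText], min_chars: int = 50) -> List[int]:
    """Pages whose text layer is empty or nearly empty go to OCR."""
    return [p.page_number for p in pages if len(p.text.strip()) < min_chars]


def gate_ocr_output(mean_confidence: float, min_confidence: float = 0.80) -> str:
    """Decide whether OCR text is indexed or held for human review."""
    return "index" if mean_confidence >= min_confidence else "review"


# Made-up extraction results: page 1 has a text layer, pages 2 and 3 do not.
pages = [
    PageText(1, "Indemnification. " * 30),
    PageText(2, ""),
    PageText(3, "  \n "),
]
print(pages_needing_ocr(pages))
print(gate_ocr_output(0.93), gate_ocr_output(0.55))
```

Keeping the thresholds as parameters makes it easy to tune them per document type once a golden set shows where low-confidence pages hurt retrieval.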
HTML and wikis
Parse DOM, drop nav, footer, cookie banners, and sidebar
chrome. Convert tables to Markdown or HTML strings inside a single element. Respect
article and heading tags. For SPAs, prefer export APIs or prerendered
snapshots over scraping empty shells.
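The chrome-stripping step can be sketched with Python's standard `html.parser`. This is a simplified illustration: it skips content by tag name only, so cookie banners marked up as generic `div` elements would need class-based or id-based rules, and tables are not handled here. The class name and the sample HTML are assumptions.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "noscript"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "li"}


class ContentExtractor(HTMLParser):
    """Collect headings and paragraphs, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0      # > 0 while inside nav, footer, script, ...
        self.current_tag = None  # block tag currently being collected
        self.buffer = []
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.current_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current_tag:
            text = " ".join("".join(self.buffer).split())
            if text:
                kind = "heading" if tag in HEADING_TAGS else "paragraph"
                self.elements.append({"type": kind, "text": text})
            self.current_tag = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current_tag is not None:
            self.buffer.append(data)


# Made-up wiki page for illustration.
html = """<html><body>
<nav><ul><li>Home</li><li>Policies</li></ul></nav>
<article><h1>PTO Policy</h1><p>Employees accrue <b>paid leave</b> monthly.</p></article>
<script>trackPageView();</script>
<footer><p>Copyright Harbor Archive</p></footer>
</body></html>"""

extractor = ContentExtractor()
extractor.feed(html)
print(extractor.elements)
```

For real wikis, a dedicated HTML library or the platform's export API is usually more robust; the point of the sketch is only that navigation and footer text never reach the element stream.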
Office documents (DOCX, PPTX, XLSX)
DOCX: map Word styles to heading levels; extract comments and footnotes separately or exclude them. PPTX: emit one element per slide with speaker notes appended. XLSX: index each sheet; serialize tables with header row detection; for CRM exports, consider one row-group chunk per logical record.
Plain text, Markdown, and code
Lowest risk if encoding is UTF-8. Markdown: split on headings natively. Code repos:
respect file boundaries and syntax — do not concatenate unrelated files. Store
repo_path and commit_sha in metadata for developer RAG.
Tables, images, and attachments
Tables are the highest-value and highest-risk content in enterprise RAG.
- Serialization: convert to Markdown tables or HTML <table> inside one element so headers stay attached to rows.
- Dual indexing: store a natural-language summary chunk (“Enterprise tier: $49/seat, 99.9% SLA”) linked to the full table element ID for precise numeric answers.
- Wide tables: split by logical row groups or pivot key columns into sentence form for embedding while keeping the full table retrievable by ID.
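Serialization and dual indexing from the list above can be sketched as follows. The function names, the `table_id` value, and the second sample row are assumptions for illustration; the summary wording is one possible template, not a prescribed format.

```python
from typing import Dict, List


def to_markdown_table(headers: List[str], rows: List[List[str]]) -> str:
    """Serialize a table as one Markdown string so headers stay with rows."""
    def cell(value: object) -> str:
        return str(value).replace("|", "\\|")  # escape pipes inside cells

    lines = [
        "| " + " | ".join(cell(h) for h in headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


def row_summaries(headers: List[str], rows: List[List[str]],
                  table_id: str) -> List[Dict[str, str]]:
    """One sentence-like chunk per row, linked back to the full table."""
    summaries = []
    for row in rows:
        facts = ", ".join(f"{h} {v}" for h, v in zip(headers[1:], row[1:]))
        text = f"{row[0]} {headers[0].lower()}: {facts}"
        summaries.append({"text": text, "table_id": table_id})
    return summaries


# Sample pricing table; the values are illustrative.
headers = ["Tier", "Price per seat", "SLA"]
rows = [["Enterprise", "$49", "99.9%"], ["Team", "$19", "99.5%"]]

table_text = to_markdown_table(headers, rows)
summaries = row_summaries(headers, rows, table_id="tbl-001")
print(table_text)
print(summaries[0]["text"])
```

The summary chunks carry the semantic signal for embedding, while the `table_id` lets the retriever fetch the full table when an exact figure is needed.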
Images and diagrams: for text-only RAG, run captioning (vision model or alt text) and index the caption as text with a pointer to the image asset. True multimodal RAG relies on multimodal models instead, but caption-first ingestion still helps hybrid text retrieval find the right page.
Harbor Archive refactor (worked example)
Harbor Archive indexed HR policies, vendor MSAs, and security runbooks for an internal
Q&A portal. The v1 pipeline used PyPDF2's page-level extract_text() plus fixed
512-token chunking. Symptoms: indemnification clauses split from defined terms; PTO
tables unreadable; 18% of files returned empty (scanned amendments).
v2 ingestion changes:
- Parser: Unstructured hi_res layout mode for PDFs; Tesseract OCR fallback when text density < 50 chars/page.
- Table rule: detected tables serialized to Markdown; never split across chunks.
- Heading breadcrumbs: prepended to every paragraph element before chunking.
- Dedup: content_hash per file; re-ingest only on hash change.
- ACL metadata: department and classification fields for filtered retrieval.
Same bge-base-en-v1.5 embeddings and 512-token chunks. Golden-set
recall@10: 41% → 78%. Answer faithfulness (LLM-judge): 62% → 84%. Median
ingestion cost rose from $0.002 to $0.018 per document (OCR on 12% of pages),
still negligible next to the support ticket volume saved.
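The dedup rule in the v2 changes (re-ingest only on hash change) amounts to diffing two maps of source URI to content hash. The sketch below is an assumption about how that could look, not Harbor Archive's actual code; the function name and sample hashes are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_sync(indexed: Dict[str, str],
              current: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare stored hashes with freshly computed ones.

    Returns (to_ingest, to_tombstone): new or changed sources to re-ingest,
    and sources that disappeared and should be removed from the index.
    """
    to_ingest = sorted(
        uri for uri, digest in current.items() if indexed.get(uri) != digest
    )
    to_tombstone = sorted(uri for uri in indexed if uri not in current)
    return to_ingest, to_tombstone


# Hypothetical state: one unchanged file, one edited, one new, one deleted.
indexed = {"hr/pto.pdf": "aaa", "legal/msa.pdf": "bbb", "old/memo.pdf": "ccc"}
current = {"hr/pto.pdf": "aaa", "legal/msa.pdf": "bb2", "sec/runbook.pdf": "ddd"}

to_ingest, to_tombstone = plan_sync(indexed, current)
print(to_ingest)
print(to_tombstone)
```

Unchanged files never reach the parser or the OCR queue, which is one way to keep the per-document cost increase from compounding on every sync.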
Technique decision table
| Approach | Best for | Trade-off |
|---|---|---|
| Naive text dump (read(), basic PDF text) | Prototypes, single-column plain PDFs | Fast; fails on layout, scans, tables |
| Layout-aware open-source (Unstructured, Docling) | Most production PDF/HTML corpora | CPU/GPU cost; tune per doc type |
| Cloud document AI (Textract, Document AI) | High-volume scans, forms, compliance SLAs | Per-page cost; vendor lock-in |
| OCR-only pipeline | Legacy paper archives, faxes | Slow; error-prone without confidence gating |
| Multimodal page images (VLM per page) | Heavy charts, slides, irregular layouts | Expensive; harder to cite exact spans |
| Source-native export (HTML API, Git clone) | Wikis, repos, CMS with structured export | Best quality when available; not always offered |
Common pitfalls
- Trusting file extensions: route by magic bytes and parser probe, not .pdf in the filename.
- Ignoring reading order: multi-column PDFs and footnotes interleave without layout analysis.
- Chunking before cleaning: headers repeated in every chunk pollute embeddings; strip boilerplate first.
- Splitting tables: numeric answers require the header row co-located with data cells.
- No version stamps: stale chunks linger after a source update; always hash and re-index on change.
- OCR without confidence thresholds: “indernification” retrieves nothing useful.
- Skipping ACL metadata: ingestion must tag tenant/role before vectors are searchable.
- One parser for all formats: HTML nav chrome indexed as policy text destroys precision.
Production checklist
- Detect format by content, not extension; maintain a parser routing table.
- Use layout-aware PDF extraction for any multi-column or form document.
- OCR image-only pages with confidence scores; quarantine low-confidence output.
- Serialize tables as single elements; optional summary chunks for semantic recall.
- Strip repeated headers, footers, and page numbers before chunking.
- Attach heading breadcrumbs and page numbers to every element.
- Stamp source_uri, content_hash, and version_id on all chunks.
- Apply ACL / tenant metadata at ingest time for filtered retrieval.
- Re-ingest on hash change; tombstone deleted sources in the vector index.
- Measure recall@k on a golden set before tuning embeddings or chunk size.
- Log per-stage latency and cost (OCR pages, cloud API calls) for capacity planning.
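For the golden-set measurement in the checklist, a minimal recall@k sketch is shown below. The guide does not define its metric precisely, so this assumes the common definition: the fraction of a question's relevant chunk IDs that appear in the top k results, averaged over questions. With one relevant chunk per question this equals hit rate. The `search` callable and the sample IDs are hypothetical.

```python
from typing import Callable, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 10) -> float:
    """Share of relevant IDs found in the first k retrieved IDs."""
    if not relevant:
        raise ValueError("each golden question needs at least one relevant ID")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(golden: List[Tuple[str, Set[str]]],
                     search: Callable[[str], List[str]], k: int = 10) -> float:
    """Average recall@k over (question, relevant IDs) pairs."""
    scores = [recall_at_k(search(question), relevant, k)
              for question, relevant in golden]
    return sum(scores) / len(scores)


# Made-up golden set and a stand-in retriever backed by a dictionary.
golden = [("How much PTO do I accrue?", {"c1"}),
          ("What is the MSA indemnity cap?", {"c7", "c8"})]
fake_results = {
    "How much PTO do I accrue?": ["c4", "c1", "c9"],
    "What is the MSA indemnity cap?": ["c7", "c2", "c3"],
}
score = mean_recall_at_k(golden, lambda q: fake_results[q], k=2)
print(score)
```

Running the same golden set before and after an ingestion change, with embeddings and chunk size held fixed, isolates the effect of ingestion from the rest of the pipeline.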
Key takeaways
- Document ingestion converts raw files into structured elements with metadata — the foundation every downstream RAG stage depends on.
- Harbor Archive raised recall@10 from 41% to 78% with layout-aware parsing and table rules — same embeddings and chunk size.
- Layout-aware PDF extraction and conditional OCR matter more than swapping embedding models for messy corpora.
- Tables, scans, and multi-column PDFs need explicit handling; a naive text dump caps recall.
- Version hashes and ACL metadata belong at ingest time, not bolted on after indexing.