RAG document ingestion explained

Harbor Archive's policy assistant indexed 4,200 internal PDFs with a naive plain-text extraction pipeline. Two-column employee handbooks interleaved left and right columns into nonsense paragraphs. Scanned vendor contracts returned empty strings. Pricing tables collapsed into single lines where “$49” sat three tokens away from “per seat.” Retrieval recall@10 on a 200-question golden set was 41%. The refactor replaced raw extraction with a document ingestion pipeline: layout-aware PDF parsing, conditional OCR on image-only pages, table serialization to Markdown, heading breadcrumbs in metadata, and content-hash versioning before chunking. Recall@10 rose to 78% with the same embedding model and chunk size. Ingestion is the unglamorous stage that decides whether downstream RAG sees real structure or scrambled text that no reranker can rescue.

Document ingestion is everything between a file landing in object storage and the first vector index write: format detection, text and layout extraction, cleaning, structural annotation, metadata attachment, deduplication, and handoff to chunkers. It is not embedding, retrieval, or generation, but most production RAG failures trace back here. This guide covers the ingestion pipeline stages, format-specific parsers (PDF, HTML, office docs, scans), table and image handling, metadata and provenance, the Harbor Archive refactor, a decision table comparing techniques from naive loaders to multimodal shortcuts, common pitfalls, and a production checklist.

What document ingestion does in RAG

A RAG pipeline assumes you already have clean, linear text with trustworthy boundaries. Real corpora arrive as PDF exports, Confluence HTML, zipped email threads, photographed invoices, and Git repositories. Ingestion normalizes those sources into a canonical document object — typically a list of elements (title, heading, paragraph, list, table, code block) each with text, type, page number, and source URI.
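A minimal sketch of such a canonical document object, assuming Python dataclasses. The class names, the `metadata` field, and the example URI are illustrative choices, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

ElementType = Literal["title", "heading", "paragraph", "list", "table", "code"]


@dataclass
class Element:
    """One structural unit of a parsed document."""
    text: str
    type: ElementType
    page: Optional[int]          # 1-based page number; None for unpaginated sources
    source_uri: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Document:
    """A parsed file: an ordered list of elements plus its origin."""
    source_uri: str
    elements: list[Element] = field(default_factory=list)

    def text_of(self, *types: str) -> str:
        """Join the text of all elements whose type is one of `types`."""
        return "\n".join(e.text for e in self.elements if e.type in types)


uri = "s3://example-bucket/handbook.pdf"  # hypothetical location
doc = Document(
    source_uri=uri,
    elements=[
        Element("Employee Handbook", "title", 1, uri),
        Element("PTO accrues monthly.", "paragraph", 2, uri),
    ],
)
print(doc.text_of("paragraph"))
```

Keeping the element type and page number next to the text is what lets later stages treat tables differently from paragraphs and cite a page.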

The output feeds three downstream consumers:

Chunkers — split elements into retrievable units without breaking tables or legal clauses.
Embedders — map chunk text to vectors; garbage text produces garbage vectors.
Citation UI — link answers back to page, section, and file version via metadata stamped during ingestion.

Skipping ingestion and calling file.read() on a PDF is the most common reason teams blame the embedding model when retrieval fails.

Pipeline stages

Production ingestion is a directed acyclic graph, not a single function. Typical stages:

1. Intake and format detection

Accept uploads, S3 events, or webhook syncs. Sniff MIME type and magic bytes — do not trust file extensions alone (policy.pdf may be a renamed DOCX). Route to the correct parser. Queue large batches; rate-limit OCR-heavy jobs.
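Content-based routing can be sketched with the standard library alone. PDFs begin with `%PDF-`, and DOCX, PPTX, and XLSX files are ZIP archives whose top-level folders (`word/`, `ppt/`, `xl/`) tell them apart. The function name and the returned labels are assumptions for illustration; a production router would cover more formats.

```python
import io
import zipfile


def sniff_format(data: bytes) -> str:
    """Guess a parser route from file content, ignoring the filename."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # OOXML files are ZIP archives; the top-level folder identifies the type.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = zf.namelist()
        except zipfile.BadZipFile:
            return "unknown"
        if any(n.startswith("word/") for n in names):
            return "docx"
        if any(n.startswith("ppt/") for n in names):
            return "pptx"
        if any(n.startswith("xl/") for n in names):
            return "xlsx"
        return "zip"
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    try:
        data.decode("utf-8")
        return "text"
    except UnicodeDecodeError:
        return "unknown"


# A file named policy.pdf that is really a DOCX is routed by its bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
renamed_docx = buf.getvalue()
print(sniff_format(renamed_docx))
```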

2. Extraction

Turn bytes into structured elements. PDFs need layout analysis; HTML needs DOM walking with script and nav elements stripped; DOCX needs its Word styles mapped to heading levels. Scanned pages route to OCR. Extraction is where most quality is won or lost.

3. Cleaning and normalization

Remove repeated headers and footers, hyphenation artifacts, soft line breaks, control characters, and boilerplate (“Page 3 of 47”, copyright blocks). Normalize Unicode (NFKC), collapse excessive whitespace, standardize bullet characters. Redact PII if required before indexing.
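The cleaning steps above can be sketched as two small functions, assuming the extractor yields one string per page. The 60% recurrence threshold and the function names are illustrative assumptions. Digits are masked before counting so that “Page 3 of 47” and “Page 4 of 47” register as the same footer.

```python
import math
import re
import unicodedata
from collections import Counter


def _key(line: str) -> str:
    # Mask digits so page counters compare equal across pages.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on most pages (headers, footers, page counters)."""
    counts: Counter = Counter()
    for page in pages:
        counts.update({_key(l) for l in page.splitlines() if l.strip()})
    threshold = max(2, math.ceil(min_share * len(pages)))
    boilerplate = {k for k, n in counts.items() if n >= threshold}
    return [
        "\n".join(l for l in page.splitlines() if _key(l) not in boilerplate)
        for page in pages
    ]


def normalize_text(text: str) -> str:
    """Apply NFKC, rejoin hyphenated words, and collapse soft line breaks."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)               # "month-\nly" -> "monthly"
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)   # control chars except tab, LF, CR
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)               # single newline -> space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


pages = [
    "Harbor Archive Handbook\nPTO accrues month-\nly.\nPage 1 of 3",
    "Harbor Archive Handbook\nSick leave is separate.\nPage 2 of 3",
    "Harbor Archive Handbook\nHolidays are listed below.\nPage 3 of 3",
]
cleaned = [normalize_text(p) for p in strip_repeated_lines(pages)]
print(cleaned)
```

The hyphen rule also joins genuine compounds that happen to break at a line end, so a dictionary check is worth adding for corpora where that matters.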

4. Structural enrichment

Assign heading levels, detect lists and tables, attach breadcrumb paths (Handbook > Benefits > PTO), and detect language per section. An optional LLM pass can label document type (contract, FAQ, runbook) for routing filters.
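Breadcrumb paths fall out of a simple heading stack. The sketch below assumes elements are plain dicts with `type`, `text`, and, for headings, a numeric `level`; those key names are illustrative.

```python
def attach_breadcrumbs(elements: list[dict]) -> list[dict]:
    """Return copies of the elements, each with a 'breadcrumb' path string."""
    stack: list[tuple[int, str]] = []   # (heading level, heading text)
    out = []
    for el in elements:
        if el["type"] == "heading":
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= el["level"]:
                stack.pop()
            stack.append((el["level"], el["text"]))
        out.append({**el, "breadcrumb": " > ".join(text for _, text in stack)})
    return out


elements = [
    {"type": "heading", "level": 1, "text": "Handbook"},
    {"type": "heading", "level": 2, "text": "Benefits"},
    {"type": "heading", "level": 3, "text": "PTO"},
    {"type": "paragraph", "text": "Accrual is monthly."},
    {"type": "heading", "level": 2, "text": "Payroll"},
    {"type": "paragraph", "text": "Paid biweekly."},
]
enriched = attach_breadcrumbs(elements)
print(enriched[3]["breadcrumb"])
```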

5. Metadata and provenance

Stamp source_uri, content_hash, ingested_at, page_range, acl_tenant, and version_id. Every chunk inherits this metadata for filtered retrieval and citation links.
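A minimal stamping sketch, assuming chunks are dicts that already carry their own `text` and `page_range`. Deriving `version_id` from a prefix of the SHA-256 hash is an assumption made here for illustration; a source system's own revision ID works equally well.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(raw: bytes) -> str:
    """SHA-256 of the source bytes; changes whenever the file changes."""
    return hashlib.sha256(raw).hexdigest()


def stamp_chunks(chunks: list[dict], *, source_uri: str, raw: bytes,
                 acl_tenant: str) -> list[dict]:
    """Copy file-level provenance onto every chunk of one document."""
    digest = content_hash(raw)
    shared = {
        "source_uri": source_uri,
        "content_hash": digest,
        "version_id": digest[:12],   # assumption: short hash prefix as version label
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "acl_tenant": acl_tenant,
    }
    # Chunk-level keys (text, page_range) win over shared keys on conflict.
    return [{**shared, **chunk} for chunk in chunks]


stamped = stamp_chunks(
    [
        {"text": "PTO accrues monthly.", "page_range": (2, 2)},
        {"text": "Carryover is capped.", "page_range": (2, 3)},
    ],
    source_uri="s3://example-bucket/handbook.pdf",   # hypothetical URI
    raw=b"%PDF-1.7 demo bytes",
    acl_tenant="hr",
)
print(stamped[0]["version_id"], stamped[1]["page_range"])
```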

6. Handoff to chunking and indexing

Emit element streams to your chunking strategy, then embed and upsert into the vector store and optional BM25 index for hybrid search.
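One way to hand elements to a chunker without breaking tables is to pack whole elements greedily and never cut inside one. The sketch below assumes dict elements and counts tokens by whitespace splitting, which only approximates a real tokenizer.

```python
from typing import Callable


def pack_elements(
    elements: list[dict],
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[list[dict]]:
    """Group consecutive elements into chunks without splitting any element."""
    chunks: list[list[dict]] = []
    current: list[dict] = []
    used = 0
    for el in elements:
        n = count_tokens(el["text"])
        # Start a new chunk when this element would overflow the current one.
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(el)
        used += n
    if current:
        chunks.append(current)
    return chunks


stream = [
    {"type": "paragraph", "text": "Pricing by tier."},
    {"type": "table", "text": "| Tier | Price | | Enterprise | $49 |"},
    {"type": "paragraph", "text": "Billed annually."},
]
chunks = pack_elements(stream, max_tokens=5)
print([[el["type"] for el in chunk] for chunk in chunks])
```

An element larger than `max_tokens` ends up alone in an oversized chunk. That is deliberate for tables, while long paragraphs would need a separate sentence-level splitter.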

Format-specific extraction

PDF (digital text)

Never use plain pdftotext on multi-column or form PDFs. Use layout-aware tools — Unstructured, Docling, PyMuPDF with block coordinates, or cloud APIs (Adobe PDF Extract, Textract) — that read bounding boxes and reading order. Preserve headings as Title / Header elements, not bold paragraphs guessed by font size heuristics alone.

PDF (scanned / image-only)

Detect pages with no extractable text layer. Route to OCR (Tesseract, PaddleOCR, cloud vision). Store confidence scores; flag low-confidence pages for human review or exclude from high-stakes retrieval. Deskew and denoise preprocessing materially improves recall on fax-quality scans.
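The routing and gating logic can be sketched independently of any OCR engine. The 50-character and 80-point thresholds are assumed starting values to tune on your own corpus, and `OcrWord` is a hypothetical container for the per-word confidence that engines such as Tesseract report on a 0 to 100 scale.

```python
from dataclasses import dataclass


def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Treat a page as image-only when its text layer is nearly empty."""
    return len(extracted_text.strip()) < min_chars


@dataclass
class OcrWord:
    text: str
    confidence: float   # 0 to 100


def gate_page(words: list[OcrWord], min_mean_conf: float = 80.0) -> tuple[str, str]:
    """Return (status, text); status is 'index' or 'review'."""
    if not words:
        return "review", ""
    mean_conf = sum(w.confidence for w in words) / len(words)
    text = " ".join(w.text for w in words)
    return ("index" if mean_conf >= min_mean_conf else "review"), text


clean_scan = [OcrWord("indemnification", 96.0), OcrWord("clause", 91.0)]
fax_scan = [OcrWord("indernification", 42.0), OcrWord("clause", 88.0)]
print(gate_page(clean_scan))
print(gate_page(fax_scan))
```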

HTML and wikis

Parse DOM, drop nav, footer, cookie banners, and sidebar chrome. Convert tables to Markdown or HTML strings inside a single element. Respect article and heading tags. For SPAs, prefer export APIs or prerendered snapshots over scraping empty shells.
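As a rough illustration using only Python's built-in `html.parser`, the extractor below skips everything inside chrome tags and keeps headings and paragraphs. The tag sets are assumptions. Cookie banners usually need site-specific class or ID rules, and table handling is omitted here.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p"}


class ArticleExtractor(HTMLParser):
    """Collect heading and paragraph text that sits outside page chrome."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0          # > 0 while inside any SKIP_TAGS element
        self.current_tag = None      # block tag currently being buffered
        self.buffer: list[str] = []
        self.elements: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.current_tag, self.buffer = tag, []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current_tag:
            text = " ".join("".join(self.buffer).split())
            if text:
                kind = "heading" if tag in HEADING_TAGS else "paragraph"
                self.elements.append({"type": kind, "tag": tag, "text": text})
            self.current_tag = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current_tag:
            self.buffer.append(data)


def extract_elements(html: str) -> list[dict]:
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return parser.elements


page = (
    "<html><body><nav><p>Home | Policies</p></nav>"
    "<article><h1>PTO Policy</h1>"
    "<p>Employees accrue <b>1.5 days</b> per month.</p></article>"
    "<footer><p>Copyright Harbor Archive</p></footer>"
    "<script>track();</script></body></html>"
)
print(extract_elements(page))
```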

Office documents (DOCX, PPTX, XLSX)

DOCX: map Word styles to heading levels; extract comments and footnotes separately or exclude. PPTX: one element per slide with speaker notes appended. XLSX: index each sheet; serialize tables with header row detection; consider one row-group chunk per logical record for CRM exports.

Plain text, Markdown, and code

Lowest risk if encoding is UTF-8. Markdown: split on headings natively. Code repos: respect file boundaries and syntax — do not concatenate unrelated files. Store repo_path and commit_sha in metadata for developer RAG.
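Splitting Markdown on headings takes little code, with one catch: a line starting with `#` inside a fenced code block is a comment, not a heading. The sketch below handles ATX-style headings only and tracks fences with a simple toggle. The function name and output keys are illustrative.

```python
import re

FENCE = "`" * 3
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown(md: str) -> list[dict]:
    """Split Markdown into sections at ATX headings, ignoring fenced code."""
    sections = []
    current = {"level": 0, "title": "", "lines": []}   # text before the first heading
    in_fence = False
    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            sections.append(current)
            current = {"level": len(match.group(1)), "title": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [
        {"level": s["level"], "title": s["title"], "body": "\n".join(s["lines"]).strip()}
        for s in sections
        if s["title"] or "".join(s["lines"]).strip()
    ]


md = "\n".join([
    "# Runbook",
    "Intro.",
    "## Restart",
    FENCE,
    "# not a heading",
    FENCE,
])
sections = split_markdown(md)
print([(s["level"], s["title"]) for s in sections])
```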

Tables, images, and attachments

Tables are the highest-value and highest-risk content in enterprise RAG.

Serialization: convert to Markdown tables or HTML <table> inside one element so headers stay attached to rows.
Dual indexing: store a natural-language summary chunk (“Enterprise tier: $49/seat, 99.9% SLA”) linked to the full table element ID for precise numeric answers.
Wide tables: split by logical row groups or pivot key columns into sentence form for embedding while keeping the full table retrievable by ID.
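The serialization and pivot ideas above can be sketched as follows, assuming the parser has already produced a header row and data rows as lists of strings. The function names and the sentence template are illustrative.

```python
def table_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a table as one Markdown block so headers stay with their rows."""
    def fmt(cells: list[str]) -> str:
        # Escape pipes inside cells so they do not create extra columns.
        return "| " + " | ".join(str(c).replace("|", "\\|").strip() for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(row) for row in rows]
    return "\n".join(lines)


def row_sentences(header: list[str], rows: list[list[str]], key_col: int = 0) -> list[str]:
    """Pivot each row into a short sentence keyed by one column, for embedding."""
    sentences = []
    for row in rows:
        facts = ", ".join(
            f"{h} {v}" for i, (h, v) in enumerate(zip(header, row)) if i != key_col
        )
        sentences.append(f"{row[key_col]}: {facts}")
    return sentences


header = ["Tier", "Price per seat", "SLA"]
rows = [["Team", "$19", "99.5%"], ["Enterprise", "$49", "99.9%"]]
print(table_to_markdown(header, rows))
print(row_sentences(header, rows))
```

Store the Markdown block as the table element and index the sentences as summary chunks that carry the table element's ID in their metadata.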

Images and diagrams: for text-only RAG, run captioning (vision model or alt text) and index the caption as text with a pointer to the image asset. True multimodal RAG, which hands page images to a vision-language model, is a separate technique, but caption-first ingestion still helps hybrid text retrieval find the right page.

Harbor Archive refactor (worked example)

Harbor Archive indexed HR policies, vendor MSAs, and security runbooks for an internal Q&A portal. The v1 pipeline used PyPDF2's page-level extract_text() plus fixed 512-token chunking. Symptoms: indemnification clauses were split from their defined terms; PTO tables were unreadable; 18% of files (scanned amendments) returned empty text.

v2 ingestion changes:

Parser: Unstructured hi_res layout mode for PDFs; Tesseract OCR fallback when text density < 50 chars/page.
Table rule: detected tables serialized to Markdown; never split across chunks.
Heading breadcrumbs: prepended to every paragraph element before chunking.
Dedup: content_hash per file; re-ingest only on hash change.
ACL metadata: department and classification fields for filtered retrieval.

Same bge-base-en-v1.5 embeddings and 512-token chunks. Golden-set recall@10: 41% → 78%. Answer faithfulness (LLM-judge): 62% → 84%. Median ingestion cost rose from $0.002 to $0.018 per document (OCR on 12% of pages) — still negligible vs support ticket volume saved.
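The dedup rule in the v2 list (re-ingest only on hash change) reduces to a comparison of two hash maps keyed by source URI. This is an illustrative sketch, not Harbor Archive's actual code. The function name, the returned keys, and the URIs are assumptions.

```python
def plan_reingest(indexed: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored content hashes with fresh ones, keyed by source_uri."""
    return {
        # New files and files whose bytes changed go back through the pipeline.
        "ingest": sorted(u for u, h in current.items() if indexed.get(u) != h),
        # Files that vanished from the source need their chunks tombstoned.
        "tombstone": sorted(u for u in indexed if u not in current),
        # Identical hash: skip parsing, OCR, and embedding entirely.
        "skip": sorted(u for u, h in current.items() if indexed.get(u) == h),
    }


indexed = {"hr/pto.pdf": "aaa1", "legal/msa.pdf": "bbb2", "sec/old-runbook.pdf": "ccc3"}
current = {"hr/pto.pdf": "aaa1", "legal/msa.pdf": "bbb9", "hr/benefits.pdf": "ddd4"}
plan = plan_reingest(indexed, current)
print(plan)
```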

Technique decision table

Approach	Best for	Trade-off
Naive text dump (`read()`, basic PDF text)	Prototypes, single-column plain PDFs	Fast; fails on layout, scans, tables
Layout-aware open-source (Unstructured, Docling)	Most production PDF/HTML corpora	CPU/GPU cost; tune per doc type
Cloud document AI (Textract, Document AI)	High-volume scans, forms, compliance SLAs	Per-page cost; vendor lock-in
OCR-only pipeline	Legacy paper archives, faxes	Slow; error-prone without confidence gating
Multimodal page images (VLM per page)	Heavy charts, slides, irregular layouts	Expensive; harder to cite exact spans
Source-native export (HTML API, Git clone)	Wikis, repos, CMS with structured export	Best quality when available; not always offered

Common pitfalls

Trusting file extensions — route by magic bytes and parser probe, not .pdf in the filename.
Ignoring reading order — multi-column PDFs and footnotes interleave without layout analysis.
Chunking before cleaning — headers repeated in every chunk pollute embeddings; strip boilerplate first.
Splitting tables — numeric answers require header row co-located with data cells.
No version stamps — stale chunks linger after source update; always hash and re-index on change.
OCR without confidence thresholds — “indernification” retrieves nothing useful.
Skipping ACL metadata — ingestion must tag tenant/role before vectors are searchable.
One parser for all formats — HTML nav chrome indexed as policy text destroys precision.

Production checklist

Detect format by content, not extension; maintain a parser routing table.
Use layout-aware PDF extraction for any multi-column or form document.
OCR image-only pages with confidence scores; quarantine low-confidence output.
Serialize tables as single elements; optional summary chunks for semantic recall.
Strip repeated headers, footers, and page numbers before chunking.
Attach heading breadcrumbs and page numbers to every element.
Stamp source_uri, content_hash, and version_id on all chunks.
Apply ACL / tenant metadata at ingest time for filtered retrieval.
Re-ingest on hash change; tombstone deleted sources in the vector index.
Measure recall@k on a golden set before tuning embeddings or chunk size.
Log per-stage latency and cost (OCR pages, cloud API calls) for capacity planning.
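Measuring recall@k on a golden set needs only a few lines. The sketch below assumes each golden question is paired with the set of chunk IDs that answer it and that `search` returns ranked chunk IDs. It uses the fraction-of-relevant-found definition averaged over questions. Some teams report a hit rate instead, so state which one you use.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden question needs at least one relevant ID")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    golden: list[tuple[str, set[str]]],
    search: Callable[[str], list[str]],
    k: int = 10,
) -> float:
    """Average recall@k over every (question, relevant IDs) pair."""
    scores = [recall_at_k(search(question), relevant, k) for question, relevant in golden]
    return sum(scores) / len(scores)


# Stand-in retriever for the example; a real one would query the vector index.
fake_results = {
    "How does PTO accrue?": ["c1", "c2"],
    "What is the enterprise SLA?": ["c7", "c3"],
}
golden = [
    ("How does PTO accrue?", {"c1"}),
    ("What is the enterprise SLA?", {"c7", "c8"}),
]
score = mean_recall_at_k(golden, lambda q: fake_results[q], k=10)
print(score)
```

Run the same golden set before and after each ingestion change so that the effect of a parser or table rule is separated from embedding or chunk-size tuning.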

Key takeaways

Document ingestion converts raw files into structured elements with metadata — the foundation every downstream RAG stage depends on.
Harbor Archive raised recall@10 from 41% to 78% with layout-aware parsing and table rules — same embeddings and chunk size.
For messy corpora, layout-aware PDF extraction and conditional OCR matter more than swapping embedding models.
Tables, scans, and multi-column PDFs need explicit handling; a naive text dump puts a ceiling on recall.
Version hashes and ACL metadata belong at ingest time, not bolted on after indexing.