LLM multimodal RAG explained
Harbor Insurance's claims copilot indexed policy PDFs with a standard text pipeline: OCR, chunking, and dense text embeddings. When an adjuster asked whether flood damage to a detached garage was covered, the model quoted a paragraph about “structures on the insured premises” and answered yes. Page 14 of the same policy included a diagram with a red exclusion zone around detached outbuildings — information that never survived OCR as reliable text. The wrong answer reached a customer email draft before a human caught it.
Multimodal RAG extends retrieval-augmented generation beyond plain text: it indexes and retrieves images, chart crops, rendered PDF pages, and structured table visuals so a vision-language model (VLM) can ground answers in what documents actually look like, not only what OCR extracted. After Harbor added page-image embeddings, diagram-aware chunk metadata, and a VLM synthesis step, diagram-contradiction failures dropped sharply on its golden eval set (incorrect auto-drafts fell from 12% to 2%). This guide covers when multimodal RAG earns its complexity, indexing and retrieval patterns, generation with VLMs, the Harbor Insurance refactor, a decision table comparing techniques against text-only RAG, pitfalls, and a production checklist.
When text-only RAG fails
Many high-value documents are inherently visual. Insurance policies, engineering schematics, medical imaging reports, financial pitch decks, and scanned government forms encode constraints in layout, color, icons, and spatial relationships that flatten poorly into strings.
Text-only pipelines break in predictable ways:
- OCR drops structure — merged cells, multi-column layouts, and footnotes become garbled tokens unsuitable for semantic chunking.
- Charts and diagrams lack serial order — a bar chart's meaning lives in bar heights and axis labels together; paragraph extraction loses the joint signal.
- Scanned handwriting and stamps — low OCR confidence silently omits clauses that remain legible to human eyes.
- Cross-modal references — “see Figure 3” in text points to an image chunk that text retrieval never surfaces.
Multimodal RAG does not replace text retrieval; it adds parallel visual evidence channels so synthesis models see both the extracted clause and the figure it annotates.
Indexing taxonomy
Choose representation based on document type, latency budget, and whether you need pixel-level grounding for citations.
Rendered page images
Rasterize each PDF page (or slide) to PNG at 150–300 DPI. Embed whole pages with CLIP-style dual encoders or document-focused models (ColPali, ColQwen). Simple to implement; retrieval returns full pages the VLM reads. Storage-heavy but robust for mixed layouts.
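To make the storage trade-off concrete, the sketch below estimates raster dimensions and uncompressed size for one page at a given DPI. It assumes page sizes are given in PDF points (1/72 inch) and 8-bit RGB output; the name `render_size` is hypothetical, and real PNG files will be smaller after compression.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def render_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int, int]:
    """Return (width_px, height_px, uncompressed_rgb_bytes) for one page."""
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * 3  # 3 bytes per RGB pixel


# US Letter (612 x 792 pt) at the low, middle, and high end of the range
for dpi in (150, 200, 300):
    w, h, raw = render_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {raw / 1e6:.1f} MB uncompressed")
```

Doubling the DPI quadruples the pixel count, which is why the choice of resolution dominates storage and embedding cost for page-image indexes.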
Figure and chart crops
Detect figures with layout models during ingestion, crop bounding boxes, and index crops separately with captions from nearby text blocks. Finer retrieval granularity; requires reliable layout detection.
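One way to attach captions from nearby text blocks is a simple geometric heuristic. The sketch below assumes a layout model has already produced bounding boxes, that y coordinates grow downward, and that captions sit directly below their figures; `Box`, `TextBlock`, `nearest_caption`, and the `max_gap` default are all hypothetical names and values for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float  # y grows downward, as in most page-image coordinate systems


@dataclass
class TextBlock:
    box: Box
    text: str


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width of the horizontal span shared by two boxes (0 if none)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def nearest_caption(figure: Box, blocks: List[TextBlock], max_gap: float = 40.0) -> Optional[str]:
    """Pick the closest text block below the figure that overlaps it horizontally."""
    best: Optional[str] = None
    best_gap = max_gap
    for block in blocks:
        gap = block.box.y0 - figure.y1  # vertical distance below the figure
        if 0 <= gap <= best_gap and horizontal_overlap(figure, block.box) > 0:
            best, best_gap = block.text, gap
    return best


figure = Box(100, 200, 500, 480)
blocks = [
    TextBlock(Box(100, 60, 500, 180), "Body paragraph above the figure."),
    TextBlock(Box(100, 490, 500, 510), "Figure 3: Exclusion zone for detached outbuildings."),
]
print(nearest_caption(figure, blocks))
```

Documents that place captions above or beside figures would need a different rule, so treat the heuristic as a starting point to validate against your own layouts.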
Text + image dual indexes
Keep a text vector index for paragraphs and a visual index for page or figure embeddings. At query time, fuse ranked lists with reciprocal rank fusion (RRF) or learn a lightweight reranker. Most production systems land here.
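As a minimal sketch of reciprocal rank fusion, the function below scores each candidate as the sum of 1 / (k + rank) over the lists it appears in. It assumes both indexes return results keyed by a shared page-level ID (text chunks mapped to their source page); the IDs are made up, and k = 60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: score(d) = sum of 1 / (k + rank), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical page-level IDs from the text index and the visual index
text_hits = ["policy7:p3", "policy7:p14", "policy7:p9"]
image_hits = ["policy7:p14", "policy7:p2"]
fused = rrf_fuse([text_hits, image_hits])
print(fused)
```

Because only ranks are used, a page that appears in both lists rises to the top even though the two embedders produce scores on unrelated scales.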
Structured table serialization
For tables that OCR handles well, store HTML or markdown tables in the text index and retain a screenshot for dispute resolution. The VLM receives both when numeric precision matters.
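A small sketch of this pairing follows: the table is serialized to markdown for the text index and stored alongside a pointer to its retained screenshot. The record shape, the `screenshot_key` field, and the sample values are assumptions for illustration, not a prescribed schema.

```python
from typing import Dict, List


def table_to_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a table to pipe-delimited markdown, escaping literal pipes in cells."""
    def fmt(cells: List[str]) -> str:
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "|" + "---|" * len(header)]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


def table_record(header: List[str], rows: List[List[str]], screenshot_key: str) -> Dict[str, str]:
    """Text goes to the text index; screenshot_key points at the retained raster."""
    return {"text": table_to_markdown(header, rows), "screenshot_key": screenshot_key}


# Made-up example values
record = table_record(["Peril", "Limit"], [["Flood", "$10,000"]], "renders/p0009-table1.png")
print(record["text"])
```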
Native multimodal embeddings
Newer models embed text and image in a shared space without a separate OCR step. Quality is improving rapidly; evaluate on your domain before dropping the text channel entirely.
Retrieval and fusion patterns
Multimodal retrieval mirrors text RAG stages with modality-specific knobs:
- Query encoding — embed the user question with the same text tower used at index time; optionally append a HyDE-style hypothetical caption, written by an LLM, describing the image that would answer the question, to improve visual recall (see query expansion).
- Top-k per modality — typical starting point: 20 text chunks + 5 page images; tune from labeled Q&A.
- RRF fusion — merge ranked lists without score calibration across heterogeneous embedders; robust default when text and image scores live on different scales.
- Cross-encoder rerank — score (query, candidate) pairs with a multimodal reranker before the context window fills; pairs well with text reranking on the paragraph index.
- Metadata filters — policy ID, effective date, and document section tags narrow candidates before vector search.
Cap total retrieved tokens and image patches before generation; oversized visual context slows VLMs and dilutes attention.
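One simple way to enforce such a cap is a greedy pass over reranked candidates with separate budgets for text tokens and images. The sketch below assumes each candidate carries a reranker score and an estimated token count; the `Candidate` fields and the budget values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    cand_id: str
    modality: str      # "text" or "page_image"
    score: float       # reranker score, higher is better
    tokens: int = 0    # estimated token cost for text candidates


def cap_context(cands: List[Candidate], max_text_tokens: int, max_images: int) -> List[Candidate]:
    """Greedily keep the best-scoring candidates that fit both budgets."""
    kept: List[Candidate] = []
    used_tokens = 0
    used_images = 0
    for c in sorted(cands, key=lambda c: c.score, reverse=True):
        if c.modality == "page_image":
            if used_images < max_images:
                kept.append(c)
                used_images += 1
        elif used_tokens + c.tokens <= max_text_tokens:
            kept.append(c)
            used_tokens += c.tokens
    return kept


cands = [
    Candidate("t1", "text", 0.9, 300),
    Candidate("img1", "page_image", 0.8),
    Candidate("t2", "text", 0.7, 400),
    Candidate("img2", "page_image", 0.6),
    Candidate("img3", "page_image", 0.5),
    Candidate("t3", "text", 0.4, 100),
]
print([c.cand_id for c in cap_context(cands, max_text_tokens=500, max_images=2)])
```

The greedy pass skips a passage that would overflow the token budget but keeps looking for smaller ones, so a short, lower-ranked passage can still be included.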
Answer generation with VLMs
After retrieval, pass a structured context bundle to a VLM (GPT-4o, Gemini, Claude with vision, open Qwen-VL, etc.):
- Text passages — top paragraphs with source IDs.
- Page images — retrieved PNGs in document order.
- System policy — cite page numbers; refuse when evidence conflicts; quote numeric limits verbatim from tables.
Prompt the model to cross-check text claims against diagram content explicitly (“if a figure contradicts a paragraph, prefer the figure and explain why”). Add inline citations tied to page image IDs so auditors can open the exact raster.
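The context bundle can be sketched as a provider-neutral structure that a thin adapter would later translate into a specific VLM API call. Everything below is an assumption for illustration: the `Passage` and `PageImage` types, the bracketed ID format, the URIs, and the policy wording, which paraphrases the guidance above and should be tuned to your own conflict policy.

```python
from dataclasses import dataclass
from typing import Dict, List

# Paraphrase of the system policy described above; adjust to your conflict policy
SYSTEM_POLICY = (
    "Cite page numbers for every claim. "
    "Quote numeric limits verbatim from tables. "
    "If a figure contradicts a paragraph, prefer the figure and explain why. "
    "Refuse to answer when the evidence conflicts and cannot be reconciled."
)


@dataclass
class Passage:
    source_id: str
    page: int
    text: str


@dataclass
class PageImage:
    image_id: str
    page: int
    uri: str


def build_bundle(question: str, passages: List[Passage], images: List[PageImage]) -> Dict[str, object]:
    """Assemble system policy, ID-tagged text, and page images in document order."""
    ordered = sorted(images, key=lambda im: im.page)
    text_block = "\n".join(f"[{p.source_id} | page {p.page}] {p.text}" for p in passages)
    image_index = "\n".join(f"[{im.image_id} | page {im.page}]" for im in ordered)
    user_text = (
        f"Question: {question}\n\nText passages:\n{text_block}\n\n"
        f"Attached page images, in order:\n{image_index}\n\n"
        "Cite sources by the bracketed IDs."
    )
    return {"system": SYSTEM_POLICY, "user_text": user_text, "image_uris": [im.uri for im in ordered]}


bundle = build_bundle(
    "Is flood damage to a detached garage covered?",
    [Passage("chunk-0412", 6, "Structures on the insured premises are covered.")],
    [PageImage("img-p14", 14, "renders/p0014.png"), PageImage("img-p06", 6, "renders/p0006.png")],
)
print(bundle["user_text"])
```

Listing image IDs in the text portion gives the model a stable handle to cite, so an auditor can map each citation back to the exact raster.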
For latency-sensitive paths, use a cascade: text-only RAG first; escalate to multimodal only when confidence is low or the query mentions figures, charts, or layouts.
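A cascade router along these lines can be very small. The sketch below assumes the text-only stage reports a confidence score between 0 and 1; the keyword list and the 0.6 threshold are placeholders to tune against labeled queries, not recommended values.

```python
import re

# Hypothetical keyword list; extend it from real query logs
VISUAL_HINT = re.compile(
    r"\b(figures?|fig|charts?|diagrams?|graphs?|layouts?)\b",
    re.IGNORECASE,
)


def needs_multimodal(query: str, text_confidence: float, threshold: float = 0.6) -> bool:
    """Escalate when the text-only answer is low confidence or the query mentions visuals."""
    return text_confidence < threshold or bool(VISUAL_HINT.search(query))


print(needs_multimodal("Is the garage inside the exclusion zone in the diagram?", 0.9))
print(needs_multimodal("What is the deductible?", 0.9))
print(needs_multimodal("What is the deductible?", 0.3))
```

Keyword matching is crude and will miss visually dependent questions that never mention a figure, which is why the confidence check is kept as a second trigger.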
Harbor Insurance claims refactor (worked example)
Harbor's failure mode was diagram blindness on detached-structure exclusions. Refactor steps:
- Ingestion upgrade — layout-aware PDF parse plus 200 DPI page renders stored in object storage with content-hash keys.
- Dual index — existing text chunks unchanged; added ColPali page embeddings in a parallel vector collection tagged modality=page_image.
- Figure linker — post-ingestion job matched OCR phrases like “Figure N” to cropped figure assets for cross-reference metadata.
- Fused retrieval — top-15 text + top-3 pages via RRF; multimodal cross-encoder rerank to top-5 text + top-2 images.
- VLM synthesis — GPT-4o with citation template; conflicts between text and diagram flagged for human review instead of auto-send.
- Eval harness — 120 labeled claims questions with gold page references; tracked recall@page and hallucination rate weekly.
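The figure linker step can be approximated with a regular expression over OCR text. This is a sketch under stated assumptions, not Harbor's actual job: it handles one document at a time, assumes figure assets are already keyed by figure number, and uses hypothetical chunk and asset IDs.

```python
import re
from typing import Dict, List

# Matches "Figure 3", "figure 12", and "Fig. 3"
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)", re.IGNORECASE)


def link_figures(chunks: Dict[str, str], figure_assets: Dict[int, str]) -> Dict[str, List[str]]:
    """Map each text chunk ID to the figure asset IDs it references by number."""
    links: Dict[str, List[str]] = {}
    for chunk_id, text in chunks.items():
        numbers = sorted({int(n) for n in FIGURE_REF.findall(text)})
        assets = [figure_assets[n] for n in numbers if n in figure_assets]
        if assets:
            links[chunk_id] = assets
    return links


chunks = {
    "chunk-01": "Detached outbuildings are excluded; see Figure 3.",
    "chunk-02": "Deductibles apply per occurrence.",
    "chunk-03": "Compare Fig. 3 with Figure 5 for boundary definitions.",
}
figure_assets = {3: "fig-p14-a", 5: "fig-p21-a"}
print(link_figures(chunks, figure_assets))
```

Storing these links as chunk metadata lets retrieval pull in the referenced figure whenever the text chunk is returned.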
Outcome: page recall on diagram-dependent questions rose from 41% to 89%; incorrect auto-drafts on the eval set fell from 12% to 2%. Median latency increased 1.4 seconds on escalated multimodal paths — acceptable for adjuster-assist, not for sub-second chat widgets.
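The guide does not spell out how recall@page is computed, so the sketch below uses one plausible definition: the fraction of questions for which at least one gold page appears among the top-k retrieved pages. The question IDs and page numbers are made up.

```python
from typing import Dict, List, Set


def recall_at_page(retrieved: Dict[str, List[int]], gold: Dict[str, Set[int]], k: int) -> float:
    """Fraction of questions whose top-k retrieved pages include at least one gold page."""
    if not gold:
        return 0.0
    hits = sum(
        1 for qid, pages in gold.items()
        if pages & set(retrieved.get(qid, [])[:k])
    )
    return hits / len(gold)


gold = {"q1": {14}, "q2": {3}, "q3": {7, 8}}
retrieved = {"q1": [2, 14], "q2": [5, 6], "q3": [8]}
print(f"recall@2 = {recall_at_page(retrieved, gold, k=2):.2f}")
```

Reporting the metric separately for diagram-dependent and text-only questions keeps gains on one subset from masking regressions on the other.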
Technique decision table
| Approach | Best for | Indexing cost | Answer quality on visuals | Complexity |
|---|---|---|---|---|
| Text-only RAG (OCR + embeddings) | Clean digital PDFs, prose-heavy KBs | Low | Poor on charts and scans | Low |
| OCR + table HTML + text RAG | Spreadsheet-like docs with reliable structure | Medium | Good for tables; weak on diagrams | Medium |
| Page-image embeddings + VLM | Mixed-layout policies, decks, manuals | High (storage + embed) | Strong | Medium–high |
| Figure crops + dual index + fusion | Figure-heavy textbooks, engineering docs | High | Strongest granularity | High |
| End-to-end VLM on whole doc (no retrieval) | Very short documents only | Low index; extreme inference cost | Variable; context limits bite | Low ops; poor scale |
Start text-only per RAG fundamentals; add visual channels when eval shows systematic diagram or layout failures, not preemptively on every corpus.
Common pitfalls
- Skipping the text index — pure image retrieval misses fine-print clauses; keep both channels.
- Low-DPI renders — 72 DPI page images blur small footnotes; 150 DPI minimum for legal docs.
- Unbounded image context — stuffing 10 full pages into a VLM blows token budgets and invites lost-in-the-middle effects; rerank aggressively.
- Trusting OCR captions for figures — auto-captions hallucinate axis labels; let the VLM read the crop.
- No conflict policy — when diagram and paragraph disagree, models often pick fluent prose; force explicit conflict handling.
- Ignoring PII in page images — signatures and ID scans in rasters need the same redaction pipeline as text.
- Evaluating only on text questions — multimodal gains hide until you label figure-dependent gold sets.
- Unpinned embedding model versions — index and query vectors must come from the same model version, and re-embedding an entire corpus after an upgrade is expensive; pin and version your visual index.
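The version-pin pitfall can be guarded against with a check that runs before any search. The sketch below assumes the index stores a small manifest describing the embedder that built it; `EmbedderPin`, `EmbedderMismatch`, and the model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedderPin:
    name: str
    version: str
    dim: int


class EmbedderMismatch(RuntimeError):
    """Raised when query and index embedders differ."""


def check_pin(index_pin: EmbedderPin, query_pin: EmbedderPin) -> None:
    """Raise before searching if the query encoder differs from the one that built the index."""
    if index_pin != query_pin:
        raise EmbedderMismatch(f"index built with {index_pin}, query encoder is {query_pin}")


index_pin = EmbedderPin("page-embedder", "1.0", 128)
check_pin(index_pin, EmbedderPin("page-embedder", "1.0", 128))  # passes silently

try:
    check_pin(index_pin, EmbedderPin("page-embedder", "1.1", 128))
except EmbedderMismatch as exc:
    print(f"blocked: {exc}")
```

Failing fast here is cheaper than debugging silent recall loss from vectors that no longer share a space.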
Production checklist
- Audit corpus for figure-, chart-, and layout-dependent questions before adding visual indexes.
- Store page renders with stable IDs linked to source PDF hashes and page numbers.
- Maintain parallel text and visual vector indexes with shared metadata filters.
- Fuse modality results with RRF or a calibrated multimodal reranker.
- Cap images per request (typically 1–3 pages) after reranking.
- Prompt VLMs to cross-check text against diagrams and cite page images.
- Route low-confidence or conflict answers to human review, not auto-send.
- Track recall@page and hallucination rate on a labeled multimodal eval set.
- Apply PII redaction to rasters as well as text chunks.
- Version visual embedding models; plan re-index jobs when upgrading.
- Use cascades: text-only first, multimodal escalation on demand.
- Log retrieved page IDs with each answer for audit and debugging.
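Two checklist items, stable render IDs and per-answer audit logs, fit in a few lines. The sketch below assumes keys derived from a SHA-256 hash of the source PDF bytes plus page number and DPI, and one JSON line per answer; the key layout, the 16-character digest prefix, and the field names are illustrative choices, not a required format.

```python
import hashlib
import json
from typing import List


def page_render_key(pdf_bytes: bytes, page_number: int, dpi: int) -> str:
    """Stable ID: the same PDF bytes, page, and DPI always yield the same key."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return f"{digest[:16]}/p{page_number:04d}@{dpi}dpi.png"


def audit_record(question: str, answer: str, page_keys: List[str]) -> str:
    """One JSON line per answer, listing the page renders the model saw."""
    return json.dumps(
        {"question": question, "answer": answer, "retrieved_pages": page_keys},
        sort_keys=True,
    )


key = page_render_key(b"%PDF-1.7 placeholder bytes", page_number=14, dpi=200)
print(key)
print(audit_record("Is the detached garage covered?", "No, see page 14.", [key]))
```

Because the key depends only on content, re-ingesting an unchanged PDF maps to the same renders, while any edit to the source file produces new keys.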
Key takeaways
- Diagrams, charts and scan layout carry policy meaning that OCR-only RAG systematically loses.
- Dual text + page-image indexes with RRF fusion are the practical production default for mixed documents.
- Harbor Insurance cut diagram-contradiction failures by retrieving rendered pages and forcing VLM cross-checks against figures.
- Add multimodal channels when eval proves visual gaps — not on every corpus by default.
- Citations must point to page images as well as text spans so auditors can verify visual evidence.
Related reading
- RAG document ingestion explained — PDF parsing, OCR, layout extraction and index pipelines
- RAG chunking strategies explained — fixed, semantic and parent-child retrieval
- Vision-language models explained — how VLMs encode images and text jointly
- RAG citation and source attribution explained — inline references and span grounding