Guide

RAG citation and source attribution explained

Harbor Archive’s internal policy assistant answered “How long do we retain EU customer logs?” with a confident “90 days” — but the retrieved GDPR addendum actually said 90 days for marketing cookies, not server logs (180 days). Compliance reviewers had no way to see which paragraph the model read. Support tickets about “wrong citations” outnumbered accuracy complaints three to one. After shipping inline chunk citations with hover previews and a post-hoc span aligner that flags unsupported sentences, user-reported trust scores rose from 3.1 to 4.4 out of 5 and legal sign-off time on new deployments dropped by half.

Source attribution is the practice of linking each factual claim in a RAG answer to the retrieved passage that supports it. Citations turn opaque LLM prose into auditable output: users verify, compliance teams approve, and engineers debug retrieval failures without guessing. This guide covers citation formats, prompt and post-processing patterns, span grounding, UI affordances, the Harbor Archive refactor, a technique decision table vs RAG evaluation and hallucination guards, pitfalls, and a production checklist — building on RAG fundamentals and document ingestion.

Why citations matter beyond accuracy metrics

Offline faithfulness scores (see RAG evaluation) tell engineers whether answers match context. Citations tell users the same thing in product UI. They serve four distinct jobs:

  • Trust and verification — a clickable reference lets experts confirm the model read the right clause before acting on it.
  • Debuggability — when an answer is wrong, citations reveal whether retrieval, generation, or chunk boundaries caused the failure.
  • Compliance — regulated domains (finance, healthcare, legal) often require traceable provenance for automated advice.
  • Freshness signaling — showing document title and effective_date metadata warns users when policy may have changed.

A RAG system can score high on aggregate faithfulness while still feeling untrustworthy if answers arrive as undifferentiated paragraphs. Citations are a UX layer on top of retrieval quality — not a substitute for fixing bad chunks or hallucinations.

Citation formats and UI patterns

Inline numeric markers

The most common pattern: the model emits bracketed indices like “Retention is 180 days for server logs [2].” Each index maps to a retrieved chunk ID stored server-side. Render markers as superscript links; clicking opens a side panel with the full chunk text, source filename, page number, and deep link to the original PDF if available.

Footnote blocks

Collect references at the bottom of the answer (“Sources: [1] GDPR Addendum v3.2, p. 14”). Cleaner for long answers but weaker for sentence-level verification because users must match claims to footnotes manually.

Highlighted spans

After generation, underline each sentence and color-code by supporting chunk. Hover shows the source excerpt. Best for high-stakes review workflows; higher engineering cost than numeric markers.

Structured citation objects

Return JSON alongside prose: { "claim": "...", "chunk_id": "abc", "quote": "exact substring" }. Enables programmatic audit logs and integrates with structured outputs pipelines.

Implementation approaches

Prompt-native citations

Pass retrieved chunks numbered [1] through [N] in the system prompt and instruct the model: “Cite chunk numbers inline for every factual statement. Do not cite chunks that do not support the claim.” Cheap to ship; quality depends on model discipline. Stronger models follow instructions more reliably; smaller models may cite randomly or omit markers.

Include chunk metadata in the prompt header: title, section breadcrumb, page, and effective date. The model cannot cite what it cannot see — metadata from ingestion must land in the context block.

Post-hoc attribution (Attribution QA)

Generate the answer first without citation constraints, then run a second pass (smaller model or cross-encoder) that maps each sentence to the most relevant chunk span. Flag sentences with similarity below a threshold as unsupported and either strip them, append a disclaimer, or request human review.

Post-hoc alignment decouples fluency from faithfulness: the writer model optimizes readability; the aligner enforces evidence. Latency increases by one extra forward pass but citation precision often beats prompt-only approaches on long answers.

Native retrieval-augmented models

Some APIs return citation spans in the response payload (start/end offsets into provided documents). Prefer these when available — the serving stack handles alignment internally. Fall back to prompt-native or post-hoc when using open-weight models without built-in citation heads.

Claim extraction + verification loop

Split the answer into atomic claims (dates, amounts, policy names), retrieve a fresh mini-context per claim, and verify each independently — similar to chain-of-verification. Attach citations only to claims that pass verification. Highest latency; use for compliance-critical flows.

Chunk design for citable answers

Citations are only as good as the chunks they point to. Design choices from chunking strategy directly affect citation UX:

  • Stable chunk IDs — re-indexing must not rotate IDs users bookmarked; use content hashes or document_version + offset keys.
  • Human-readable labels — “GDPR Addendum §4.2” beats “chunk_8842” in footnotes.
  • Overlap for boundary claims — facts spanning two sections need parent-child chunk links so citations can point to both.
  • Quote-friendly length — chunks under 200 tokens cite cleanly; 800-token walls force users to hunt inside the preview.
  • Table and list preservation — serialized tables from ingestion should render in citation previews, not raw markdown soup.

Harbor Archive refactor (worked example)

Harbor Archive indexes 4,200 internal policy PDFs (HR, security, privacy). Baseline RAG: top-6 chunks, GPT-4o answer, no citations. Users could not tell whether “90 days” came from GDPR, SOC2, or a deprecated draft.

Changes shipped:

  • Prompt-native inline [n] markers with chunk headers showing title, section, page, and effective_date.
  • Post-hoc aligner (bge-reranker-large): any sentence with cross-encoder score < 0.35 marked “unverified” in UI.
  • Citation panel: PDF page thumbnail via pre-rendered images from ingestion pipeline.
  • Audit log: JSON row per answer with chunk IDs, similarity scores, and user feedback.

Results: citation precision (human-rated “correct source for claim”) = 88% on 200-question eval; unsupported-sentence flag caught 73% of faithfulness failures before users saw them. p95 latency +320 ms for aligner pass. Support tickets citing wrong sources fell 61% in six weeks.

What did not help: asking the model to paste full chunk quotes inline — answers became unreadable and quotes were often truncated mid-sentence. Footnote-only layout without sentence-level markers did not reduce compliance review time.

Technique decision table

Approach Best when Skip when
Prompt-native [n] markers Fast MVP, strong instruction-following model, short answers Long multi-claim answers; small models with poor instruction adherence
Post-hoc span aligner Sentence-level audit, compliance review, long generated prose Ultra-low-latency chat; aligner cost exceeds budget
Native API citation spans Vendor provides offsets; you want zero custom alignment code Self-hosted open models without citation heads
Claim-level verification loop Regulated advice, high cost of wrong numbers/dates Exploratory search, creative summaries
Faithfulness metrics only (no UI citations) Internal batch QA, no end-user-facing answers Any user-facing product where trust is the bottleneck
Footnotes without inline markers Short FAQ answers, mobile layouts with limited space Legal/compliance review needing per-sentence traceability

Common pitfalls

  • Citation theater — markers appear on every sentence but point to irrelevant chunks. Users learn to ignore them. Measure citation precision, not just presence.
  • Stale chunk metadata — citing “Policy v1” when v3 is indexed erodes trust faster than no citation. Surface version and date in UI.
  • Broken deep links — page offsets wrong after re-OCR or re-chunking. Tie links to stable document_version + character offset.
  • Over-citation — ten markers on a three-sentence answer clutter reading. Cite factual claims; skip obvious connective prose.
  • Missing unsupported handling — showing a citation for a hallucinated claim is worse than flagging “no source found.”
  • Ignoring retrieval failures — citations cannot fix empty or wrong retrieval. Pair attribution with query expansion and reranking upstream.

Production checklist

  • Ensure every chunk carries title, section, page, version, and effective_date metadata.
  • Number chunks consistently in the prompt context block.
  • Implement prompt-native inline markers as the first shipping milestone.
  • Build citation preview UI with full chunk text and source deep link.
  • Add post-hoc aligner for sentences above a length or risk threshold.
  • Flag or hide unsupported sentences below similarity threshold.
  • Log chunk IDs, scores, and user “wrong source” feedback per answer.
  • Measure citation precision on a labeled eval set (claim → correct chunk).
  • Re-test citations after embedding model or chunk schema changes.
  • Document citation behavior for compliance reviewers and support staff.

Key takeaways

  • Citations are a user-trust layer — offline faithfulness scores alone do not make RAG feel verifiable.
  • Prompt-native [n] markers are the fastest MVP; post-hoc aligners improve sentence-level precision on long answers.
  • Harbor Archive cut wrong-source tickets 61% with inline markers plus an unsupported-sentence flag.
  • Chunk metadata and stable IDs determine whether citations are clickable and audit-ready.
  • Measure citation precision, not citation count — fake markers destroy trust faster than none.

Related reading