LLM agent file upload and attachment ingestion pipelines explained

Harbor Legal shipped a contract-review agent where paralegals drag PDFs into chat and ask for clause summaries, red-flag tables, and counterparty risk notes. Week two of production: a vendor uploaded a 48 MB archive renamed msa.pdf, the client sent the raw bytes to a generic “read file” tool, OCR dumped 400,000 tokens of boilerplate and embedded macros into context, and the model hallucinated indemnity caps that never appeared in the real agreement. Outside counsel flagged errors on 27% of attachment-backed runs before engineering admitted the product had no ingestion pipeline, only hope-the-LLM-reads-it.

File upload ingestion is the boundary between untrusted user bytes and your agent’s reasoning loop. It is not the same as offline RAG corpus ingestion: attachments are ephemeral, tenant-scoped, often multimodal, and must land in context within seconds under strict token budgets. This guide covers upload validation, storage tiers, parse routing by media type, chunking and summarization gates, PII redaction before model injection, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist. Pair it with sandboxed execution when agents must process untrusted formats.

Why uploads are not “just another tool”

Many agent frameworks expose read_file(path) and stop there. Production uploads need a pipeline because user files are an adversarial input surface:

Type confusion — executable content with a .pdf extension or polyglot archives.
Volume attacks — gigapixel scans, thousand-page dumps, zip bombs that expand past RAM.
Privacy leakage — SSNs, health data, and credentials embedded in tables the model quotes verbatim.
Context exhaustion — naive full-text paste evicts system prompts and prior tool results from the window.
Latency cliffs — synchronous OCR on a 200-page scan blocks the entire agent turn.

The ingestion pipeline’s job is to transform bytes into agent-safe artifacts: typed text spans, image tiles, audio transcripts, and metadata — each with provenance, token cost, and retention policy attached.

Pipeline stages: from multipart POST to context blocks

Treat ingestion as an async job graph, even when the UX feels synchronous. A reference stage order:

Receive — presigned object upload or streaming multipart; never buffer unbounded in the API process.
Validate — size cap, extension allowlist, magic-byte sniff, virus scan hook, per-tenant quota.
Classify — map to handler: text/plain, application/pdf, image/*, audio/*, tabular, unsupported.
Parse — handler-specific extractors with timeouts and page limits.
Normalize — UTF-8 text, deduplicated headers/footers, table linearization, image downscale.
Protect — PII detect + vault tokenize or redact spans before any model call.
Package — chunk, embed optional retrieval keys, attach token estimates to each block.
Inject — insert into agent context via structured attachment envelope, not raw paste.

Each stage emits events to your observability trace so failures show up as INGEST_TIMEOUT rather than “model refused.”
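The stage order above can be sketched as a small runner that records one trace event per stage and turns a stage failure into a named outcome. This is a minimal sketch, not Harbor Legal's implementation: `IngestError`, `IngestJob`, `run_pipeline`, the event fields, and the two demo stages are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


class IngestError(Exception):
    """Raised by a stage with a machine-readable code such as INGEST_TIMEOUT."""

    def __init__(self, code: str, detail: str = "") -> None:
        super().__init__(f"{code}: {detail}")
        self.code = code


@dataclass
class IngestJob:
    attachment_id: str
    payload: dict = field(default_factory=dict)  # artifacts passed between stages
    events: list = field(default_factory=list)   # one trace event per executed stage


Stage = Callable[[IngestJob], None]


def run_pipeline(job: IngestJob, stages: list[tuple[str, Stage]]) -> str:
    """Run stages in order; stop at the first failure and return its code."""
    for name, stage in stages:
        started = time.monotonic()
        outcome = "ok"
        try:
            stage(job)
        except IngestError as err:
            outcome = err.code
        job.events.append({
            "attachment_id": job.attachment_id,
            "stage": name,
            "duration_ms": round((time.monotonic() - started) * 1000, 2),
            "outcome": outcome,
        })
        if outcome != "ok":
            return outcome
    return "ok"


# Hypothetical demo stages: validation passes, parsing exceeds its time budget.
def validate(job: IngestJob) -> None:
    job.payload["validated"] = True


def parse(job: IngestJob) -> None:
    raise IngestError("INGEST_TIMEOUT", "parser exceeded its time budget")


job = IngestJob(attachment_id="att_demo")
result = run_pipeline(job, [("validate", validate), ("parse", parse), ("inject", validate)])
print(result, [event["stage"] for event in job.events])
```

Because the runner stops at the first failing stage, the trace ends at the stage that failed, which is what lets a dashboard show an ingest failure instead of a confusing model response.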

Validation layer: trust nothing from the client

Client-side accept=".pdf" is UX, not security. Server rules Harbor Legal enforces:

Hard size cap per file and per conversation (e.g. 25 MB per file, 100 MB rolling per conversation), rejected with HTTP 413 and a user-visible reason.
Magic-byte verification — reject when declared MIME disagrees with content sniff beyond a tolerance list.
Archive policy — either block zip/rar/7z entirely or extract in an isolated worker with max file count and uncompressed byte budget.
Malware scan — ClamAV or cloud AV on object close; quarantine bucket on fail.
Rate limits — uploads per user per hour tied to tenant throttles.

Store originals in object storage with server-side encryption; generate an internal attachment_id (UUID) unrelated to filename so path traversal and confusing duplicates cannot reach tools.
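A few of the server rules above can be sketched with the standard library alone. Treat this as an illustration under stated assumptions: the signature table covers only four formats, the archive limits are invented numbers, the `att_` prefix is borrowed from the manifest example later in this guide, and all function names are hypothetical. Malware scanning, quotas, and rate limits are out of scope here.

```python
import io
import uuid
import zipfile
from typing import Optional

MAX_FILE_BYTES = 25 * 1024 * 1024           # per-file cap from the example above
MAX_ARCHIVE_MEMBERS = 100                    # assumed limit, tune per deployment
MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024   # assumed budget, tune per deployment

# Leading-byte signatures for a few common formats (deliberately incomplete).
SIGNATURES = {
    "application/pdf": b"%PDF-",
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "application/zip": b"PK\x03\x04",
}


def sniff_media_type(head: bytes) -> Optional[str]:
    """Return the media type implied by the leading bytes, or None."""
    for media_type, magic in SIGNATURES.items():
        if head.startswith(magic):
            return media_type
    return None


def validate_upload(data: bytes, declared_type: str) -> tuple[int, str]:
    """Return (http_status, reason) for an uploaded payload."""
    if len(data) > MAX_FILE_BYTES:
        return 413, "file exceeds the per-file size cap"
    detected = sniff_media_type(data[:16])
    if detected is None:
        return 415, "content type not recognized"
    if detected != declared_type:
        return 415, f"declared {declared_type} but content looks like {detected}"
    return 200, "ok"


def archive_within_budget(data: bytes) -> bool:
    """Check declared member count and uncompressed size without extracting.

    Header sizes can be forged, so the isolated extraction worker must still
    enforce the byte budget while it actually decompresses.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = archive.infolist()
        total = sum(member.file_size for member in members)
    return len(members) <= MAX_ARCHIVE_MEMBERS and total <= MAX_UNCOMPRESSED_BYTES


def new_attachment_id() -> str:
    """Opaque ID that carries no part of the user-supplied filename."""
    return f"att_{uuid.uuid4().hex}"
```

With this sketch, a zip archive renamed msa.pdf and declared as application/pdf is rejected at the sniff step, because its leading bytes are `PK\x03\x04` rather than `%PDF-`.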

Parse routing: one handler per media family

A single parser will produce wrong output on a large share of your traffic. Route by detected type:

Type	Parser strategy	Agent context shape
Plain text / Markdown / CSV	Charset detect, row/sample limits for CSV	Text blocks with line ranges
PDF (digital)	Text layer extract per page; fallback OCR only on scanned pages	Page-indexed spans + table JSON sidecar
PDF (scan) / images	OCR with layout; vision model tiles for diagrams	Text + optional low-res image URLs for VLM turns
Office docs	Structured extract (docx/xlsx) in sandbox; never macro-enable	Heading-aware chunks
Audio	ASR with optional diarization	Timestamped transcript segments

Cap pages processed per attachment (Harbor: 150 pages digital, 40 OCR pages). Beyond the cap, ingest returns a summary stub plus a “request extended processing” workflow rather than blocking silently or truncating mid-clause.
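Routing and page caps can be expressed as a planning step that runs before any parser does. The sketch below is a simplification with hypothetical names (`classify`, `plan_parse`, `ParsePlan`): it uses a single document-level `has_text_layer` flag, whereas the table above describes a per-page OCR fallback, and it sends images through the same OCR handler as scanned PDFs.

```python
from dataclasses import dataclass

# Harbor's caps from the text: 150 digital pages, 40 OCR pages.
PAGE_CAPS = {"pdf_digital": 150, "ocr": 40}

OFFICE_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}
TEXT_TYPES = {"text/plain", "text/markdown", "text/csv"}


@dataclass
class ParsePlan:
    handler: str
    pages_to_process: int
    truncated: bool  # True means: return a summary stub and offer extended processing


def classify(media_type: str, has_text_layer: bool = False) -> str:
    """Map a detected media type to a handler family."""
    if media_type == "application/pdf":
        return "pdf_digital" if has_text_layer else "ocr"
    if media_type.startswith("image/"):
        return "ocr"
    if media_type.startswith("audio/"):
        return "audio"
    if media_type in OFFICE_TYPES:
        return "office"
    if media_type in TEXT_TYPES:
        return "text"
    return "unsupported"


def plan_parse(media_type: str, page_count: int, has_text_layer: bool = False) -> ParsePlan:
    """Pick a handler and apply its page cap, flagging when the cap was hit."""
    handler = classify(media_type, has_text_layer)
    cap = PAGE_CAPS.get(handler)
    if cap is None or page_count <= cap:
        return ParsePlan(handler, page_count, False)
    return ParsePlan(handler, cap, True)


print(plan_parse("application/pdf", 200, has_text_layer=True))
```

The `truncated` flag is the important part: the caller can tell the user that processing stopped at the cap instead of silently handing the agent a partial document.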

Chunking, summarization, and context budgets

Parsed text must meet the agent’s context budget before the reasoning turn starts. Patterns:

Fast path — if extracted text < 8k tokens, inject full text with page citations.
Map-reduce — chunk by heading or fixed token windows; parallel mini-summaries; merge into structured outline the main agent reads (see map-reduce doc processing).
Retrieve-on-demand — embed chunks; agent gets attachment manifest + search_attachment tool instead of full paste.
Vision routing — send image tiles only when the user question references figures, signatures, or stamps.

Every injected block carries attachment_id, page_range, token_estimate, and ingest_version so truncation middleware can drop lowest-priority spans under pressure without corrupting citations.
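As an illustration of the block metadata and the truncation behavior described above, the sketch below drops whole blocks in ascending priority order until the total fits a budget. The `priority` field and the `fits_fast_path` and `fit_to_budget` names are assumptions added for this example; the source only names `attachment_id`, `page_range`, `token_estimate`, and `ingest_version`.

```python
from dataclasses import dataclass

FAST_PATH_TOKENS = 8_000  # fast-path threshold from the list above


@dataclass
class ContextBlock:
    attachment_id: str
    page_range: str
    token_estimate: int
    ingest_version: str
    priority: int  # hypothetical field: higher values are kept longer


def fits_fast_path(blocks: list[ContextBlock]) -> bool:
    """True when the whole extraction is small enough to inject in full."""
    return sum(block.token_estimate for block in blocks) < FAST_PATH_TOKENS


def fit_to_budget(
    blocks: list[ContextBlock], budget: int
) -> tuple[list[ContextBlock], list[ContextBlock]]:
    """Drop whole blocks, lowest priority first, until the total fits.

    Blocks are never cut mid-span, so page citations on surviving blocks
    remain valid. Document order is preserved in both returned lists.
    """
    total = sum(block.token_estimate for block in blocks)
    keep = set(range(len(blocks)))
    for index in sorted(keep, key=lambda i: blocks[i].priority):
        if total <= budget:
            break
        keep.discard(index)
        total -= blocks[index].token_estimate
    kept = [block for i, block in enumerate(blocks) if i in keep]
    dropped = [block for i, block in enumerate(blocks) if i not in keep]
    return kept, dropped


blocks = [
    ContextBlock("att_demo", "1-3", 3000, "v1", priority=3),
    ContextBlock("att_demo", "4-9", 5000, "v1", priority=1),
    ContextBlock("att_demo", "10-12", 2000, "v1", priority=2),
]
kept, dropped = fit_to_budget(blocks, budget=6000)
print([b.page_range for b in kept], [b.page_range for b in dropped])
```

Returning the dropped blocks as well as the kept ones lets the caller tell the agent which page ranges are available only through a search tool.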

PII and secrets: protect before the model sees bytes

Contracts and HR uploads are PII-dense. Run detection + redaction on normalized text before context injection:

Tokenize emails, phone numbers, and account IDs to vault references the agent can reason about (“Party A account [REDACTED_7]”) without echoing raw values.
Block ingestion entirely when classification hits restricted categories (PCI, PHI) if the deployment lacks compliance mode.
Log redaction counts per attachment for audit; never log raw matches.

Harbor’s pre-model redaction cut accidental PII echo in model outputs from 11% to 0.3% on attachment runs.
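A minimal sketch of vault tokenization, assuming simple regular expressions stand in for a real detector. The three patterns are illustrative only and will miss many real-world formats; the `Vault` class and `redact` function are hypothetical names, and a production vault would be an access-controlled service rather than an in-memory dictionary.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


class Vault:
    """Maps raw values to stable tokens so repeated values share one reference."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}

    def tokenize(self, raw: str) -> str:
        if raw not in self._token_by_value:
            self._token_by_value[raw] = f"REDACTED_{len(self._token_by_value) + 1}"
        return self._token_by_value[raw]


def redact(text: str, vault: Vault) -> tuple[str, dict[str, int]]:
    """Replace matches with vault tokens; return the text and per-label counts.

    Only counts are returned for audit logging, never the raw matches.
    """
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        def replace(match: re.Match, label: str = label) -> str:
            counts[label] = counts.get(label, 0) + 1
            return f"[{vault.tokenize(match.group(0))}]"

        text = pattern.sub(replace, text)
    return text, counts


vault = Vault()
sample = "Contact jane@example.com or 555-123-4567; SSN 123-45-6789. Email jane@example.com again."
redacted, counts = redact(sample, vault)
print(redacted)
print(counts)
```

Reusing one token per distinct value keeps the redacted text coherent: the agent can still tell that two mentions refer to the same email address without ever seeing it.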

Attachment envelope: how tools reference files safely

Agents should not receive filesystem paths. Expose an attachment manifest in the run payload:

{
  "attachment_id": "att_8f2c…",
  "filename_display": "vendor-msa.pdf",
  "media_type": "application/pdf",
  "pages_ingested": 42,
  "token_estimate": 12400,
  "chunks": [
    { "chunk_id": "c_01", "pages": "1-3", "summary": "…" }
  ],
  "tools_allowed": ["search_attachment", "quote_attachment_span"]
}

Tools fetch chunk text by ID from an internal API scoped to the run’s tenant_id and isolation boundary. Tools in the style of cat /tmp/upload are removed from production manifests.
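The scoping rule can be sketched as a chunk store whose lookup key always includes the tenant taken from the run context, never from tool arguments. Everything here is hypothetical scaffolding (`RunContext`, `ChunkStore`, `AttachmentAccessError`); only the tool name `quote_attachment_span` and the `tenant_id` field come from the text above.

```python
from dataclasses import dataclass


class AttachmentAccessError(Exception):
    """Raised when a run asks for a chunk outside its tenant or manifest."""


@dataclass(frozen=True)
class RunContext:
    tenant_id: str
    attachment_ids: frozenset  # attachments listed in this run's manifest


class ChunkStore:
    def __init__(self) -> None:
        # Key includes tenant_id so a cross-tenant lookup can never hit.
        self._chunks: dict[tuple[str, str, str], str] = {}

    def put(self, tenant_id: str, attachment_id: str, chunk_id: str, text: str) -> None:
        self._chunks[(tenant_id, attachment_id, chunk_id)] = text

    def quote_attachment_span(self, run: RunContext, attachment_id: str, chunk_id: str) -> str:
        """Return chunk text for an attachment that belongs to this run."""
        if attachment_id not in run.attachment_ids:
            raise AttachmentAccessError("attachment is not part of this run")
        key = (run.tenant_id, attachment_id, chunk_id)
        if key not in self._chunks:
            raise AttachmentAccessError("unknown chunk for this tenant")
        return self._chunks[key]


store = ChunkStore()
store.put("tenant_a", "att_1", "c_01", "Clause 1: Term and renewal.")

run_a = RunContext("tenant_a", frozenset({"att_1"}))
print(store.quote_attachment_span(run_a, "att_1", "c_01"))
```

The model only ever supplies `attachment_id` and `chunk_id`; since neither is a path and the tenant comes from the run, a guessed ID from another tenant resolves to nothing.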

Retention, deletion, and compliance

Attachments are not your RAG index. Default policies:

Session TTL — delete originals and parsed chunks N hours after run completion unless user pins.
Legal hold — flag prevents GC until case closes; separate bucket with stricter ACL.
Export control — block cross-region copy when tenant data residency requires it.
User delete — hard-delete object + embeddings + cache keys on GDPR erasure request.

Pair these policies with audit trails that record who uploaded each file, the ingest outcome, and which chunks were quoted in outbound actions.
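The session TTL and legal-hold rules reduce to a small predicate that a garbage-collection job could apply. This is a sketch with assumed names and an assumed 24-hour TTL (the text leaves N unspecified); deletion of the object, embeddings, and cache keys would happen in whatever job consumes the returned IDs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredAttachment:
    attachment_id: str
    run_completed_at: datetime
    pinned: bool = False      # user asked to keep the file
    legal_hold: bool = False  # blocks garbage collection until released


def is_expired(att: StoredAttachment, now: datetime, ttl: timedelta) -> bool:
    """Pinned or held attachments never expire; others expire after the TTL."""
    if att.legal_hold or att.pinned:
        return False
    return now - att.run_completed_at >= ttl


def sweep(attachments: list[StoredAttachment], now: datetime, ttl: timedelta) -> list[str]:
    """Return IDs whose originals and parsed chunks should be deleted."""
    return [a.attachment_id for a in attachments if is_expired(a, now, ttl)]


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
ttl = timedelta(hours=24)  # assumed value for N
inventory = [
    StoredAttachment("att_old", now - timedelta(hours=30)),
    StoredAttachment("att_fresh", now - timedelta(hours=2)),
    StoredAttachment("att_held", now - timedelta(days=90), legal_hold=True),
    StoredAttachment("att_pinned", now - timedelta(days=5), pinned=True),
]
print(sweep(inventory, now, ttl))
```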

Harbor Legal refactor

After the MSA incident, Harbor shipped five changes:

Typed ingestion workers with per-handler timeouts and page caps; zip uploads rejected at edge.
Presigned uploads to quarantine bucket; API never holds full file in memory.
Map-reduce default for PDFs > 12k tokens; full text only on user opt-in.
PII gate on all legal vertical attachments before first model token.
Attachment-scoped tools replacing generic file read; citations required on clause claims.

Attachment-backed error rate (outside counsel review) fell from 27% to 1.4%. p95 ingest latency for a 40-page digital PDF dropped from 38s (blocking OCR) to 6s (text-layer first). Support tickets tagged “wrong document” fell 81%.

Technique decision table

Approach	Strengths	Weaknesses	Best for
Paste / drag raw into prompt	Fastest prototype	No validation, no citations, context blowups	Internal demos only
Sync parse in request thread	Simple mental model	Timeouts, no retry, blocks websocket	Small text files < 1 MB
Async ingestion pipeline (recommended)	Scalable, gated, observable stages	Engineering upfront	Production agents with uploads
Pre-index to RAG only	Great for corpora	Slow for one-off chat attachments	Knowledge bases, not chat uploads
Client-side extract (browser PDF.js)	Offloads server CPU	Untrusted extract, inconsistent layout	Supplement only; server must re-validate

Common pitfalls

Trusting client MIME types — always sniff magic bytes; reject polyglots.
Synchronous OCR on uploads — blocks the agent turn; queue ingest and stream progress events.
Full-document paste — evicts tools and memory; default to manifest + search tool.
Filesystem paths in tool args — path traversal and cross-tenant reads; use attachment IDs.
Skipping PII on “internal” docs — contracts are the highest-risk class.
Infinite retention — compliance debt; TTL + legal hold is enough for most SaaS.
No ingest versioning — parser upgrades change quotes; bump ingest_version for reproducibility.

Production checklist

Presigned upload to quarantine bucket with size and rate limits.
Magic-byte verify + extension allowlist; block or sandbox archives.
Malware scan on object close before promote to parse queue.
Route parsers by media family with per-handler timeouts and page caps.
Run PII detection before any model context injection.
Package chunks with token estimates and page citations.
Expose attachment-scoped tools, not raw filesystem reads.
Default map-reduce or retrieve-on-demand for large documents.
Emit ingest spans to observability (stage, duration, outcome).
TTL-delete originals and chunks; support legal hold and erasure.

Key takeaways

Uploads are adversarial input — validate, scan, and classify before parse.
Ingestion is not RAG indexing — optimize for ephemeral, cited, budgeted context.
Parse routing beats one-size-fits-all — PDF text layer before OCR; cap pages.
PII gates belong pre-model — redact spans, not apologies after leakage.
Harbor Legal cut attachment errors from 27% to 1.4% with typed workers, chunk budgets, and attachment-scoped tools.