LLM agent file upload and attachment ingestion pipeline systems explained
Harbor Legal shipped a contract-review agent where paralegals drag PDFs
into chat and ask for clause summaries, red-flag tables, and
counterparty risk notes. Week two of production: a vendor uploaded a
48 MB archive renamed msa.pdf, the client sent the
raw bytes to a generic “read file” tool, OCR dumped
400,000 tokens of boilerplate and embedded macros into context, and the
model hallucinated indemnity caps that never appeared in the real
agreement. Outside counsel flagged the error on
27% of attachment-backed runs before engineering
admitted the product had no ingestion pipeline — only
hope-the-LLM-reads-it.
File upload ingestion is the boundary between untrusted user bytes and your agent’s reasoning loop. It is not the same as offline RAG corpus ingestion: attachments are ephemeral, tenant-scoped, often multimodal, and must land in context within seconds under strict token budgets. This guide covers upload validation, storage tiers, parse routing by media type, chunking and summarization gates, PII redaction before model injection, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist. Pair it with sandboxed execution when agents must process untrusted formats.
Why uploads are not “just another tool”
Many agent frameworks expose read_file(path) and stop
there. Production uploads need a pipeline because user files are
adversarial input surface:
- Type confusion: executable content with a .pdf extension, or polyglot archives.
- Volume attacks: gigapixel scans, thousand-page dumps, zip bombs that expand past RAM.
- Privacy leakage: SSNs, health data, and credentials embedded in tables the model quotes verbatim.
- Context exhaustion: naive full-text paste evicts system prompts and prior tool results from the window.
- Latency cliffs: synchronous OCR on a 200-page scan blocks the entire agent turn.
The ingestion pipeline’s job is to transform bytes into agent-safe artifacts: typed text spans, image tiles, audio transcripts, and metadata — each with provenance, token cost, and retention policy attached.
Pipeline stages: from multipart POST to context blocks
Treat ingestion as an async job graph, even when the UX feels synchronous. A reference stage order:
- Receive: presigned object upload or streaming multipart; never buffer unbounded in the API process.
- Validate: size cap, extension allowlist, magic-byte sniff, virus scan hook, per-tenant quota.
- Classify: map to a handler (text/plain, application/pdf, image/*, audio/*, tabular, unsupported).
- Parse: handler-specific extractors with timeouts and page limits.
- Normalize: UTF-8 text, deduplicated headers/footers, table linearization, image downscale.
- Protect: PII detect + vault tokenize or redact spans before any model call.
- Package: chunk, embed optional retrieval keys, attach token estimates to each block.
- Inject: insert into agent context via structured attachment envelope, not raw paste.
Each stage emits events to your
observability
trace so failures show up as INGEST_TIMEOUT rather than
“model refused.”
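The stage order above can be sketched as a small async runner. This is a minimal illustration, not a reference implementation: `Stage`, `IngestEvent`, `IngestResult`, and `run_pipeline` are hypothetical names, and `INGEST_ERROR` is an assumed outcome code alongside the `INGEST_TIMEOUT` mentioned in the text. Each stage gets its own time budget, and every outcome is recorded as an event so a slow parser is visible in the trace.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable

# Hypothetical names throughout; not tied to any specific framework.
StageFn = Callable[[Any], Awaitable[Any]]


@dataclass
class Stage:
    name: str
    fn: StageFn
    timeout_s: float


@dataclass
class IngestEvent:
    stage: str
    outcome: str        # "ok", "INGEST_TIMEOUT", or "INGEST_ERROR" (assumed code)
    duration_ms: float


@dataclass
class IngestResult:
    value: Any = None
    events: list[IngestEvent] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return bool(self.events) and all(e.outcome == "ok" for e in self.events)


async def run_pipeline(stages: list[Stage], payload: Any) -> IngestResult:
    """Run stages in order; stop at the first failure and record why."""
    result = IngestResult(value=payload)
    for stage in stages:
        start = time.monotonic()
        try:
            result.value = await asyncio.wait_for(stage.fn(result.value), stage.timeout_s)
            outcome = "ok"
        except asyncio.TimeoutError:
            outcome = "INGEST_TIMEOUT"
        except Exception:
            outcome = "INGEST_ERROR"
        elapsed_ms = (time.monotonic() - start) * 1000
        result.events.append(IngestEvent(stage.name, outcome, elapsed_ms))
        if outcome != "ok":
            break
    return result
```

In a real deployment the events would be exported as spans to the tracing backend rather than kept in a list, and each stage would likely run in its own worker instead of one coroutine chain.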
Validation layer: trust nothing from the client
Client-side accept=".pdf" is UX, not security.
Server rules Harbor Legal enforces:
- Hard size cap per file and per conversation (e.g. 25 MB / 100 MB rolling) with a 413 and a user-visible reason.
- Magic-byte verification: reject when declared MIME disagrees with content sniff beyond a tolerance list.
- Archive policy: either block zip/rar/7z entirely or extract in an isolated worker with max file count and uncompressed byte budget.
- Malware scan: ClamAV or cloud AV on object close; quarantine bucket on fail.
- Rate limits: uploads per user per hour tied to tenant throttles.
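For the archive policy, a pre-extraction budget check might look like the sketch below. It uses Python's standard `zipfile` module to read the archive's central directory without extracting anything. The limits and the `check_zip_budget` name are assumptions for illustration, not Harbor's actual values.

```python
import io
import zipfile

MAX_MEMBERS = 200                            # assumed limit, tune per deployment
MAX_UNCOMPRESSED_BYTES = 100 * 1024 * 1024   # assumed limit, tune per deployment


def check_zip_budget(data: bytes,
                     max_members: int = MAX_MEMBERS,
                     max_bytes: int = MAX_UNCOMPRESSED_BYTES) -> tuple[bool, str]:
    """Inspect the central directory only; nothing is extracted here."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            members = archive.infolist()
    except zipfile.BadZipFile:
        return False, "not a valid zip archive"
    if len(members) > max_members:
        return False, f"too many members: {len(members)}"
    declared = sum(m.file_size for m in members)
    if declared > max_bytes:
        return False, f"uncompressed size {declared} exceeds budget"
    return True, "within budget"
```

Declared sizes come from the archive itself and can be forged, so this check is only a cheap first filter. The isolated worker should still count bytes while streaming each member and abort once the budget is exceeded.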
Store originals in object storage with server-side encryption;
generate an internal attachment_id (UUID) unrelated to
filename so path traversal and confusing duplicates cannot reach tools.
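A minimal sketch of the server-side checks, assuming the caller passes the declared MIME type, the object size, and the first bytes of the file. The signature table is a tiny illustrative subset, the 415 status for type mismatches is an assumption (the text only specifies 413 for size), and `validate_upload` is a hypothetical helper name.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

MAX_FILE_BYTES = 25 * 1024 * 1024  # per-file cap from the text (25 MB)

# Illustrative subset only; a real deployment would use a maintained
# content-sniffing library with a much longer signature list.
MAGIC_SIGNATURES = {
    "application/pdf": [b"%PDF-"],
    "image/png": [b"\x89PNG\r\n\x1a\n"],
    "image/jpeg": [b"\xff\xd8\xff"],
    "application/zip": [b"PK\x03\x04"],
}
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}


@dataclass
class ValidationResult:
    ok: bool
    status: int
    reason: str
    attachment_id: Optional[str] = None


def sniff_type(head: bytes) -> Optional[str]:
    """Return the media type whose magic bytes match the start of the file."""
    for media_type, prefixes in MAGIC_SIGNATURES.items():
        if any(head.startswith(p) for p in prefixes):
            return media_type
    return None


def validate_upload(declared_type: str, size_bytes: int, head: bytes) -> ValidationResult:
    if size_bytes > MAX_FILE_BYTES:
        return ValidationResult(False, 413, "file exceeds 25 MB limit")
    sniffed = sniff_type(head)
    if sniffed not in ALLOWED_TYPES:
        return ValidationResult(False, 415, f"unsupported content: {sniffed or 'unknown'}")
    if sniffed != declared_type:
        return ValidationResult(False, 415, f"declared {declared_type} but content is {sniffed}")
    # The ID is random and unrelated to the filename, as the text recommends.
    return ValidationResult(True, 200, "accepted", attachment_id=f"att_{uuid.uuid4().hex}")
```

Under these assumptions, a zip archive renamed to msa.pdf is rejected because its leading bytes are `PK\x03\x04`, regardless of the extension or the declared MIME type.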
Parse routing: one handler per media family
A single parser guarantees wrong output on half your traffic. Route by detected type:
| Type | Parser strategy | Agent context shape |
|---|---|---|
| Plain text / Markdown / CSV | Charset detect, row/sample limits for CSV | Text blocks with line ranges |
| PDF (digital) | Text layer extract per page; fallback OCR only on scanned pages | Page-indexed spans + table JSON sidecar |
| PDF (scan) / images | OCR with layout; vision model tiles for diagrams | Text + optional low-res image URLs for VLM turns |
| Office docs | Structured extract (docx/xlsx) in sandbox; never macro-enable | Heading-aware chunks |
| Audio | ASR with diarization optional | Timestamped transcript segments |
Cap pages processed per attachment (Harbor: 150 pages digital, 40 OCR pages). Beyond the cap, ingest returns a summary stub plus a “request extended processing” workflow rather than blocking silently or truncating mid-clause.
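The routing table and page caps can be expressed as a small planning function. This is a simplified sketch: handler names are hypothetical labels, and it decides text layer versus OCR once per document, whereas the table above describes a per-page OCR fallback.

```python
from dataclasses import dataclass

# Page caps taken from the text; handler names are hypothetical labels.
PAGE_CAPS = {"pdf_text": 150, "ocr": 40}


@dataclass
class ParsePlan:
    handler: str
    pages_to_parse: int
    over_cap: bool  # True: return a summary stub and offer extended processing


def route(media_type: str, page_count: int = 1, has_text_layer: bool = False) -> ParsePlan:
    """Pick a handler from the detected media type and apply its page cap."""
    if media_type == "application/pdf":
        handler = "pdf_text" if has_text_layer else "ocr"
    elif media_type.startswith("image/"):
        handler = "ocr"
    elif media_type.startswith("audio/"):
        handler = "asr"
    elif media_type in ("text/plain", "text/markdown", "text/csv"):
        handler = "text"
    else:
        handler = "unsupported"
    cap = PAGE_CAPS.get(handler)
    if cap is None:
        return ParsePlan(handler, page_count, False)
    return ParsePlan(handler, min(page_count, cap), page_count > cap)
```

A 200-page digital PDF yields a plan for 150 pages with `over_cap` set, which the caller can turn into the summary stub and extended-processing offer instead of a silent truncation.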
Chunking, summarization, and context budgets
Parsed text must meet the agent’s context budget before the reasoning turn starts. Patterns:
- Fast path: if extracted text < 8k tokens, inject full text with page citations.
- Map-reduce: chunk by heading or fixed token windows; parallel mini-summaries; merge into structured outline the main agent reads (see map-reduce doc processing).
- Retrieve-on-demand: embed chunks; agent gets attachment manifest + search_attachment tool instead of full paste.
- Vision routing: send image tiles only when the user question references figures, signatures, or stamps.
Every injected block carries attachment_id,
page_range, token_estimate, and
ingest_version so
truncation middleware
can drop lowest-priority spans under pressure without corrupting
citations.
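One way truncation under pressure could work is to drop whole blocks by priority and never cut inside a span. The sketch below assumes a hypothetical `priority` field on each block (the text does not say how priority is assigned) and a greedy keep-highest-first policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBlock:
    attachment_id: str
    page_range: str
    token_estimate: int
    ingest_version: str
    priority: int  # hypothetical field: higher means keep longer
    text: str


def fit_to_budget(blocks: list[ContextBlock], budget_tokens: int) -> list[ContextBlock]:
    """Keep whole blocks, highest priority first, until the budget is spent.

    Blocks are never cut mid-span, so page_range citations stay valid.
    Survivors are returned in their original document order.
    """
    ranked = sorted(enumerate(blocks), key=lambda pair: (-pair[1].priority, pair[0]))
    kept_indexes = []
    remaining = budget_tokens
    for index, block in ranked:
        if block.token_estimate <= remaining:
            kept_indexes.append(index)
            remaining -= block.token_estimate
    return [blocks[i] for i in sorted(kept_indexes)]
```

Because each surviving block keeps its own `attachment_id` and `page_range`, a citation in the model's answer still points at text that was actually in context.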
PII and secrets: protect before the model sees bytes
Contracts and HR uploads are PII-dense. Run detection + redaction on normalized text before context injection:
- Tokenize emails, phone numbers, account IDs to vault references the agent can reason about (“Party A account [REDACTED_7]”) without echoing raw values.
- Block ingestion entirely when classification hits restricted categories (PCI, PHI) if the deployment lacks compliance mode.
- Log redaction counts per attachment for audit; never log raw matches.
Harbor’s pre-model redaction cut accidental PII echo in model outputs from 11% to 0.3% on attachment runs.
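The tokenization step can be illustrated with a deliberately small regex-based sketch. The two patterns and the `redact` helper are assumptions for illustration; real detectors cover far more categories and usually combine patterns with entity recognition.

```python
import re

# Deliberately simple patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, str], dict[str, int]]:
    """Return (redacted_text, vault, counts).

    vault maps placeholder -> raw value and must stay outside model context
    and outside logs; counts is the only part that is safe to log.
    """
    vault: dict[str, str] = {}
    seen: dict[str, str] = {}   # raw value -> placeholder, so repeats share a token
    counts: dict[str, int] = {}

    def replacer(kind: str):
        def _sub(match: re.Match) -> str:
            raw = match.group(0)
            counts[kind] = counts.get(kind, 0) + 1
            if raw not in seen:
                token = f"[REDACTED_{len(vault) + 1}]"
                seen[raw] = token
                vault[token] = raw
            return seen[raw]
        return _sub

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(replacer(kind), text)
    return text, vault, counts
```

Reusing one placeholder for repeated values lets the agent reason that two mentions refer to the same account or person without ever seeing the raw value.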
Attachment envelope: how tools reference files safely
Agents should not receive filesystem paths. Expose an attachment manifest in the run payload:
    {
      "attachment_id": "att_8f2c…",
      "filename_display": "vendor-msa.pdf",
      "media_type": "application/pdf",
      "pages_ingested": 42,
      "token_estimate": 12400,
      "chunks": [
        { "chunk_id": "c_01", "pages": "1-3", "summary": "…" }
      ],
      "tools_allowed": ["search_attachment", "quote_attachment_span"]
    }
Tools fetch chunk text by ID from an internal API scoped to the run’s
tenant_id and
isolation boundary.
Direct cat /tmp/upload style tools are removed from
production manifests.
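A sketch of how a chunk-fetch tool might enforce tenant scoping. `RunContext` and `ChunkStore` are hypothetical stand-ins for the internal API; the point is that the tenant and the set of allowed attachments come from the run, never from arguments the model supplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContext:
    tenant_id: str
    allowed_attachments: frozenset[str]


class ChunkStore:
    """In-memory stand-in for the internal chunk API (hypothetical)."""

    def __init__(self) -> None:
        self._chunks: dict[tuple[str, str, str], str] = {}

    def put(self, tenant_id: str, attachment_id: str, chunk_id: str, text: str) -> None:
        self._chunks[(tenant_id, attachment_id, chunk_id)] = text

    def get(self, ctx: RunContext, attachment_id: str, chunk_id: str) -> str:
        # The model only supplies attachment_id and chunk_id; the tenant is
        # taken from the run context, so a guessed ID cannot cross tenants.
        if attachment_id not in ctx.allowed_attachments:
            raise PermissionError("attachment not part of this run")
        key = (ctx.tenant_id, attachment_id, chunk_id)
        if key not in self._chunks:
            raise KeyError("unknown chunk")
        return self._chunks[key]
```

Since the tool takes opaque IDs instead of paths, there is no string for a path-traversal payload to ride on.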
Retention, deletion, and compliance
Attachments are not your RAG index. Default policies:
- Session TTL — delete originals and parsed chunks N hours after run completion unless user pins.
- Legal hold — flag prevents GC until case closes; separate bucket with stricter ACL.
- Export control — block cross-region copy when tenant data residency requires it.
- User delete — hard-delete object + embeddings + cache keys on GDPR erasure request.
Pair with audit trails that record who uploaded, ingest outcome, and which chunks were quoted in outbound actions.
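The session TTL, pin, and legal-hold rules reduce to a small predicate that a garbage-collection sweep could call. Field and function names here are hypothetical, and the TTL value is left to the caller because the text does not fix N. Erasure requests are a separate hard-delete path and are not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AttachmentRecord:
    attachment_id: str
    run_completed_at: datetime  # timezone-aware UTC timestamp
    pinned: bool = False
    legal_hold: bool = False


def is_expired(record: AttachmentRecord, ttl: timedelta,
               now: Optional[datetime] = None) -> bool:
    """True when the sweep may delete originals and parsed chunks."""
    now = now or datetime.now(timezone.utc)
    if record.legal_hold or record.pinned:
        return False
    return now - record.run_completed_at >= ttl


def sweep(records: list[AttachmentRecord], ttl: timedelta,
          now: Optional[datetime] = None) -> list[str]:
    """Return attachment IDs that are due for deletion."""
    return [r.attachment_id for r in records if is_expired(r, ttl, now)]
```

Deleting an attachment should also remove its embeddings and cache keys, so the IDs returned by `sweep` would feed a deletion job that covers all three stores.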
Harbor Legal refactor
After the MSA incident, Harbor shipped five changes:
- Typed ingestion workers with per-handler timeouts and page caps; zip uploads rejected at edge.
- Presigned uploads to quarantine bucket; API never holds full file in memory.
- Map-reduce default for PDFs > 12k tokens; full text only on user opt-in.
- PII gate on all legal vertical attachments before first model token.
- Attachment-scoped tools replacing generic file read; citations required on clause claims.
Attachment-backed error rate (outside counsel review) fell from 27% to 1.4%. p95 ingest latency for a 40-page digital PDF dropped from 38s (blocking OCR) to 6s (text-layer first). Support tickets tagged “wrong document” fell 81%.
Technique decision table
| Approach | Strengths | Weaknesses | Best for |
|---|---|---|---|
| Paste / drag raw into prompt | Fastest prototype | No validation, no citations, context blowups | Internal demos only |
| Sync parse in request thread | Simple mental model | Timeouts, no retry, blocks websocket | Small text files < 1 MB |
| Async ingestion pipeline (recommended) | Scalable, gated, observable stages | Engineering upfront | Production agents with uploads |
| Pre-index to RAG only | Great for corpora | Slow for one-off chat attachments | Knowledge bases, not chat uploads |
| Client-side extract (browser PDF.js) | Offloads server CPU | Untrusted extract, inconsistent layout | Supplement only; server must re-validate |
Common pitfalls
- Trusting client MIME types: always sniff magic bytes; reject polyglots.
- Synchronous OCR on uploads: blocks the agent turn; queue ingest and stream progress events.
- Full-document paste: evicts tools and memory; default to manifest + search tool.
- Filesystem paths in tool args: path traversal and cross-tenant reads; use attachment IDs.
- Skipping PII on “internal” docs: contracts are the highest-risk class.
- Infinite retention: compliance debt; TTL + legal hold is enough for most SaaS.
- No ingest versioning: parser upgrades change quotes; bump ingest_version for reproducibility.
Production checklist
- Presigned upload to quarantine bucket with size and rate limits.
- Magic-byte verify + extension allowlist; block or sandbox archives.
- Malware scan on object close before promote to parse queue.
- Route parsers by media family with per-handler timeouts and page caps.
- Run PII detection before any model context injection.
- Package chunks with token estimates and page citations.
- Expose attachment-scoped tools, not raw filesystem reads.
- Default map-reduce or retrieve-on-demand for large documents.
- Emit ingest spans to observability (stage, duration, outcome).
- TTL-delete originals and chunks; support legal hold and erasure.
Key takeaways
- Uploads are adversarial input — validate, scan, and classify before parse.
- Ingestion is not RAG indexing — optimize for ephemeral, cited, budgeted context.
- Parse routing beats one-size-fits-all — PDF text layer before OCR; cap pages.
- PII gates belong pre-model — redact spans, not apologies after leakage.
- Harbor Legal cut attachment errors from 27% to 1.4% with typed workers, chunk budgets, and attachment-scoped tools.