LLM vector metadata filtering explained
Harbor Legal’s contract-search RAG stack indexed 340,000 clause chunks across 47 enterprise clients in one shared vector index. Semantic search returned the right text — but often from the wrong client. Analysts saw competitor indemnification language in their sidebar because the pipeline relied on a post-retrieval “ignore chunks you shouldn’t see” instruction to the LLM. That is not access control; it is hope. After moving tenant and matter ACLs into pre-filtered ANN search, cross-tenant leaks dropped to zero on a 1,800-query audit set, and recall@5 on correctly scoped queries rose from 78% to 89% because the top-k budget stopped filling with irrelevant-but-similar vectors from other tenants.
Metadata filtering attaches structured fields to each stored vector — tenant ID, document type, effective date, language, clearance level — and restricts nearest-neighbor search to rows matching a filter expression before or during graph traversal. This guide explains filter semantics across major vector stores, pre-filter vs post-filter tradeoffs, schema design for multi-tenant RAG, how filters interact with hybrid BM25+vector pipelines, the Harbor Legal refactor, a technique decision table versus separate indexes and LLM-side filtering, pitfalls, and a production checklist — building on chunk metadata design and two-stage reranking.
What metadata filtering is
Every vector upsert carries a payload (metadata JSON) alongside the embedding. At query time you supply a boolean filter (e.g. tenant_id = "acme" AND doc_type IN ("msa", "sow") AND effective_date >= 2024-01-01) and the database returns nearest neighbors only among matching rows.
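As a minimal sketch of what that predicate means, the example filter can be written as a plain Python function over payload dicts. The `matches` helper and the sample rows are hypothetical; a real vector store evaluates the filter inside the index rather than in application code.

```python
from datetime import date

def matches(payload: dict, tenant_id: str, doc_types: set, effective_from: date) -> bool:
    """True when tenant_id = X AND doc_type IN (...) AND effective_date >= D."""
    return (
        payload.get("tenant_id") == tenant_id
        and payload.get("doc_type") in doc_types
        and date.fromisoformat(payload["effective_date"]) >= effective_from
    )

# Hypothetical payloads stored next to each embedding.
rows = [
    {"id": 1, "tenant_id": "acme", "doc_type": "msa", "effective_date": "2024-03-01"},
    {"id": 2, "tenant_id": "acme", "doc_type": "email", "effective_date": "2024-05-10"},
    {"id": 3, "tenant_id": "globex", "doc_type": "msa", "effective_date": "2024-06-01"},
    {"id": 4, "tenant_id": "acme", "doc_type": "sow", "effective_date": "2023-11-30"},
]

# Only rows passing the predicate are eligible for nearest-neighbor ranking.
eligible = [r["id"] for r in rows if matches(r, "acme", {"msa", "sow"}, date(2024, 1, 1))]
print(eligible)
```

Row 2 fails on doc_type, row 3 on tenant, and row 4 on date, so only row 1 stays eligible.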
Filtering solves three distinct problems:
- Authorization — users must never retrieve chunks outside their ACL; this is a security requirement, not a quality tweak.
- Scope narrowing — “search only HR policies published after 2023” improves precision without rewriting the query embedding.
- Operational partitioning — route staging vs production corpora, or archived vs active matter folders, inside one physical index.
| Approach | When filter runs | Typical use |
|---|---|---|
| Pre-filter (filter-first) | Before / during ANN graph walk | Strict ACL, small candidate sets per tenant |
| Post-filter (search-first) | After retrieving top-N neighbors | Low-selectivity filters, exploratory search |
| Partitioned collections | Separate index per scope | Large tenants, hard isolation, different embedding models |
| Application-side filter | After DB returns results | Prototyping only — not safe for ACL |
Pre-filter vs post-filter performance
Post-filter is the naive pattern: retrieve the top 100 by cosine similarity, discard rows failing the metadata predicate, and return what remains. When the filter is selective (one tenant among fifty sharing an index), you routinely end up with fewer than k results even though plenty of in-scope neighbors exist. Worse, the 100 global nearest neighbors may all be out of scope, yielding an empty set while the relevant in-scope chunks sit at global ranks 150–400.
Pre-filter (supported natively in Qdrant, Weaviate, Milvus, Pinecone, pgvector with partial indexes) applies the predicate during HNSW or IVF traversal. The graph explores only eligible nodes, so top-k reflects true in-scope similarity. Cost: if the filter matches very few rows (<0.1% of the index), ANN graphs may degrade toward brute force unless the store builds payload indexes on hot filter fields.
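The difference can be reproduced on toy data. The sketch below uses brute-force cosine ranking as a stand-in for ANN search, and the two tenants, their vectors, and both function names are invented for illustration; it only shows why a search-first pipeline can come back empty while a filter-first one does not.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def unit(angle):
    return (math.cos(angle), math.sin(angle))

def post_filter_search(rows, query, tenant_id, k, fetch_n):
    # Search first: rank everything, keep the global top fetch_n, then filter.
    top = sorted(rows, key=lambda r: cosine(r["vec"], query), reverse=True)[:fetch_n]
    return [r for r in top if r["tenant_id"] == tenant_id][:k]

def pre_filter_search(rows, query, tenant_id, k):
    # Filter first: only in-scope rows compete for the top-k slots.
    eligible = [r for r in rows if r["tenant_id"] == tenant_id]
    return sorted(eligible, key=lambda r: cosine(r["vec"], query), reverse=True)[:k]

# 200 vectors from another tenant sit closer to the query than acme's 5 vectors.
rows = [{"id": f"other-{i}", "tenant_id": "other", "vec": unit(0.001 * i)} for i in range(1, 201)]
rows += [{"id": f"acme-{j}", "tenant_id": "acme", "vec": unit(0.5 + 0.01 * j)} for j in range(5)]
query = (1.0, 0.0)

post = post_filter_search(rows, query, "acme", k=5, fetch_n=100)
pre = pre_filter_search(rows, query, "acme", k=5)
print(len(post), [r["id"] for r in pre])
```

Here the global top 100 is entirely the other tenant, so the post-filter path returns nothing even though five in-scope rows exist.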
When post-filter is acceptable
Post-filter works when selectivity is low (for example, filtering language = "en" on a 95% English corpus) or when k is intentionally oversized (retrieve 500, filter to 20). It fails for tenant ACLs on shared indexes; never use it for security boundaries.
Over-fetch multiplier
If you are stuck on post-filter temporarily, multiply retrieval depth by the inverse of the expected in-scope fraction: if 2% of rows are in scope, fetch 50× the target k before filtering. Log the empty-result rate; spikes mean the pre-filter migration is overdue.
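The multiplier arithmetic is simple enough to encode directly. This is a sketch under assumptions: `overfetch_depth` and its `cap` parameter are hypothetical names, and the cap value is an arbitrary guard against fetching an unbounded number of rows.

```python
import math

def overfetch_depth(target_k: int, in_scope_fraction: float, cap: int = 10_000) -> int:
    """Rows to fetch before post-filtering so roughly target_k survive on average."""
    if not 0 < in_scope_fraction <= 1:
        raise ValueError("in_scope_fraction must be in (0, 1]")
    # round() guards against float noise such as 499.99999999999994.
    depth = math.ceil(round(target_k / in_scope_fraction, 9))
    return min(cap, depth)

print(overfetch_depth(10, 0.02))    # 2% in scope -> 50x the target k
print(overfetch_depth(20, 0.0001))  # tiny fraction hits the cap
```

The result is an expectation, not a guarantee: if in-scope vectors are not evenly spread among the query's neighbors, the surviving count can still fall below k.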
Metadata schema design for RAG
Design payload fields at chunk ingest time; retrofitting ACL metadata after launch requires re-upserting every vector.
- tenant_id / org_id — single string or UUID; index with keyword payload index; required on every row.
- acl_groups — array of role or matter IDs the user must intersect; use “match any” semantics carefully and test OR vs AND.
- doc_type — enum (policy, contract, email) for product UI facets.
- source_id + chunk_index — join keys back to parent documents for citation rendering.
- effective_date / ingested_at — range filters for temporal queries; store as epoch or ISO strings consistently.
- language — ISO 639-1; pairs with multilingual routing guides.
- is_deleted / tombstone — soft-delete without immediate reindex; filter is_deleted = false on every query.
Keep payloads small. Depending on the store and its configuration, payload data may be held in memory alongside each indexed vector, so bloated JSON (full parent paragraphs duplicated) inflates RAM. Store citation text in object storage or Postgres; the vector payload holds pointers only.
Align filter fields with query routing: if the router detects “2024 MSAs only,” emit structured filter JSON rather than hoping the embedding encodes the date constraint.
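One way to enforce the schema above is a validation step in the ingest path. The sketch below is an assumption about how that could look: `REQUIRED_KEYS`, `MAX_PAYLOAD_BYTES`, and `validate_payload` are hypothetical names, and the 1 KB budget is an arbitrary illustration of "pointers only".

```python
import json

# Hypothetical required fields, taken from the schema list above.
REQUIRED_KEYS = ("tenant_id", "acl_groups", "doc_type", "source_id", "chunk_index")
MAX_PAYLOAD_BYTES = 1024  # assumed budget; tune for your store

def validate_payload(payload: dict) -> dict:
    """Reject payloads missing ACL or join keys, and default the tombstone flag."""
    missing = [k for k in REQUIRED_KEYS if payload.get(k) in (None, "", [])]
    if missing:
        raise ValueError(f"payload missing required keys: {missing}")
    clean = {**payload, "is_deleted": bool(payload.get("is_deleted", False))}
    size = len(json.dumps(clean).encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {size} bytes; store pointers, not text")
    return clean

ok = validate_payload({
    "tenant_id": "acme",
    "acl_groups": ["matter-17"],
    "doc_type": "contract",
    "source_id": "doc-9",
    "chunk_index": 0,  # zero is a valid index and must not count as missing
})
print(ok["is_deleted"])
```

Running the check before the upsert call means a chunk without ACL keys never reaches the index.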
Store-specific filter patterns
Qdrant
Payload indexes on keyword, integer, and datetime fields. Filter DSL supports nested must/should/must_not. Pre-filter is default when indexes exist. Create tenant payload index before bulk ingest on multi-tenant workloads.
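A nested filter of this kind can be assembled as a plain dict before it is sent to the store. The sketch below follows the general shape of Qdrant's JSON filter conditions (`must`, `must_not`, `key`, `match`) as I understand them; the `acl_filter` helper and the field names are hypothetical, and the exact schema should be checked against the Qdrant documentation for your version.

```python
import json

def acl_filter(tenant_id: str, matter_ids: set, exclude_deleted: bool = True) -> dict:
    """Build a filter dict: tenant must match, matter must be one of the allowed IDs."""
    flt = {
        "must": [
            {"key": "tenant_id", "match": {"value": tenant_id}},
            # "any" expresses OR over the allowed matter IDs.
            {"key": "matter_id", "match": {"any": sorted(matter_ids)}},
        ]
    }
    if exclude_deleted:
        flt["must_not"] = [{"key": "is_deleted", "match": {"value": True}}]
    return flt

flt = acl_filter("acme", {"m-2", "m-1"})
print(json.dumps(flt, indent=2))
```

Sorting the matter IDs keeps the serialized filter stable, which helps if filters are cached or logged.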
pgvector
Combine ORDER BY embedding <=> query LIMIT k with WHERE tenant_id = $1. Partial indexes on (tenant_id) WHERE NOT is_deleted help. IVFFlat lists must be rebuilt after major tenant data shifts; HNSW via pgvector 0.5+ improves filtered recall.
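A sketch of the scoped query, built as a parameterized statement so the tenant ID never gets concatenated into SQL. The `chunks` table, its columns, and `build_scoped_query` are hypothetical; the `%(name)s` placeholders assume a psycopg-style driver, and the snippet only builds the statement without connecting to a database.

```python
# <=> is pgvector's cosine-distance operator; smaller means more similar.
SCOPED_KNN_SQL = """
SELECT id, source_id, chunk_index,
       embedding <=> %(query)s::vector AS distance
FROM chunks
WHERE tenant_id = %(tenant_id)s
  AND NOT is_deleted
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""

def build_scoped_query(tenant_id: str, query_vec: list, k: int):
    """Return (sql, params) for a tenant-scoped nearest-neighbor query."""
    # pgvector accepts vectors in the text form '[x1,x2,...]'.
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    return SCOPED_KNN_SQL, {"tenant_id": tenant_id, "query": vec_literal, "k": k}

sql, params = build_scoped_query("acme", [0.1, 0.2, 0.3], 5)
print(params)
```

With a driver connection, the pair would be passed as `cursor.execute(sql, params)`.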
Pinecone / managed services
Metadata filters in query API; namespaces provide hard partition per tenant at the cost of cross-tenant analytics. Prefer namespaces when tenants are large and ACL is strictly single-tenant; shared index + payload filter when tenants are small and numerous.
Hybrid retrieval
In BM25+vector fusion, apply identical filters to both legs before reciprocal rank fusion. Filtering only the dense leg lets lexical hits leak out-of-scope documents into merged results.
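The rule can be sketched with a small reciprocal rank fusion function. Everything here is illustrative: the document IDs, the `tenant_of` lookup, and the list comprehension that stands in for a filter pushed down into each engine. The constant 60 is a commonly used RRF default, not a value taken from this guide.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranking a doc appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; doc_id breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))

def scoped_hybrid(dense_hits, lexical_hits, tenant_of, tenant_id):
    """Apply the same tenant filter to both legs, then fuse."""
    def in_scope(ids):
        return [d for d in ids if tenant_of[d] == tenant_id]
    return rrf([in_scope(dense_hits), in_scope(lexical_hits)])

tenant_of = {"a1": "acme", "a2": "acme", "g9": "globex"}
dense_hits = ["a1", "a2"]          # dense leg was already filtered
lexical_hits = ["g9", "a2", "a1"]  # unfiltered lexical leg leaks g9

fused = scoped_hybrid(dense_hits, lexical_hits, tenant_of, "acme")
unsafe = rrf([dense_hits, lexical_hits])
print(fused, unsafe)
```

Fusing the raw lists lets the out-of-scope lexical hit into the merged results; filtering both legs removes it before ranks are combined.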
Harbor Legal multi-client refactor
Before the refactor, Harbor Legal ran one Qdrant collection for all clients. Queries fetched the top 40 globally; a Python middleware dropped rows whose matter_id was not in the user’s session list, then passed the remainder to a cross-encoder reranker.
Problems accumulated:
- 38% of queries returned fewer than 5 chunks after post-filter despite in-scope content existing.
- Two audit queries surfaced redacted competitor clauses into the LLM context (model ignored the system prompt; answer was not returned to user, but context window was contaminated).
- Cross-encoder latency spiked because reranker scored irrelevant global hits first.
Refactor steps:
- Added keyword payload indexes on tenant_id and matter_id.
- Moved ACL expression into Qdrant pre-filter on every query.
- Increased ANN ef parameter 64 → 128 for filtered searches (small graph subgraphs need wider beam).
- Synced BM25 leg in OpenSearch with identical tenant_id term filter.
- Added integration tests: synthetic queries per tenant must never return foreign tenant_id in top 50.
Outcomes: scoped recall@5 78% → 89%; reranker input size 40 → 20 vectors (same latency budget); zero ACL violations on audit set.
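The isolation test from the refactor steps might look roughly like the following. This is not Harbor Legal's actual test code: `check_isolation`, the stub search functions, and the tiny corpus are all hypothetical, and a real test would call the production retrieval path instead of the stubs.

```python
def foreign_hits(results, tenant_id):
    """Rows whose tenant_id differs from the tenant that issued the query."""
    return [r for r in results if r.get("tenant_id") != tenant_id]

def check_isolation(search, probes, top_n=50):
    """Run each (tenant_id, query) probe; return leak records, empty if isolated."""
    failures = []
    for tenant_id, query in probes:
        leaks = foreign_hits(search(tenant_id, query, top_n), tenant_id)
        if leaks:
            failures.append((tenant_id, query, [r["id"] for r in leaks]))
    return failures

# Stub corpus and two stub retrievers standing in for the real pipeline.
CORPUS = [
    {"id": "c1", "tenant_id": "acme"},
    {"id": "c2", "tenant_id": "globex"},
    {"id": "c3", "tenant_id": "acme"},
]

def scoped_search(tenant_id, query, top_n):
    return [r for r in CORPUS if r["tenant_id"] == tenant_id][:top_n]

def leaky_search(tenant_id, query, top_n):
    return CORPUS[:top_n]  # ignores the tenant entirely

probes = [("acme", "indemnification cap"), ("globex", "termination notice")]
print(check_isolation(scoped_search, probes))
print(check_isolation(leaky_search, probes))
```

Returning the leaked IDs rather than a bare boolean makes a failing deploy check easier to triage.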
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Multi-tenant ACL on shared index | Pre-filter with payload indexes | Post-filter + LLM “ignore” instructions |
| Single large tenant (>40% of vectors) | Dedicated collection or namespace | Shared index with heavy pre-filter overhead |
| Soft facet (“prefer recent”) | Post-filter or reranker feature | Hard date cutoff that hides relevant older docs |
| Filter matches <0.05% of rows | Partitioned index per scope | Global HNSW pre-filter without tuning |
| Hybrid BM25 + vector | Same filter on both legs | Vector-only filtering |
| Staging vs production corpus | env payload field + mandatory filter | Separate clusters (cost) unless compliance requires |
| Complex OR ACL (any of 200 matters) | Flatten to allowed ID set server-side | 200-clause OR in every query without caching |
Metadata filtering complements language routing and drift monitors — filters narrow the candidate set; they do not fix wrong embeddings.
Pitfalls
- LLM as access control — never rely on the model to discard unauthorized chunks; prompts leak in logs and cache layers.
- Missing payload indexes — pre-filter without indexes on high-cardinality fields forces full scans.
- Inconsistent filter on hybrid legs — lexical results bypass vector ACL.
- Stale tombstones — deleted documents still embedded until reindex jobs complete; filter tombstones at query time.
- Over-selective date filters — users miss governing docs signed before an arbitrary cutoff.
- Client-supplied filter injection — treat filter JSON as server-computed from session auth, not raw user input.
- Under-fetch with post-filter — empty results that look like “no knowledge” are often a retrieval bug.
- Namespace sprawl — thousands of empty Pinecone namespaces complicate ops; consolidate small tenants.
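The filter-injection pitfall above can be guarded against by building the filter from the authenticated session and accepting only whitelisted soft facets from the client. The session shape, `ALLOWED_FACETS`, and `build_filter` below are assumptions for illustration.

```python
# Facets a client may influence; ACL fields are deliberately absent.
ALLOWED_FACETS = {"doc_type", "language"}

def build_filter(session: dict, client_facets: dict) -> dict:
    """Merge whitelisted client facets with ACL fields taken only from the session."""
    facets = {k: v for k, v in client_facets.items() if k in ALLOWED_FACETS}
    # ACL fields are written last so nothing from the client can override them.
    return {
        **facets,
        "tenant_id": session["tenant_id"],
        "matter_id": sorted(session["matter_ids"]),
        "is_deleted": False,
    }

session = {"tenant_id": "acme", "matter_ids": {"m-7", "m-2"}}
request_facets = {"doc_type": "msa", "tenant_id": "globex", "is_deleted": True}

flt = build_filter(session, request_facets)
print(flt)
```

The client's attempt to set tenant_id and is_deleted is dropped by the whitelist, so the resulting filter still scopes to the session's tenant.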
Production checklist
- Define required payload fields at ingest; reject upserts missing ACL keys.
- Build payload / partial indexes on every field used in pre-filters.
- Compute filters server-side from authenticated session — never trust the client.
- Use pre-filter for authorization boundaries; post-filter only for soft facets.
- Apply identical filters to dense and sparse retrieval legs before fusion.
- Benchmark recall@k under filter with per-tenant golden queries.
- Load-test filtered ANN p95; tune ef / probes when selectivity is high.
- Alert on empty-result rate and post-filter shrinkage ratio.
- Integration-test cross-tenant isolation on every deploy.
- Document filter schema versioning; migrate payloads before enabling new required fields.
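The two alerting metrics in the checklist can be computed from a per-query log of how many rows were fetched and how many survived filtering. The definitions used here are assumptions: empty-result rate is the share of queries with zero surviving rows, and shrinkage ratio is the share of fetched rows discarded by the filter. `retrieval_health` and the log format are hypothetical.

```python
def retrieval_health(log):
    """log: list of (fetched, kept) pairs, one per query."""
    if not log:
        raise ValueError("log is empty")
    total_fetched = sum(fetched for fetched, _ in log)
    total_kept = sum(kept for _, kept in log)
    empty_rate = sum(1 for _, kept in log if kept == 0) / len(log)
    # Fraction of fetched rows thrown away after retrieval.
    shrinkage = 1 - total_kept / total_fetched if total_fetched else 0.0
    return {"empty_result_rate": empty_rate, "shrinkage_ratio": shrinkage}

# Four queries fetching 40 rows each, with varying survivors.
metrics = retrieval_health([(40, 5), (40, 0), (40, 15), (40, 20)])
print(metrics)
```

A rising shrinkage ratio alongside a rising empty-result rate suggests the filter is doing work after retrieval that belongs inside the index.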
Key takeaways
- Metadata filtering scopes ANN search to rows matching structured predicates.
- Pre-filter is mandatory for ACL; post-filter alone wastes top-k budget and leaks context.
- Payload indexes and tuned ANN params matter when filters are selective.
- Harbor Legal recall@5 rose 11 points after pre-filter replaced post-filter ACL.
- Hybrid pipelines must filter BM25 and vector legs identically.
Related reading
- Vector databases explained — ANN indexes and baseline metadata support
- Hybrid search explained — apply filters before BM25+vector fusion
- LLM reranking explained — stage-two scoring on filtered candidates
- RAG chunking strategies explained — attach metadata at chunk ingest