Guide

LLM vector metadata filtering explained

Harbor Legal’s contract-search RAG stack indexed 340,000 clause chunks across 47 enterprise clients in one shared vector index. Semantic search returned the right text — but often from the wrong client. Analysts saw competitor indemnification language in their sidebar because the pipeline relied on a post-retrieval “ignore chunks you shouldn’t see” instruction to the LLM. That is not access control; it is hope. After moving tenant and matter ACLs into pre-filtered ANN search, cross-tenant leaks dropped to zero on a 1,800-query audit set, and recall@5 on correctly scoped queries rose from 78% to 89% because the top-k budget stopped filling with irrelevant-but-similar vectors from other tenants.

Metadata filtering attaches structured fields to each stored vector — tenant ID, document type, effective date, language, clearance level — and restricts nearest-neighbor search to rows matching a filter expression before or during graph traversal. This guide explains filter semantics across major vector stores, pre-filter vs post-filter tradeoffs, schema design for multi-tenant RAG, how filters interact with hybrid BM25+vector pipelines, the Harbor Legal refactor, a technique decision table versus separate indexes and LLM-side filtering, pitfalls, and a production checklist — building on chunk metadata design and two-stage reranking.

What metadata filtering is

Every vector upsert carries a payload (metadata JSON) alongside the embedding. At query time you supply a boolean filter — e.g. tenant_id = "acme" AND doc_type IN ("msa", "sow") AND effective_date >= 2024-01-01 — and the database returns nearest neighbors only among matching rows.
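
To make the semantics concrete, here is a minimal brute-force sketch of the example filter above. It is not how an ANN store executes the query; it only shows which rows are eligible and how they are ranked. The row layout and the names matches and filtered_search are assumptions for illustration.

```python
from datetime import date
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def matches(payload):
    # tenant_id = "acme" AND doc_type IN ("msa", "sow") AND effective_date >= 2024-01-01
    return (
        payload["tenant_id"] == "acme"
        and payload["doc_type"] in {"msa", "sow"}
        and payload["effective_date"] >= date(2024, 1, 1)
    )

def filtered_search(rows, query, k):
    """Rank only the rows whose payload satisfies the predicate."""
    eligible = [r for r in rows if matches(r["payload"])]
    eligible.sort(key=lambda r: cosine(r["vector"], query), reverse=True)
    return eligible[:k]

rows = [
    {"id": 1, "vector": [1.0, 0.0],
     "payload": {"tenant_id": "acme", "doc_type": "msa", "effective_date": date(2024, 3, 1)}},
    {"id": 2, "vector": [0.9, 0.1],  # similar, but belongs to another tenant
     "payload": {"tenant_id": "globex", "doc_type": "msa", "effective_date": date(2024, 5, 1)}},
    {"id": 3, "vector": [0.0, 1.0],
     "payload": {"tenant_id": "acme", "doc_type": "sow", "effective_date": date(2024, 2, 1)}},
    {"id": 4, "vector": [1.0, 0.1],  # right tenant, but dated before the cutoff
     "payload": {"tenant_id": "acme", "doc_type": "msa", "effective_date": date(2023, 6, 1)}},
]

hits = filtered_search(rows, [1.0, 0.0], k=2)
hit_ids = [h["id"] for h in hits]
```

Rows 2 and 4 are close to the query in vector space but never enter the ranking, because the predicate removes them first.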

Filtering solves three distinct problems:

Authorization — users must never retrieve chunks outside their ACL; this is a security requirement, not a quality tweak.
Scope narrowing — “search only HR policies published after 2023” improves precision without rewriting the query embedding.
Operational partitioning — route staging vs production corpora, or archived vs active matter folders, inside one physical index.

Approach	When filter runs	Typical use
Pre-filter (filter-first)	Before / during ANN graph walk	Strict ACL, small candidate sets per tenant
Post-filter (search-first)	After retrieving top-N neighbors	Low-selectivity filters, exploratory search
Partitioned collections	Separate index per scope	Large tenants, hard isolation, different embedding models
Application-side filter	After DB returns results	Prototyping only — not safe for ACL

Pre-filter vs post-filter performance

Post-filter is the naive pattern: retrieve the top 100 by cosine similarity, discard rows that fail the metadata predicate, and return what remains. When the filter is selective (one tenant among fifty sharing an index), you routinely end up with fewer than k results even though plenty of in-scope neighbors exist. Worse, the 100 global nearest neighbors may all be out of scope, yielding an empty set while the relevant in-scope chunks sit at global ranks 150–400.

Pre-filter (supported natively in Qdrant, Weaviate, Milvus, Pinecone, pgvector with partial indexes) applies the predicate during HNSW or IVF traversal. The graph explores only eligible nodes, so top-k reflects true in-scope similarity. Cost: if the filter matches very few rows (<0.1% of the index), ANN graphs may degrade toward brute force unless the store builds payload indexes on hot filter fields.
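
The under-fetch failure can be reproduced with a toy corpus. This sketch uses precomputed similarity scores and exhaustive sorting in place of a real ANN index, so it illustrates only the ranking effect, not traversal cost. The corpus layout and function names are assumptions.

```python
def post_filter(rows, tenant, k, fetch_n):
    """Search first: take the global top fetch_n, then drop out-of-scope rows."""
    top = sorted(rows, key=lambda r: r["score"], reverse=True)[:fetch_n]
    return [r for r in top if r["tenant_id"] == tenant][:k]

def pre_filter(rows, tenant, k):
    """Filter first: rank only the rows the tenant is allowed to see."""
    eligible = [r for r in rows if r["tenant_id"] == tenant]
    return sorted(eligible, key=lambda r: r["score"], reverse=True)[:k]

# 400 chunks; row i has similarity 1 - i/1000, so a lower i is a closer neighbor.
# The tenant "acme" owns every tenth chunk from global rank 150 onward.
rows = [
    {
        "id": i,
        "score": 1 - i / 1000,
        "tenant_id": "acme" if i >= 150 and i % 10 == 0 else "other",
    }
    for i in range(400)
]

post_hits = post_filter(rows, "acme", k=5, fetch_n=100)
pre_hits = pre_filter(rows, "acme", k=5)
post_ids = [r["id"] for r in post_hits]
pre_ids = [r["id"] for r in pre_hits]
```

Here the post-filter path returns nothing because the 100 closest rows all belong to other tenants, while the pre-filter path returns the five closest in-scope rows.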

When post-filter is acceptable

Post-filter works when the filter excludes few rows (for example, language = "en" on a 95% English corpus) or when retrieval depth is intentionally oversized (retrieve 500, filter to 20). It fails for tenant ACLs on shared indexes; never use it for security boundaries.

Over-fetch multiplier

If you are stuck on post-filter temporarily, multiply retrieval depth by the inverse of the expected in-scope fraction: an in-scope fraction of 2% implies fetching 50× the target k before filtering. Log the empty-result rate; spikes mean the pre-filter migration is overdue.
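
The multiplier rule is simple arithmetic; a small helper makes it explicit. The function name and the cap parameter are assumptions added for illustration. The result is an expected depth only, so individual queries can still come back short.

```python
import math

def overfetch_depth(k, in_scope_fraction, cap=10_000):
    """Retrieval depth needed so that, on average, k in-scope rows survive post-filtering."""
    if not 0 < in_scope_fraction <= 1:
        raise ValueError("in_scope_fraction must be in (0, 1]")
    # Subtract a tiny epsilon so float noise such as 250.00000000000003 does not round up.
    depth = math.ceil(k / in_scope_fraction - 1e-9)
    return min(cap, depth)

depth_2pct = overfetch_depth(5, 0.02)    # 2% in scope means 50x the target k
depth_all = overfetch_depth(20, 1.0)     # nothing filtered out, no over-fetch
depth_capped = overfetch_depth(100, 0.001)
```

The cap exists because very selective filters push the depth into ranges where post-filter is no longer a reasonable stopgap.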

Metadata schema design for RAG

Design payload fields at chunk ingest time; retrofitting ACL metadata after launch requires re-upserting every vector.

tenant_id / org_id — single string or UUID; index with keyword payload index; required on every row.
acl_groups — array of role or matter IDs, at least one of which must appear in the user’s entitlements; use “match any” semantics carefully and test OR vs AND.
doc_type — enum (policy, contract, email) for product UI facets.
source_id + chunk_index — join keys back to parent documents for citation rendering.
effective_date / ingested_at — range filters for temporal queries; store as epoch or ISO strings consistently.
language — ISO 639-1; pairs with multilingual routing guides.
is_deleted / tombstone — soft-delete without immediate reindex; filter is_deleted = false on every query.
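
One way to enforce this schema is to validate payloads before upsert. The sketch below assumes a hypothetical required-field set drawn from the list above; which fields are mandatory in a real deployment is a design decision, not something this guide prescribes.

```python
# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "tenant_id": str,
    "acl_groups": list,
    "doc_type": str,
    "source_id": str,
    "chunk_index": int,
    "is_deleted": bool,
}

def validate_payload(payload):
    """Return a list of problems; an empty list means the payload may be upserted."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"{field} must be {expected.__name__}")
    return errors

good = {
    "tenant_id": "acme",
    "acl_groups": ["matter-17"],
    "doc_type": "contract",
    "source_id": "doc-001",
    "chunk_index": 3,
    "is_deleted": False,
}
good_errors = validate_payload(good)
bad_errors = validate_payload({"tenant_id": "acme", "chunk_index": "3"})
```

Rejecting a row at ingest is far cheaper than discovering after launch that some vectors carry no ACL keys.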

Keep payloads small. Many stores keep payload data in memory alongside the vectors, so bloated JSON (full parent paragraphs duplicated) inflates RAM. Store citation text in object storage or Postgres; the vector payload holds pointers only.

Align filter fields with query routing: if the router detects “2024 MSAs only,” emit structured filter JSON rather than hoping the embedding encodes the date constraint.
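
A toy router shows the idea. It assumes that a four-digit year in the query refers to effective_date and that a small hypothetical vocabulary maps surface forms to doc_type values; a production router would more likely use a classifier or an LLM with a constrained output schema.

```python
import re

# Hypothetical mapping from surface forms to stored doc_type values.
DOC_TYPES = {"msa": "msa", "msas": "msa", "sow": "sow", "sows": "sow"}

def route_filters(query):
    """Extract structured constraints from the query text instead of relying on the embedding."""
    filters = {}
    year = re.search(r"\b(20\d{2})\b", query)
    if year:
        y = int(year.group(1))
        # Half-open range covering the calendar year.
        filters["effective_date"] = {"gte": f"{y}-01-01", "lt": f"{y + 1}-01-01"}
    words = re.findall(r"[a-z]+", query.lower())
    types = sorted({DOC_TYPES[w] for w in words if w in DOC_TYPES})
    if types:
        filters["doc_type"] = types
    return filters

scoped = route_filters("2024 MSAs only")
unscoped = route_filters("indemnification cap")
```

The structured output is merged with the server-computed ACL filter; it never replaces it.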

Store-specific filter patterns

Qdrant

Payload indexes on keyword, integer, and datetime fields. Filter DSL supports nested must/should/must_not. Pre-filter is default when indexes exist. Create tenant payload index before bulk ingest on multi-tenant workloads.
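
As a sketch, the nested must / must_not structure can be built as a plain dictionary in the shape of Qdrant's JSON filter body. The helper name qdrant_filter and the field names matter_id and is_deleted are illustrative; verify the clause syntax against the Qdrant version you run before relying on it.

```python
import json

def qdrant_filter(tenant_id, matter_ids):
    """Build a filter body: the tenant must match, any allowed matter may match, tombstones are excluded."""
    return {
        "must": [
            {"key": "tenant_id", "match": {"value": tenant_id}},
            {"key": "matter_id", "match": {"any": sorted(matter_ids)}},
        ],
        "must_not": [
            {"key": "is_deleted", "match": {"value": True}},
        ],
    }

flt = qdrant_filter("acme", {"m-204", "m-017"})
body = json.dumps({"filter": flt, "limit": 20})  # vector and other search fields omitted here
```

Sorting the matter IDs keeps the serialized filter deterministic, which helps if filter bodies are cached or logged for audit.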

pgvector

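Combine ORDER BY embedding <=> query LIMIT k with WHERE tenant_id = $1. Partial indexes such as (tenant_id) WHERE NOT is_deleted help. With approximate indexes, pgvector applies the WHERE clause after the index scan, so a selective filter can return fewer than k rows unless a partial index or partition covers the scope. IVFFlat lists should be rebuilt after major tenant data shifts; HNSW, available since pgvector 0.5, improves filtered recall.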
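
The query pattern above might be assembled like this. The table name chunks and the selected columns are assumptions, and the %s placeholders follow the psycopg convention rather than the $1 style shown in the prose. Only the statement and its parameters are built here; executing them is left to your database driver.

```python
def pgvector_query(tenant_id, query_vec, k):
    """Return (sql, params) for a tenant-scoped nearest-neighbor query by cosine distance."""
    sql = (
        "SELECT id, source_id, chunk_index "
        "FROM chunks "
        "WHERE tenant_id = %s AND NOT is_deleted "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    # pgvector accepts vectors in the text form '[x1,x2,...]'.
    vec_literal = "[" + ",".join(repr(float(x)) for x in query_vec) + "]"
    return sql, (tenant_id, vec_literal, k)

sql, params = pgvector_query("acme", [0.1, 0.2], 5)
```

Passing the tenant ID as a bound parameter rather than formatting it into the string keeps the ACL predicate out of reach of injection.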

Pinecone / managed services

Metadata filters in query API; namespaces provide hard partition per tenant at the cost of cross-tenant analytics. Prefer namespaces when tenants are large and ACL is strictly single-tenant; shared index + payload filter when tenants are small and numerous.

Hybrid retrieval

In BM25+vector fusion, apply identical filters to both legs before reciprocal rank fusion. Filtering only the dense leg lets lexical hits leak out-of-scope documents into merged results.
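
The leak is easy to demonstrate with a toy fusion. In this sketch each leg is simulated by sorting precomputed scores; the document IDs, scores, and the constant k=60 in reciprocal rank fusion are illustrative assumptions.

```python
DOCS = {
    "a1": {"tenant_id": "acme", "dense": 0.90, "bm25": 2.0},
    "a2": {"tenant_id": "acme", "dense": 0.70, "bm25": 5.0},
    "g1": {"tenant_id": "globex", "dense": 0.95, "bm25": 9.0},
}

def leg(score_field, tenant_id):
    """Simulate one retrieval leg; tenant_id=None means the leg was left unfiltered."""
    ids = [d for d, row in DOCS.items() if tenant_id is None or row["tenant_id"] == tenant_id]
    return sorted(ids, key=lambda d: DOCS[d][score_field], reverse=True)

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) across legs, ties broken by document ID."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

safe = rrf([leg("dense", "acme"), leg("bm25", "acme")])   # same filter on both legs
leaky = rrf([leg("dense", "acme"), leg("bm25", None)])    # lexical leg left unfiltered
```

In the leaky variant the foreign document g1 enters the fused list purely through the lexical leg, even though the dense leg was scoped correctly.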

Harbor Legal multi-client refactor

Before refactor, Harbor Legal ran one Qdrant collection for all clients. Queries fetched top 40 globally; a Python middleware dropped rows whose matter_id was not in the user’s session list, then passed the remainder to a cross-encoder reranker. Problems accumulated:

38% of queries returned fewer than 5 chunks after post-filter despite in-scope content existing.
Two audit queries surfaced redacted competitor clauses into the LLM context (the model ignored the system prompt; the answer was not returned to the user, but the context window was contaminated).
Cross-encoder latency spiked because reranker scored irrelevant global hits first.

Refactor steps:

Added keyword payload indexes on tenant_id and matter_id.
Moved ACL expression into Qdrant pre-filter on every query.
Increased ANN ef parameter 64 → 128 for filtered searches (sparse filtered subgraphs need a wider search beam).
Synced BM25 leg in OpenSearch with identical tenant_id term filter.
Added integration tests: synthetic queries per tenant must never return foreign tenant_id in top 50.
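
The last step can be sketched as a reusable check. The search callable and its signature are hypothetical stand-ins for the real retrieval entry point, and the two toy implementations exist only to show a passing and a failing case.

```python
def find_isolation_violations(search, tenants, queries, top_n=50):
    """Return (tenant, query, hit_id) for every hit owned by a different tenant."""
    violations = []
    for tenant in tenants:
        for query in queries:
            for hit in search(query, tenant_id=tenant, limit=top_n):
                if hit["tenant_id"] != tenant:
                    violations.append((tenant, query, hit["id"]))
    return violations

# Toy stand-ins for the retrieval call.
CORPUS = [
    {"id": "c1", "tenant_id": "acme"},
    {"id": "c2", "tenant_id": "globex"},
]

def scoped_search(query, tenant_id, limit):
    return [c for c in CORPUS if c["tenant_id"] == tenant_id][:limit]

def leaky_search(query, tenant_id, limit):
    return CORPUS[:limit]  # ignores the tenant filter entirely

tenants = ["acme", "globex"]
queries = ["indemnification cap", "termination notice"]
clean = find_isolation_violations(scoped_search, tenants, queries)
leaks = find_isolation_violations(leaky_search, tenants, queries)
```

Run against the production search path in CI, any non-empty result should fail the deploy.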

Outcomes: scoped recall@5 rose from 78% to 89%; reranker input shrank from 40 to 20 chunks (same latency budget); zero ACL violations on the audit set.

Technique decision table

Scenario	Prefer	Avoid
Multi-tenant ACL on shared index	Pre-filter with payload indexes	Post-filter + LLM “ignore” instructions
Single large tenant (>40% of vectors)	Dedicated collection or namespace	Shared index with heavy pre-filter overhead
Soft facet (“prefer recent”)	Post-filter or reranker feature	Hard date cutoff that hides relevant older docs
Filter matches <0.05% of rows	Partitioned index per scope	Global HNSW pre-filter without tuning
Hybrid BM25 + vector	Same filter on both legs	Vector-only filtering
Staging vs production corpus	`env` payload field + mandatory filter	Separate clusters (cost) unless compliance requires
Complex OR ACL (any of 200 matters)	Flatten to allowed ID set server-side	200-clause OR in every query without caching

Metadata filtering complements language routing and drift monitors — filters narrow the candidate set; they do not fix wrong embeddings.

Pitfalls

LLM as access control — never rely on the model to discard unauthorized chunks; once retrieved, they sit in the prompt and leak into logs and cache layers.
Missing payload indexes — pre-filter without indexes on high-cardinality fields forces full scans.
Inconsistent filter on hybrid legs — lexical results bypass vector ACL.
Stale tombstones — deleted documents still embedded until reindex jobs complete; filter tombstones at query time.
Over-selective date filters — users miss governing docs signed before an arbitrary cutoff.
Client-supplied filter injection — treat filter JSON as server-computed from session auth, not raw user input.
Under-fetch with post-filter — empty results that look like “no knowledge” are often a retrieval bug.
Namespace sprawl — thousands of empty Pinecone namespaces complicate ops; consolidate small tenants.
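
The filter-injection pitfall is usually avoided by building the filter on the server and accepting only whitelisted facets from the client. This is a minimal sketch; the session shape, the facet whitelist, and the flat filter format are all assumptions.

```python
# Facets a client may narrow by. ACL fields are deliberately absent.
ALLOWED_CLIENT_FACETS = {"doc_type", "language"}

def build_filter(session, client_facets):
    """Derive ACL fields from the authenticated session; copy only whitelisted client facets."""
    flt = {
        "tenant_id": session["tenant_id"],
        "matter_id": sorted(session["matter_ids"]),
        "is_deleted": False,
    }
    for key, value in client_facets.items():
        if key in ALLOWED_CLIENT_FACETS:
            flt[key] = value
    return flt

session = {"tenant_id": "acme", "matter_ids": {"m2", "m1"}}
# The client tries to override tenant_id; the override is ignored.
flt = build_filter(session, {"tenant_id": "globex", "doc_type": "msa"})
```

Because ACL keys are never read from the request, a tampered request can narrow the search but cannot widen it.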

Production checklist

Define required payload fields at ingest; reject upserts missing ACL keys.
Build payload / partial indexes on every field used in pre-filters.
Compute filters server-side from authenticated session — never trust the client.
Use pre-filter for authorization boundaries; post-filter only for soft facets.
Apply identical filters to dense and sparse retrieval legs before fusion.
Benchmark recall@k under filter with per-tenant golden queries.
Load-test filtered ANN p95; tune ef / probes when selectivity is high.
Alert on empty-result rate and post-filter shrinkage ratio.
Integration-test cross-tenant isolation on every deploy.
Document filter schema versioning; migrate payloads before enabling new required fields.
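
The two alerting signals in the checklist can be computed from per-query retrieval logs. The event shape below (fetched and returned counts per query) is an assumption, and thresholds are left to the operator.

```python
def retrieval_health(events):
    """Compute empty-result rate and post-filter shrinkage ratio over a batch of queries."""
    total = len(events)
    empty = sum(1 for e in events if e["returned"] == 0)
    fetched = sum(e["fetched"] for e in events)
    returned = sum(e["returned"] for e in events)
    return {
        # Share of queries that came back with no results at all.
        "empty_rate": empty / total if total else 0.0,
        # Share of fetched rows discarded by filtering after retrieval.
        "shrinkage": 1 - returned / fetched if fetched else 0.0,
    }

events = [
    {"fetched": 40, "returned": 0},
    {"fetched": 40, "returned": 10},
    {"fetched": 40, "returned": 30},
    {"fetched": 40, "returned": 40},
]
health = retrieval_health(events)
```

A shrinkage ratio that stays high after a pre-filter migration suggests that some code path is still filtering results after retrieval.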

Key takeaways

Metadata filtering scopes ANN search to rows matching structured predicates.
Pre-filter is mandatory for ACL; post-filter alone wastes top-k budget and leaks context.
Payload indexes and tuned ANN params matter when filters are selective.
Harbor Legal recall@5 rose 11 points after pre-filter replaced post-filter ACL.
Hybrid pipelines must filter BM25 and vector legs identically.