LLM SPLADE sparse retrieval explained

Harbor Support indexed 22,000 internal runbooks with BGE-large dense vectors and classic BM25 side by side. On a 640-query eval set heavy with acronyms, error codes, and product SKUs — “SSO SAML timeout 502,” “invoice INV-8842 reversal,” “GDPR DSAR export SLA” — dense-only recall@10 was 74% and BM25-only was 69%. Dense search paraphrased well but smeared exact tokens; BM25 matched literals but missed that “single sign-on” should retrieve “SAML federation” pages. Adding SPLADE v2 as the sparse leg — a transformer that learns which vocabulary terms to activate per document and query — lifted hybrid recall@10 to 88% while keeping sub-40 ms p95 retrieval on Elasticsearch, because scoring still runs on an inverted index, not a brute-force vector scan.

SPLADE (Sparse Lexical and Expansion model) sits in the middle of the retrieval spectrum: more expressive than hand-tuned BM25, cheaper at scale than cross-encoders or ColBERT late interaction, and complementary to dense embedding search in hybrid RAG pipelines. This guide explains how learned sparse retrieval works, SPLADE v1 vs v2, indexing and query encoding, fusion with dense results, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Where SPLADE fits in the retrieval stack

Production RAG usually runs a first-stage retriever that returns 20–100 candidate chunks, then optionally a cross-encoder reranker before the LLM reads context. First-stage options fall into three families:

  • Lexical (BM25) — fast, exact-token friendly, brittle to paraphrase and synonymy.
  • Dense (bi-encoder) — semantic similarity in embedding space, weak on rare IDs and numbers unless seen in training.
  • Sparse neural (SPLADE, uniCOIL) — learns term importance and expansion into a high-dimensional but mostly empty weight vector, scored with inverted-index engines like Elasticsearch or Lucene.

SPLADE keeps the infrastructure virtues of keyword search — postings lists, skip lists, filterable metadata — while letting the model decide that a document about “OAuth refresh token rotation” should also activate weights on “JWT,” “session,” and “revocation” even if those words never appear verbatim in the chunk.

How SPLADE encoding works

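SPLADE starts from a masked language model backbone (typically DistilBERT or similar). For each input token, the MLM head produces a logit for every vocabulary entry. Each logit passes through a ReLU and a log saturation, log(1 + ReLU(x)), which zeroes negative logits and dampens very large ones. A sparsity regularizer applied during training pushes most of the remaining weights to zero, so only a small subset of terms end up with non-zero scores: typically 50–150 active dimensions per query or document, out of 30,000+.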
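Written out, the activation and pooling look roughly like this. The symbols are chosen here for illustration: w_{ij} is the MLM logit for vocabulary term j at input position i, and t is the tokenized input. Max pooling is the v2 choice; the original model summed over positions instead.

```latex
a_{ij} = \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr),
\qquad
w_j = \max_{i \in t} a_{ij}
\quad \text{(v2)},
\qquad
w_j = \sum_{i \in t} a_{ij}
\quad \text{(v1)}
```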

Document indexing

At index time, each chunk is tokenized, run through the SPLADE encoder, and the per-token vocabulary activations are max-pooled into one sparse document vector. Non-zero (term_id, weight) pairs are written to an inverted index exactly like BM25 postings, often with learned weights replacing TF-IDF.
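As a minimal sketch of that indexing step, assuming the encoder has already produced per-token vocabulary logits: the toy 4-term vocabulary, the document IDs, and the function names below are all illustrative, not part of any SPLADE library.

```python
import math
from collections import defaultdict


def splade_pool(token_logits):
    """Max-pool log(1 + ReLU(logit)) over tokens.

    token_logits: one list of vocabulary logits per input token.
    Returns a sparse {term_id: weight} dict with zero weights dropped.
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits in token_logits:
        for term_id, logit in enumerate(logits):
            activation = math.log1p(max(logit, 0.0))
            pooled[term_id] = max(pooled[term_id], activation)
    return {t: w for t, w in enumerate(pooled) if w > 0.0}


def build_postings(doc_vectors):
    """Invert {doc_id: {term_id: weight}} into {term_id: [(doc_id, weight)]}."""
    postings = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term_id, weight in vector.items():
            postings[term_id].append((doc_id, weight))
    return dict(postings)


# Toy logits over a 4-term vocabulary; a real encoder emits 30,000+ per token.
doc_vectors = {
    "doc-a": splade_pool([[2.0, -1.0, 0.0, 0.5], [0.0, -3.0, 1.0, -0.2]]),
    "doc-b": splade_pool([[-1.0, 0.3, 0.0, -4.0]]),
}
postings = build_postings(doc_vectors)
```

Only terms with a positive pooled weight reach the postings, which is what keeps the index compact.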

Query encoding

Queries use the same encoder. Query-time inference is a single forward pass over a short input, usually 5–15 ms on CPU, followed by dot-product scoring against posting lists. Unlike a cross-encoder, which needs a forward pass for every query-document pair, no model runs over corpus documents at query time: their weights were computed at index time.
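The scoring step can be sketched as a term-at-a-time dot product over posting lists. The postings, term IDs, and runbook IDs below are made up for illustration; a production engine would add skipping and pruning on top of this.

```python
from collections import defaultdict


def score(query_vector, postings, top_k=10):
    """Accumulate query_weight * doc_weight for every shared term."""
    scores = defaultdict(float)
    for term_id, q_weight in query_vector.items():
        for doc_id, d_weight in postings.get(term_id, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by doc ID for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical postings: term_id -> [(doc_id, learned weight)].
postings = {
    7: [("runbook-12", 1.5), ("runbook-40", 0.4)],
    19: [("runbook-40", 2.0)],
    23: [("runbook-77", 0.9)],
}
query_vector = {7: 1.0, 19: 0.5}  # sparse query encoding
results = score(query_vector, postings)
```

Documents that share no active term with the query, such as runbook-77 here, are never touched, which is the main reason latency stays close to keyword search.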

SPLADE v1 vs v2

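SPLADE v1 introduced the sparse expansion idea: log-saturated MLM activations, summed over input tokens and trained with a FLOPS sparsity regularizer that keeps vectors sparse and posting lists short. SPLADE v2 switched to max pooling, added a document-only variant that needs no query encoder, and introduced distillation from a cross-encoder teacher, which improved effectiveness while keeping the same sparsity control. Production deployments today overwhelmingly use v2-era checkpoints (e.g. naver/splade_v2_distil or the later naver/splade-cocondenser-ensembledistil) unless you are reproducing research baselines.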
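The sparsity regularizer used in the SPLADE papers is the FLOPS penalty: for each vocabulary term, average its weight over the batch, square it, and sum over the vocabulary. A minimal sketch on plain Python lists follows; the toy batches are invented, and a real training loop would compute this on tensors and scale it by a tunable coefficient.

```python
def flops_penalty(batch):
    """Sum over terms of (mean weight of that term across the batch) squared.

    batch: list of dense, non-negative weight vectors, one per document.
    """
    n_docs = len(batch)
    vocab_size = len(batch[0])
    total = 0.0
    for term_id in range(vocab_size):
        mean_weight = sum(doc[term_id] for doc in batch) / n_docs
        total += mean_weight ** 2
    return total


spread = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # distinct terms
shared = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]  # same term everywhere
dense = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]   # every term active

penalties = {
    "spread": flops_penalty(spread),
    "shared": flops_penalty(shared),
    "dense": flops_penalty(dense),
}
```

The penalty grows when many terms are active and when the same term fires across many documents, which discourages long posting lists in particular.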

SPLADE vs BM25 vs dense embeddings

| Signal | Strength | Weakness |
|---|---|---|
| BM25 | Zero ML inference, predictable on SKUs and codes | No learned synonymy; brittle stemming choices |
| Dense bi-encoder | Paraphrase and conceptual match | Pooling blurs rare tokens; ANN recall tuning |
| SPLADE | Learned expansion + exact-term scoring | Extra encode step; vocabulary tied to model |
| ColBERT / cross-encoder | Highest precision per candidate | Latency, memory, or expensive pre-index |

SPLADE does not replace dense search in most corpora — it replaces or augments BM25 as the sparse leg. Harbor Support kept dense for conceptual questions (“how do I offboard a contractor”) and let SPLADE carry acronym and compliance-code queries where lexical grounding mattered.

Hybrid pipeline: SPLADE + dense + fusion

A typical three-index pattern:

  1. Dense HNSW index on bi-encoder embeddings (BGE, E5, etc.).
  2. Sparse index on SPLADE weights in Elasticsearch/OpenSearch (rank_features or learned-sparse plugins) or a dedicated engine.
  3. Fusion layer merging top-k from each with reciprocal rank fusion (RRF) or a weighted linear combo — RRF is robust when score scales differ.
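For the sparse index in step 2, one possible shape on Elasticsearch is a rank_features field queried with rank_feature clauses. The sketch below only builds request bodies as Python dicts and sends nothing. The field names splade and team are hypothetical, and the mapping and query options should be checked against the documentation for your Elasticsearch or OpenSearch version.

```python
import json

# Hypothetical index mapping: one rank_features field holds term -> weight.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "team": {"type": "keyword"},
            "splade": {"type": "rank_features"},
        }
    }
}


def to_document(text, team, splade_weights):
    """Build an index document; rank_features only accepts positive weights."""
    positive = {term: w for term, w in splade_weights.items() if w > 0.0}
    return {"text": text, "team": team, "splade": positive}


def to_query(query_weights, team=None, size=50):
    """One rank_feature clause per active query term.

    With the linear function, each clause scores boost * stored weight,
    so the summed should-clauses approximate the sparse dot product.
    """
    should = [
        {"rank_feature": {"field": f"splade.{term}", "linear": {}, "boost": weight}}
        for term, weight in query_weights.items()
    ]
    bool_query = {"should": should, "minimum_should_match": 1}
    if team is not None:
        bool_query["filter"] = [{"term": {"team": team}}]
    return {"size": size, "query": {"bool": bool_query}}


doc = to_document("Reverse an invoice", "billing", {"invoice": 1.9, "reversal": 1.2, "pad": 0.0})
body = to_query({"invoice": 1.4, "refund": 0.6}, team="billing")
print(json.dumps(body, indent=2))
```

Because the metadata filter lives in the same bool query, filtering and sparse scoring happen in one request against the inverted index.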

Fetch 50–100 from each leg, fuse to 30–50 unique chunk IDs, optionally rerank top 15 with a cross-encoder, pass top 5–8 to the LLM. Pair with sensible chunking and contextual enrichment so SPLADE’s term activations land on self-contained passages.
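Reciprocal rank fusion itself is only a few lines. This sketch assumes each leg returns an ordered list of chunk IDs; the IDs and the default k=60 are illustrative.

```python
def rrf(rankings, k=60, top_n=None):
    """Fuse ranked lists: each hit contributes 1 / (k + rank), rank from 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if top_n is None else fused[:top_n]


sparse_hits = ["c3", "c1", "c7"]  # SPLADE leg, best first
dense_hits = ["c1", "c9", "c3"]   # dense leg, best first
fused = rrf([sparse_hits, dense_hits])
```

Only ranks enter the formula, so the SPLADE dot products and dense cosine similarities never need to be put on a common scale.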

Index refresh: SPLADE document vectors must be recomputed when chunks change — budget roughly 2–4× BM25 indexing CPU because of the transformer forward pass. Batch on ingest workers; cache encodings by content hash.
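Caching by content hash can be sketched as follows. The in-memory dict and the fake_encode stand-in are assumptions for illustration; a real pipeline would call the SPLADE encoder and persist the cache in a shared store.

```python
import hashlib


class EncodingCache:
    """Memoize sparse encodings by the SHA-256 of the chunk text."""

    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(text)
        return self._store[key]


def fake_encode(text):
    """Stand-in for the transformer forward pass (hypothetical)."""
    return {word: 1.0 for word in text.lower().split()}


cache = EncodingCache(fake_encode)
chunks = [
    "Rotate the SAML certificate",
    "Reverse an invoice",
    "Rotate the SAML certificate",  # unchanged chunk seen again
]
vectors = [cache.get(chunk) for chunk in chunks]
```

An unchanged chunk hashes to the same key and skips the encoder, so re-ingesting a mostly static corpus costs little more than hashing.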

Case study: Harbor Support KB refactor

Problem. Tier-1 agents searched a unified KB for billing, identity, and privacy runbooks. BM25 + dense hybrid under-retrieved on machine-oriented queries where the user typed an error code or internal ticket macro name.

Change. The team swapped BM25 for SPLADE v2 on the same Elasticsearch cluster, kept BGE-large dense vectors in a parallel index, and fused with RRF (k=60). They added a lightweight query router: if the query matched /[A-Z]{2,}-?\d+/ or contained known product codes, SPLADE weight in fusion doubled.
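A router of this kind could look like the sketch below. The regex is the one quoted above; the KNOWN_CODES set, the function names, and the weighted form of RRF are assumptions for illustration, not Harbor Support's actual implementation.

```python
import re

CODE_PATTERN = re.compile(r"[A-Z]{2,}-?\d+")
KNOWN_CODES = {"HARBORPAY", "HARBORID"}  # hypothetical product codes


def sparse_weight(query, base=1.0):
    """Double the sparse leg's fusion weight for code-like queries."""
    tokens = set(query.upper().split())
    if CODE_PATTERN.search(query) or tokens & KNOWN_CODES:
        return 2.0 * base
    return base


def weighted_rrf(sparse_hits, dense_hits, w_sparse=1.0, w_dense=1.0, k=60):
    """RRF where each leg's contribution is scaled by a per-leg weight."""
    scores = {}
    for weight, ranking in ((w_sparse, sparse_hits), (w_dense, dense_hits)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse_hits = ["b", "a"]
dense_hits = ["a", "b"]

literal_query = "invoice INV-8842 reversal"
conceptual_query = "how do I offboard a contractor"

boosted = weighted_rrf(sparse_hits, dense_hits, w_sparse=sparse_weight(literal_query))
neutral = weighted_rrf(sparse_hits, dense_hits, w_sparse=sparse_weight(conceptual_query))
```

With equal weights the two legs cancel out in this toy case; doubling the sparse weight lets the SPLADE ordering win for the code-like query.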

Results. Recall@10 on the acronym-heavy eval slice rose from 71% to 88%. Median retrieval latency went from 28 ms to 34 ms (SPLADE encode + fusion). Reranker calls dropped 18% because the fused first stage was cleaner. Index size grew 1.6× versus BM25-only — still far smaller than storing ColBERT token vectors per chunk.

Lesson. SPLADE paid off where lexical grounding was the bottleneck; purely conceptual FAQs saw marginal gain. Measure per query cluster before mandating SPLADE everywhere.
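Measuring per query cluster can be as simple as grouping recall@k by a cluster label. In this sketch the cluster names, document IDs, and the definition of recall (fraction of relevant documents found in the top k) are assumptions; adapt them to however your eval set is labeled.

```python
from collections import defaultdict


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def recall_by_cluster(runs, k=10):
    """Average recall@k separately for each query cluster."""
    per_cluster = defaultdict(list)
    for run in runs:
        per_cluster[run["cluster"]].append(
            recall_at_k(run["retrieved"], run["relevant"], k)
        )
    return {name: sum(vals) / len(vals) for name, vals in per_cluster.items()}


# Hypothetical eval rows: one per query.
runs = [
    {"cluster": "code", "retrieved": ["d1", "d2"], "relevant": {"d1"}},
    {"cluster": "code", "retrieved": ["d3"], "relevant": {"d4"}},
    {"cluster": "conceptual", "retrieved": ["d5", "d6"], "relevant": {"d5", "d6"}},
]
report = recall_by_cluster(runs)
```

Running the same report for each retriever configuration shows where SPLADE helps and where it does not, instead of one blended number.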

Technique decision table

| Your situation | Prefer | Avoid |
|---|---|---|
| Small FAQ, mostly natural-language questions | BM25 + dense hybrid | SPLADE ops overhead for tiny corpora |
| Docs heavy with codes, SKUs, regulations | SPLADE + dense + RRF | Dense-only first stage |
| Sub-10 ms retrieval, no GPU on ingest | BM25 + dense | SPLADE batch encoding without capacity |
| Elasticsearch already in stack | SPLADE via sparse vector fields | Standalone exotic index if ops cost matters |
| Need highest precision on 200K+ legal clauses | ColBERT or two-stage rerank | SPLADE alone for token-level alignment |
| Multilingual corpus | Multilingual dense + language-specific sparse | English-only SPLADE checkpoint on mixed docs |
| Frequent doc updates | Content-hash incremental SPLADE encode | Full corpus nightly re-encode |

Common pitfalls

  • Treating SPLADE as a drop-in for dense search. It is a better sparse leg, not a semantic replacement.
  • Skipping fusion. SPLADE + dense together beat either alone on mixed query traffic.
  • English SPLADE on non-English docs. Activations land on wrong subwords; use language-aware routing.
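  • Oversized chunks. SPLADE max-pools over tokens, so a very long chunk activates many terms and its rare-term weights stand out less; BERT-style backbones also cap input at 512 tokens, so a 2,000-token chunk is truncated unless you split it.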
  • Ignoring index bloat. Monitor non-zero weights per document; sparsity regularization matters.
  • Adding raw scores across legs. SPLADE, BM25, and dense similarity scores live on different scales; use RRF or another rank-based fusion, not raw score addition.
  • No eval split by query type. Aggregate recall hides acronym-cluster wins and FAQ regressions.
  • Cold-start ingest without batching. SPLADE encoding can bottleneck indexing pipelines.
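For the index-bloat pitfall, a small report over stored sparse vectors is often enough to catch drift. The budget of 150 mirrors the typical 50–150 active dimensions mentioned earlier; the function name and toy vectors are illustrative.

```python
def sparsity_report(doc_vectors, budget=150):
    """Summarize non-zero weights per document and flag outliers."""
    counts = {doc_id: len(vector) for doc_id, vector in doc_vectors.items()}
    return {
        "mean_nonzero": sum(counts.values()) / len(counts),
        "max_nonzero": max(counts.values()),
        "over_budget": sorted(d for d, c in counts.items() if c > budget),
    }


# Toy vectors: only the number of active terms matters here.
doc_vectors = {
    "kb-1": {term_id: 1.0 for term_id in range(80)},
    "kb-2": {term_id: 1.0 for term_id in range(120)},
    "kb-3": {term_id: 1.0 for term_id in range(400)},  # suspiciously dense
}
report = sparsity_report(doc_vectors)
```

A rising mean or a growing over-budget list after a checkpoint change or a chunking change is an early signal that posting lists, and therefore latency, will grow.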

Production checklist

  • Benchmark BM25-only, dense-only, and SPLADE+dense on query clusters (conceptual vs code/SKU).
  • Pick a SPLADE v2 checkpoint matched to the language and domain of the corpus.
  • Store sparse weights in an inverted-index engine with metadata filters.
  • Fuse sparse and dense with RRF; tune k on a held-out set.
  • Keep chunks 256–512 tokens for stable term activation.
  • Batch document encoding on ingest; cache by content hash.
  • Monitor p95 query encode latency and posting-list scan time separately.
  • Add optional cross-encoder rerank only on fused top 15–20.
  • Track index size and non-zero weights per document over time.
  • Re-encode only changed chunks on update.
  • Document when query router boosts sparse vs dense weight.

Key takeaways

  • SPLADE learns sparse vocabulary activations per document and query, combining neural synonymy with inverted-index efficiency.
  • It replaces or augments BM25 in hybrid RAG — not dense embeddings — and shines on acronym, code, and SKU-heavy corpora.
  • Fuse SPLADE with dense retrieval via RRF; measure per query cluster before rolling out.
  • Harbor Support lifted recall@10 from 71% to 88% on literal-heavy queries with only modest latency cost.
  • Pair SPLADE with good chunking and optional reranking; sparse neural retrieval is a first-stage tool, not the whole pipeline.
