Explainer · 7 June 2026

How inverted indexes and full-text search work

You type three words into a search box and get ranked results from billions of pages in under 200 milliseconds. A relational database running WHERE body LIKE '%solana%wallet%' on the same corpus would scan every row and still miss stemmed variants, typos, and relevance ordering. The trick is not faster disks — it is a different data structure. A forward index maps each document to the terms it contains; an inverted index maps each term to the documents that contain it. Full-text search engines are built almost entirely on inverted indexes plus scoring functions that rank matches instead of merely listing them.

Forward vs inverted: the core flip

Imagine a library catalog stored two ways:

Forward — book ID → list of words in that book. To find every book mentioning "consensus," walk every book and read every page.
Inverted — word → list of book IDs (and positions) where that word appears. Look up "consensus" once, get a short posting list.

Lookup cost becomes proportional to the number of documents matching the query term, not the total corpus size. That is why Google, Elasticsearch, Solr, and the search box in your issue tracker all precompute inverted structures at index time and answer queries by intersecting or merging small posting lists at query time.

This is complementary to, not a replacement for, B-tree indexes on primary keys or foreign keys. SQL indexes excel at exact equality and range predicates on structured columns; inverted indexes excel at unstructured text, fuzzy matching, and relevance-ranked retrieval over token sequences.

Tokenization: from raw text to index terms

Before indexing, raw text passes through an analyzer pipeline:

Character filters — strip HTML tags, normalize Unicode, expand contractions if configured.
Tokenizers — split on whitespace and punctuation boundaries. Solana-wallet might become one token or two depending on rules.
Token filters — lowercase, remove stop words ("the", "a", "is"), apply stemming ("running" → "run") or lemmatization, optionally expand synonyms.

Analyzer choice is not cosmetic: the same document indexed with different analyzers produces different posting lists. English prose typically uses aggressive stemming; product SKUs and wallet addresses need keyword tokenizers that preserve exact strings. Mismatched analyzers at index vs query time are a common reason searches return zero hits despite visible text on the page.

Posting lists and positions

Each unique term in the vocabulary points to a posting list: an ordered sequence of entries, traditionally sorted by document ID for fast intersection. Each entry records at minimum which document contains the term; most engines also store:

Term frequency (tf) — how many times the term appears in that document (used in ranking).
Positions — ordinal offsets of each occurrence within the document (enables phrase queries like "proof of stake").
Payloads — optional per-occurrence metadata (font size, section tag) for boosting titles over footnotes.

Posting lists for rare terms are tiny; lists for common terms ("blockchain", "data") can span millions of documents. Engines compress lists with delta encoding (store gaps between doc IDs instead of absolute IDs) and variable-byte integer codecs so sequential scans stay cache-friendly — the same locality principles that make CPU cache layout matter for index traversal.

Boolean retrieval: AND, OR, NOT

The simplest query model treats terms as sets of document IDs:

solana AND wallet — intersect two posting lists (merge walk on sorted IDs).
solana OR ethereum — union lists, deduplicating IDs.
solana NOT scam — subtract the scam list from the solana list.

Intersection is the expensive operation when one term is very common. Optimizers reorder clauses to evaluate the rarest term first, shrinking working sets early. Phrase queries add a position check: after intersecting document IDs for each word, verify that positions of word B equal positions of word A plus one (with optional slop for "near" matches).

Pure boolean search returns an unordered bag of matches. Users expect Google-style ranked results, so production systems layer a scoring function on top of the boolean candidate set.

Ranking: TF-IDF and BM25

Early vector-space models score each document with term frequency × inverse document frequency (TF-IDF):

TF — more occurrences of the query term in a document usually means more relevance (with diminishing returns).
IDF — rare terms across the corpus get higher weight; everyone mentions "the," so it discriminates poorly.

Lucene's default since version 6 is BM25, a probabilistic variant that saturates term frequency (the 50th repetition of "wallet" does not help as much as the 2nd) and normalizes for document length so a 10,000-word whitepaper is not automatically more relevant than a focused paragraph. Field boosts multiply scores — a match in title often counts more than the same token buried in body.

BM25 answers "which documents mention my terms prominently?" It does not understand semantics: "automobile" and "car" are different tokens unless synonym filters bridge them. Modern systems add dense vector search (embeddings) alongside inverted indexes for hybrid retrieval — similar in spirit to how recommendation systems combine keyword filters with embedding similarity.

Index structure at scale: segments and merges

Apache Lucene (the engine inside Elasticsearch and Solr) writes immutable segments — self-contained mini-indexes appended on each batch of new documents. A query fans out to all live segments and merges their hit lists. Background segment merges compact many small segments into fewer large ones, trading write amplification for faster reads.

Deletes are usually logical: documents are marked tombstoned in a live bitset rather than rewritten immediately. Periodic merges purge deleted docs. Updates are implemented as delete-plus-reinsert because postings are immutable once written. This append-only design mirrors log-structured storage and plays well with SSD sequential write patterns.

Distributed search clusters shard the inverted index horizontally — each shard holds a subset of documents. A coordinator node broadcasts the query, collects top-k hits from each shard, and performs a final merge. Shard count is a capacity planning knob: too few shards limit parallelism; too many multiply coordination overhead.

What inverted indexes cannot do well

Substring and regex — *lana wildcard queries devolve into scanning the term dictionary unless n-gram indexes are maintained separately (space-heavy).
Structured joins — "users who bought SKU X and live in Texas" is still a SQL job unless you denormalize into the search document.
Strong consistency on write — near-real-time search means documents appear after a refresh interval, not in the same transaction as your OLTP insert.
Exact aggregations at petabyte scale — search engines compute facets, but heavy analytics often still flow to column stores.

For membership tests at scale without storing full lists, Bloom filters sometimes guard whether a term can possibly exist in a shard before touching disk — another layer in the same performance toolbox.

Practical checklist

Match analyzers at index and query time; test stemming on your domain vocabulary.
Store filterable metadata as keyword fields, searchable prose as text fields.
Put high-signal text (titles, headings) in boosted fields.
Monitor segment count and merge backlog — read latency creeps when merges lag.
Use keyword search for lexical match; add vectors when users need semantic similarity.
Keep primary-key lookups on B-trees; reserve inverted indexes for unstructured text.