Elasticsearch fundamentals explained
Your product catalog has 2.4 million SKUs. Users type “wireless noise
canceling headphones” and Postgres ILIKE '%headphones%' scans
the table for eight seconds. You need ranked, typo-tolerant full-text search with
faceted filters — price range, brand, rating — without melting the primary database.
Elasticsearch is a distributed search and analytics engine built
on Apache Lucene: JSON documents in, relevance-scored results out,
with aggregations for dashboards and log analytics at the same time. Wikipedia,
GitHub, and Stack Overflow route search through Elasticsearch or its fork OpenSearch.
It is not a replacement for
PostgreSQL
as your system of record — it is a specialized index optimized for text retrieval,
often fed by change streams from
Kafka or CDC
pipelines. This guide covers inverted indexes, indices and mappings, analyzers,
the Query DSL, aggregations, shard and replica topology, bulk indexing,
near-real-time refresh semantics, an e-commerce search worked example, when
Elasticsearch beats Postgres full-text or
vector semantic search,
pitfalls, and a production checklist.
What Elasticsearch is
Elasticsearch stores data as JSON documents grouped into
indices (roughly analogous to database tables). Each document
has a unique _id. Under the hood, Lucene builds an
inverted index: for every term (token), a sorted list of
document IDs where that term appears. A query looks up its terms in the term
dictionary and reads only the matching lists instead of
scanning every row, which is the core reason search stays fast at scale.
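To make the idea concrete, here is a toy inverted index in Python. It is only a sketch: Lucene stores postings in compressed on-disk structures, and the sample documents and the function names `build_inverted_index` and `search_all_terms` are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to a sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all_terms(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    result = None
    for term in query.lower().split():
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Hypothetical sample data
docs = {
    1: "wireless noise canceling headphones",
    2: "wired studio headphones",
    3: "wireless charging pad",
}
index = build_inverted_index(docs)
print(index["headphones"])                             # [1, 2]
print(search_all_terms(index, "wireless headphones"))  # [1]
```

The query touches only the two postings lists it needs, no matter how many other documents exist.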
Search plus analytics in one cluster
The same cluster that powers product search also runs aggregations — histograms, terms buckets, percentiles — over millions of log lines or metrics. That dual role made Elasticsearch the default for observability stacks before dedicated column stores matured. Today many teams use it primarily for search and ship logs to cheaper object storage, but the aggregation engine remains a first-class feature.
Near-real-time, not instantaneous
New documents are searchable after a refresh (default every
one second), not the instant you index them. A refresh writes buffered documents
to a new Lucene segment and opens it for search. Heavy indexing workloads often increase refresh_interval
to 30s or disable auto-refresh during bulk loads, then force a refresh when
done. Treat Elasticsearch as eventually searchable, not a
strongly consistent primary store.
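The visibility gap can be modeled with a few lines of Python. This is a toy model, not Elasticsearch code: the `ToyIndex` class and its methods are hypothetical and only mimic the buffer-then-refresh behavior described above.

```python
class ToyIndex:
    """Toy model of near-real-time search: writes stay invisible until refresh()."""

    def __init__(self):
        self._buffer = []     # indexed but not yet searchable
        self._segments = []   # each refresh adds one immutable "segment"

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Move buffered documents into a new searchable segment.
        if self._buffer:
            self._segments.append(tuple(self._buffer))
            self._buffer = []

    def search(self, word):
        return [d for seg in self._segments for d in seg if word in d.split()]

idx = ToyIndex()
idx.index("wireless headphones")
before = idx.search("headphones")   # nothing visible yet
idx.refresh()
after = idx.search("headphones")    # visible once the refresh has run
print(before, after)                # [] ['wireless headphones']
```

A read-your-own-write flow that cannot tolerate this gap should read from the primary store rather than from the search index.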
Cluster topology: nodes, shards, and replicas
An Elasticsearch cluster is a set of nodes sharing the same cluster name. Node roles include master-eligible (cluster state), data (store shards), ingest (pre-index pipelines), and coordinating (route requests). Small clusters often run combined nodes; production separates master and data for stability.
Primary shards and replicas
Each index is split into primary shards. The count is set at index creation; changing it later means using the shrink or split APIs or reindexing into a new index. Replica shards are copies of primaries for read scaling and failover. A query hits all relevant shards in parallel; the coordinating node merges scores and returns the top hits. Too few shards under-utilizes hardware; too many creates coordination overhead. A common rule of thumb is to target shard sizes of 10–50 GB, but measure your own data.
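As a back-of-the-envelope aid, the rule of thumb turns into simple arithmetic. The helper below is a hypothetical sketch; the 30 GB default is an assumed midpoint of the 10–50 GB range, not an official recommendation.

```python
import math

def primary_shard_count(expected_index_gb, target_shard_gb=30):
    """Estimate how many primaries keep each shard near target_shard_gb."""
    if expected_index_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(expected_index_gb / target_shard_gb))

print(primary_shard_count(400))  # 14, because 400 / 30 rounds up from 13.33
print(primary_shard_count(12))   # 1, a small index needs only one primary
```

Treat the result as a starting point for load tests with representative documents.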
Routing and document placement
Document shard = hash(_routing) % num_primary_shards. Default routing
is the document _id. Custom routing (e.g., tenant_id)
co-locates a tenant’s documents on one shard for faster filtered queries,
but risks hot shards if one tenant dominates traffic.
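The hot-shard risk is easy to see in a small simulation. This sketch uses CRC32 as a stand-in hash (Elasticsearch uses a different hash function internally), and the tenant names, document IDs, and `shard_for` helper are made up for illustration.

```python
import zlib
from collections import Counter

def shard_for(routing_value, num_primary_shards):
    """Stand-in for hash(_routing) % num_primary_shards, using CRC32."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

NUM_SHARDS = 4
# Hypothetical workload: tenant-a owns 90 documents, tenant-b owns 10.
docs = [("tenant-a", f"doc-{i}") for i in range(90)]
docs += [("tenant-b", f"doc-{i}") for i in range(90, 100)]

by_id = Counter(shard_for(doc_id, NUM_SHARDS) for _, doc_id in docs)
by_tenant = Counter(shard_for(tenant, NUM_SHARDS) for tenant, _ in docs)

print("routed by _id:   ", dict(by_id))      # spread across shards
print("routed by tenant:", dict(by_tenant))  # at most two shards hold everything
```

With tenant routing, whichever shard receives tenant-a holds at least 90% of the documents.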
Mappings, field types, and analyzers
A mapping defines field names, types, and how text is processed. Dynamic mapping guesses types on first sight of a field — convenient in development, dangerous in production when a typo creates a new field with the wrong type forever.
Common field types
- text: analyzed full-text; not used for sorting or exact match.
- keyword: whole-value tokens for filters, aggregations, and sorting.
- date, long, double, boolean: numeric and temporal with range query support.
- nested: array of objects indexed as separate hidden documents (preserves object boundaries).
- dense_vector: float arrays for kNN similarity (pairs with embedding pipelines).
The multi-field pattern is standard: a title field
as text for search plus title.keyword for exact sort and
aggregation.
Analyzers: index time vs search time
An analyzer chains optional character filters, a tokenizer
(splits text), and token filters (lowercase, stemming, stop words).
At index time, the english analyzer turns “Running Shoes” into tokens [run, shoe].
At search time, the query analyzer must produce compatible tokens; mismatched analyzers cause missed
hits. Use the _analyze API to debug tokenization before shipping a
mapping change.
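The three stages can be imitated in plain Python. This is a deliberately crude sketch: the stop-word list and the `naive_stem` rules are invented here and are far simpler than the stemmers Lucene ships, so the output only resembles a real english analyzer for easy inputs.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "for"}  # illustrative subset

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def naive_stem(token):
    """Token filter: crude suffix stripping, not a real stemming algorithm."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]  # "runn" becomes "run"
    elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token

def analyze(text):
    """Run the chain: char filter, tokenizer, lowercase, stop words, stemming."""
    tokens = [t.lower() for t in tokenize(strip_html(text))]
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("<b>Running</b> Shoes"))  # ['run', 'shoe']
print(analyze("running shoe"))          # ['run', 'shoe']
```

Because both the indexed text and the query text pass through the same chain, they meet on the same tokens. Against a real cluster, the _analyze API with a body such as {"analyzer": "english", "text": "Running Shoes"} shows the actual token stream.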
The Query DSL
Elasticsearch queries are JSON sent to GET /index/_search. The
Query DSL separates query context (scoring
relevance) from filter context (yes/no, cacheable, no scoring).
Core query types
- match: full-text search with analyzer on the query string; use operator: "and" for stricter matching.
- match_phrase: token sequence with optional slop for word gaps.
- multi_match: search across title^3, description with field boosts.
- bool: compose must, should, must_not, and filter clauses.
- term / terms: exact match on keyword fields (not analyzed).
- range: numeric or date bounds; filters on price and in_stock belong here.
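A few of these query types, written as Python dictionaries that serialize to the JSON body of a search request. The field names (title, brand) and the sample values are assumptions for illustration; the sketch only builds and prints the body and does not call a cluster.

```python
import json

# Every query term must appear in the title.
strict_match = {"match": {"title": {"query": "wireless headphones", "operator": "and"}}}

# Tokens in order, allowing one position of slack between them.
phrase = {"match_phrase": {"title": {"query": "noise headphones", "slop": 1}}}

# Exact values on a keyword field; the input is not analyzed.
brands = {"terms": {"brand": ["Sony", "Bose"]}}

body = {
    "query": {
        "bool": {
            "must": [strict_match],   # contributes to the relevance score
            "should": [phrase],       # optional, raises the score when it matches
            "filter": [brands],       # yes/no only, no scoring
        }
    }
}
print(json.dumps(body, indent=2))
```

Clauses that only include or exclude documents belong in filter, where they skip scoring and can be cached.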
Relevance scoring with BM25
Default text scoring uses BM25 — term frequency saturates (repeating
a word helps less after a point), rare terms score higher, and shorter fields
score higher per term. Tune with boost on fields, function_score
for business rules (boost in-stock items), or rescore with a secondary
query on the top N hits. For deep relevance tuning, collect click logs and iterate
with learning-to-rank plugins — but start with sensible boosts before ML.
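The three properties can be checked numerically with the textbook BM25 formula. This is a sketch of the classic per-term formula, with k1 = 1.2 and b = 0.75 as parameters; Lucene's implementation differs in some details, and the corpus statistics below are invented.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one term in one field using the classic BM25 formula."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Hypothetical corpus statistics
base = dict(doc_len=100, avg_doc_len=100, n_docs=1_000_000, doc_freq=1_000)

# Term frequency saturates: the second occurrence helps more than the tenth.
gain_1_to_2 = bm25_term_score(tf=2, **base) - bm25_term_score(tf=1, **base)
gain_9_to_10 = bm25_term_score(tf=10, **base) - bm25_term_score(tf=9, **base)
print(gain_1_to_2 > gain_9_to_10)  # True

# Rare terms score higher than common ones.
rare = bm25_term_score(tf=1, **{**base, "doc_freq": 10})
common = bm25_term_score(tf=1, **{**base, "doc_freq": 100_000})
print(rare > common)  # True

# The same term counts for more in a shorter field.
short = bm25_term_score(tf=1, **{**base, "doc_len": 20})
long_ = bm25_term_score(tf=1, **{**base, "doc_len": 500})
print(short > long_)  # True
```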
Aggregations
Wrap a query in aggs to bucket results: terms on
brand.keyword, range on price,
date_histogram on created_at. Aggregations run on
the same shard data as search — faceted navigation (brand checkboxes beside
search results) is a single round trip.
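A request body combining the three aggregation types named above might look like the following sketch. The aggregation names (by_brand, by_price, by_month), the price boundaries, and the monthly interval are assumptions; the code only builds and prints the JSON.

```python
import json

facet_request = {
    "size": 24,
    "query": {"match": {"title": "headphones"}},
    "aggs": {
        # One bucket per brand value, top 10 by document count
        "by_brand": {"terms": {"field": "brand.keyword", "size": 10}},
        # Fixed price bands; "from" is inclusive and "to" is exclusive
        "by_price": {"range": {"field": "price", "ranges": [
            {"to": 50},
            {"from": 50, "to": 200},
            {"from": 200},
        ]}},
        # One bucket per calendar month
        "by_month": {"date_histogram": {
            "field": "created_at",
            "calendar_interval": "month",
        }},
    },
}
print(json.dumps(facet_request, indent=2))
```

The response carries the hits and one bucket list per named aggregation, so the result list and the facet counts come back together.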
Indexing documents at scale
Single-document POST /index/_doc is fine for low volume. Production
bulk loads use the Bulk API — newline-delimited action/metadata
pairs batched in megabyte-sized requests. Set refresh_interval: -1
during import, set index.number_of_replicas to 0 temporarily,
then restore both settings when the load finishes. Call _forcemerge only when you understand
the I/O cost.
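The Bulk API body format is simple enough to assemble by hand. The sketch below builds newline-delimited batches in Python; the `bulk_body` and `chunked` helpers, the products index name, and the sample documents are hypothetical, and a real loader would size batches by bytes rather than by a fixed document count.

```python
import json

def bulk_body(index_name, docs):
    """Serialize (doc_id, source) pairs as Bulk API NDJSON, ending with a newline."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical sample documents
products = [(str(i), {"title": f"Product {i}", "price": 10 + i}) for i in range(5)]
batches = [bulk_body("products", batch) for batch in chunked(products, 2)]

print(len(batches))        # 3
print(batches[0], end="")  # two action lines, each followed by its document
```

Each batch would be sent to POST /_bulk with the Content-Type header application/x-ndjson. The trailing newline is required.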
Ingest pipelines
Ingest nodes run processors before indexing: grok to
parse log lines, geoip enrichment, set to add fields,
remove to drop PII. Pipelines decouple transformation from application
code — your service publishes raw events; Elasticsearch normalizes them.
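A pipeline definition is a JSON document with an ordered list of processors. The sketch below shows one with set and remove, plus a toy local interpreter to make the effect visible. The field names and values are invented, and `apply_locally` is a hypothetical helper that imitates only these two processors on top-level fields; it is not how Elasticsearch executes pipelines.

```python
pipeline = {
    "description": "Tag events and drop PII before indexing",
    "processors": [
        {"set": {"field": "environment", "value": "production"}},
        {"remove": {"field": "email"}},
    ],
}

def apply_locally(pipeline, doc):
    """Toy interpreter for top-level `set` and `remove` processors only."""
    out = dict(doc)
    for processor in pipeline["processors"]:
        (name, config), = processor.items()
        if name == "set":
            out[config["field"]] = config["value"]
        elif name == "remove":
            # Simplification: silently ignores a missing field.
            out.pop(config["field"], None)
        else:
            raise ValueError(f"unsupported processor: {name}")
    return out

event = {"message": "login ok", "email": "a@example.com"}
print(apply_locally(pipeline, event))
# {'message': 'login ok', 'environment': 'production'}
```

On a cluster, the same definition would be registered through the ingest pipeline API and referenced when indexing.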
Keeping the index in sync
The canonical pattern: Postgres remains source of truth; an outbox or CDC stream
publishes changes to Kafka; a consumer indexes updates into Elasticsearch. On
failure, replay from the log. A periodic full reindex into a new index, followed by an alias swap
(products_v2 → alias products), fixes mapping mistakes
without downtime.
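The alias swap itself is one request to the aliases API with a remove action and an add action, which the cluster applies together. The sketch below builds that body; products_v1 is an assumed name for the previous index, and `alias_swap` is a hypothetical helper.

```python
import json

def alias_swap(alias, old_index, new_index):
    """Build a POST /_aliases body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap("products", "products_v1", "products_v2")
print(json.dumps(body, indent=2))
```

Applications keep querying the alias products throughout, so no client configuration changes when the new index takes over.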
Worked example: e-commerce product search
Harbor Market sells 180,000 products. Requirements: sub-200ms search, faceted brand and category filters, price sort, typo tolerance on titles.
Mapping sketch
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english",
                 "fields": { "keyword": { "type": "keyword" } } },
      "description": { "type": "text" },
      "brand": { "type": "keyword" },
      "category": { "type": "keyword" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "rating": { "type": "half_float" },
      "in_stock": { "type": "boolean" }
    }
  }
}
Search request
GET /products/_search
{
  "query": {
    "bool": {
      "must": [{ "multi_match": {
        "query": "wireless headphones",
        "fields": ["title^3", "description"],
        "fuzziness": "AUTO"
      }}],
      "filter": [
        { "term": { "in_stock": true }},
        { "range": { "price": { "lte": 299 }}}
      ]
    }
  },
  "aggs": {
    "brands": { "terms": { "field": "brand", "size": 20 }}
  },
  "sort": [{ "rating": "desc" }, "_score"],
  "size": 24
}
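Filters run in filter context (no scoring, cacheable). Fuzziness handles typos such as “headphnes”. The brands aggregation populates the sidebar brand facet. The sort orders hits by rating and uses the relevance score as the tiebreaker. Result latency: 45ms p95 on a three-node cluster with two primary shards and one replica each.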
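Why the typo still matches can be sketched with edit distance. The code below uses plain Levenshtein distance and the length thresholds of fuzziness AUTO (no edits for 1–2 characters, one for 3–5, two for longer terms). It is an illustration only: Elasticsearch by default also counts a transposition as a single edit, which this simple version does not.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

def auto_fuzziness(term):
    """Edits allowed under AUTO: 0 for 1-2 chars, 1 for 3-5, 2 for longer terms."""
    if len(term) <= 2:
        return 0
    return 1 if len(term) <= 5 else 2

typo = "headphnes"
distance = edit_distance(typo, "headphones")
print(distance)                          # 1
print(distance <= auto_fuzziness(typo))  # True
```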
When to use Elasticsearch
| Need | Elasticsearch | Alternative |
|---|---|---|
| Ranked full-text over millions of docs | Core strength — BM25, analyzers, facets | Postgres tsvector works to ~low millions with tuning |
| Meaning-based “similar items” search | dense_vector kNN + hybrid with BM25 | Dedicated vector database |
| Transactional CRUD source of truth | Poor fit — no multi-doc ACID | Postgres, MongoDB |
| Log analytics at petabyte scale | Works; expensive at volume | ClickHouse, BigQuery, Loki |
| License-sensitive deployments | Elastic License 2.0 restrictions on managed offerings | OpenSearch (Apache 2.0 fork) |
| Simple autocomplete on 10k rows | Overkill | Redis or in-app trie |
Elasticsearch sits in the retrieval layer described in information retrieval — pair keyword search here with semantic reranking when queries are vague or cross-lingual.
Common pitfalls
- Dynamic mapping surprises: a field indexed as text when you needed keyword requires reindexing.
- Searching text fields with term queries: the unanalyzed query value rarely equals the analyzed tokens; use keyword subfields.
- Too many small shards: each shard is a Lucene index with overhead; 500 shards on a three-node cluster crawls.
- Deep pagination with from/size: offset 10,000 forces sorting 10,000+ hits per shard; use search_after or scroll (export only).
- Wildcard queries with a leading *: they scan every term in the field; avoid in user-facing search boxes.
- Ignoring cluster yellow/red state: yellow means replica shards are unassigned (reduced redundancy); red means a primary shard is unassigned, so some data is unavailable. A full disk is a common cause.
- No index lifecycle management: log indices grow until disks die; roll over and delete old tiers.
- Treating Elasticsearch as the only database: updates and deletes are soft until merge; source of truth stays elsewhere.
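For the deep-pagination pitfall, search_after replaces the growing offset with the sort values of the last hit on the previous page. The sketch below builds the request bodies only. The product_id tiebreaker field is an assumption (the sort needs a unique field for stable paging), and the helper names and the sample hit are invented.

```python
def first_page(query, page_size):
    """First request: a deterministic sort, no offset."""
    return {
        "size": page_size,
        "query": query,
        # product_id is a hypothetical unique keyword field used as tiebreaker
        "sort": [{"rating": "desc"}, {"product_id": "asc"}],
    }

def next_page(previous_body, last_hit):
    """Next request: same query and sort, resuming after the last hit's sort values."""
    body = dict(previous_body)
    body["search_after"] = last_hit["sort"]
    return body

page1 = first_page({"match": {"title": "headphones"}}, page_size=24)
# Each hit in a sorted response carries a "sort" array; this one is made up.
last_hit = {"_id": "sku-123", "sort": [4.5, "sku-123"]}
page2 = next_page(page1, last_hit)
print(page2["search_after"])  # [4.5, 'sku-123']
```

Each page then costs roughly the same, however deep the user scrolls.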
Production checklist
- Explicit index templates with dynamic: strict or dynamic: false mappings for production indices.
- Shard count planned from expected data volume; test with representative documents.
- At least one replica for production indices; snapshot repository configured (S3, GCS).
- Index Lifecycle Management (ILM) policies for time-series data — hot, warm, delete phases.
- Monitoring: cluster health, JVM heap pressure, indexing latency, search latency, disk watermarks.
- Security: TLS, authentication (API keys or SAML), index-level privileges, no public port 9200.
- Bulk indexing runbook with refresh and replica tuning documented.
- Alias-based index versioning for zero-downtime reindex migrations.
- Query slow-log enabled; top N slow queries reviewed weekly.
- Disaster recovery drill: restore snapshot to staging cluster quarterly.
Key takeaways
- Elasticsearch is a Lucene-backed distributed search engine — optimized for ranked text retrieval and aggregations, not OLTP.
- Mappings and analyzers define how text is tokenized; get them right before bulk indexing.
- Query DSL bool queries combine scored must clauses with cacheable filter clauses.
- Shards and replicas control scale and availability; plan shard size, avoid shard explosion.
- Sync from your primary database via CDC or event streams — Elasticsearch is a search index, not the system of record.
Related reading
- Semantic search explained — embedding-based retrieval that complements keyword BM25
- Information retrieval explained — precision, recall, and ranking foundations
- Apache Kafka explained — event streams that feed search indexes
- Hybrid search explained — combining BM25 keyword scores with vector similarity