Elasticsearch fundamentals explained

Your storefront search box returns irrelevant results: users type “waterproof hiking boots” and see unrelated accessories, because LIKE '%boot%' on PostgreSQL cannot rank by relevance, stem plurals, or tolerate typos at millisecond latency. Elasticsearch is a distributed search and analytics engine built on Apache Lucene: it stores inverted indexes that map terms to document IDs, scores hits with BM25, and shards data across nodes for horizontal scale. Organizations such as Netflix and GitHub have used it, and it is the storage and query layer of the Elastic Stack, for product catalogs, log pipelines, and security analytics. Elasticsearch is not a transactional database; it complements OLTP stores, and it supports hybrid search patterns in which dense embeddings join keyword retrieval. This guide covers Lucene inverted indexes, index mappings and analyzers, the bool Query DSL, scoring and relevance tuning, aggregations and facets, shard and replica topology, bulk indexing and ingest pipelines, a Harbor Market e-commerce worked example, a search-engine decision table, common pitfalls, and a production checklist.

What Elasticsearch is

Elasticsearch (ES) is a document-oriented search engine. Each JSON document lives in an index (roughly a table). Documents are identified by _id and routed to a shard via a hash of _routing (default: document ID). A cluster of nodes holds indices; one node is elected master for cluster metadata, while data nodes store shards and execute queries.

Lucene under the hood

Each shard is a Lucene index: a collection of immutable segments on disk. Lucene builds an inverted index: for every term (token), it stores a sorted list of the document IDs that contain the term, plus the positions of the term within each document. A query walks these posting lists, intersects or unions them, and computes a relevance score. Segments merge in the background; deleted documents are only marked as deleted (tombstoned) until a merge compacts them away.
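
The posting-list idea can be sketched in a few lines of Python. This is a toy model, not Lucene's data structure: the whitespace tokenizer and the three sample documents are assumptions made for illustration.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, [positions]) postings."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            postings[term].setdefault(doc_id, []).append(position)
    return {term: sorted(by_doc.items()) for term, by_doc in postings.items()}

def search_and(index, query):
    """Return the doc IDs that contain every query term (list intersection)."""
    doc_sets = [{doc_id for doc_id, _ in index.get(term, [])}
                for term in tokenize(query)]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Hypothetical sample documents keyed by document ID.
docs = {
    1: "waterproof hiking boots",
    2: "hiking socks",
    3: "waterproof jacket",
}
index = build_inverted_index(docs)
matches = search_and(index, "waterproof hiking")
```

Only document 1 contains both terms, so the intersection returns a single ID. A union of the same posting lists would model an OR query.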

Elasticsearch adds distributed coordination: routing queries to the correct shards, merging scores across shards, replication for failover, and a REST API plus Query DSL on top of raw Lucene APIs. Managed offerings (Elastic Cloud, or Amazon OpenSearch Service for the OpenSearch fork) hide node provisioning; self-hosted clusters require JVM heap tuning and disk I/O planning.

Mappings, fields and analyzers

An index mapping defines field names, types, and how text is processed. Getting mappings right at index creation matters — changing analyzers on a live field usually requires reindexing.

Core field types

text — full-text search; analyzed (tokenized) at index and query time unless you specify otherwise.
keyword — exact-match strings for filters, sorting, and aggregations (SKU codes, status enums).
date, long, double, boolean — structured filters and range queries.
nested and object — arrays of objects; use nested when child objects must be queried as independent units.

Analyzers

An analyzer chains a character filter (optional), tokenizer (splits text into tokens), and token filters (lowercase, stop words, stemming). Index-time and search-time analyzers can differ — a common pattern is the standard analyzer at index time and a search_analyzer with synonym expansion at query time.
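
To make the chain concrete, here is a toy analyzer in plain Python. It only mimics the three stages; the stop-word list and the plural stemmer are deliberately naive assumptions and do not reproduce any built-in Elasticsearch analyzer.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "for"}  # assumed, tiny stop list

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)

def word_tokenizer(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Token filter: drop tokens found in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_plural_stem(tokens):
    """Token filter: strip a trailing 's' from words longer than three letters."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def analyze(text):
    """Run the character filter, then the tokenizer, then each token filter in order."""
    tokens = word_tokenizer(strip_html(text))
    for token_filter in (lowercase, remove_stop_words, naive_plural_stem):
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<b>Waterproof</b> Boots for the Trail")
```

The order of token filters matters: the stop list is lowercase, so it must run after the lowercase filter.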

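Use multi-fields when one logical field needs both behaviors: title as text for search and title.keyword for sorting. Synonym filters cover domain vocabulary (“laptop” = “notebook”), and ICU tokenizers improve tokenization of non-English text. Synonyms applied at index time increase index size and require a reindex whenever the list changes; synonyms applied only at search time avoid that reindex.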
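
A possible index definition that combines a multi-field with a search-time synonym analyzer is sketched below as a Python dict, ready to be serialized as the body of an index-creation request. The names gear_synonyms and gear_search are hypothetical, and the single synonym rule is only an example.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical synonym filter with one example rule.
                "gear_synonyms": {
                    "type": "synonym_graph",
                    "synonyms": ["laptop, notebook"],
                }
            },
            "analyzer": {
                # Hypothetical analyzer used only at query time.
                "gear_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "gear_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",           # index-time analysis
                "search_analyzer": "gear_search",  # query-time analysis
                "fields": {"keyword": {"type": "keyword"}},  # for sorting
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```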

Query DSL: bool queries, filters and BM25

Elasticsearch queries are JSON. The bool query combines clauses:

must — required; contributes to score.
filter — required; no score impact (cached bitsets).
should — optional; boosts matching documents.
must_not — exclusion filter.

Put structured predicates (category, price range, in-stock flag) in filter clauses for speed. Put natural-language terms in must or should inside a multi_match query across title^3, description, and brand with field boosts.
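
One way to keep that separation in application code is a small builder that puts user text in must and every structured predicate in filter. The function name and parameters below are assumptions for illustration; the field names follow the ones mentioned above.

```python
def build_product_query(text, category=None, max_price=None, in_stock_only=False):
    """Build a bool query: scored text in must, unscored predicates in filter."""
    filters = []
    if category is not None:
        filters.append({"term": {"category": category}})
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})
    if in_stock_only:
        filters.append({"term": {"in_stock": True}})
    return {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": text,
                        "fields": ["title^3", "description", "brand"],
                    }
                }],
                "filter": filters,
            }
        }
    }

request = build_product_query("waterproof hiking boots",
                              category="footwear", max_price=250)
```

Because the filter clauses never touch the score, adding or removing a facet changes which documents match but not how the remaining ones are ranked against each other.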

BM25 relevance

Default scoring uses BM25 (Best Matching 25): term frequency saturates (repeating a word helps less after a point), rare terms score higher than common ones (inverse document frequency), and shorter fields rank terms more strongly. Tune with similarity settings or replace with a script_score for business rules (boost in-stock, penalize low margin).
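
The three effects can be checked numerically with the textbook BM25 formula. This sketch uses k1 = 1.2 and b = 0.75, the Elasticsearch defaults; Lucene's real implementation differs in details such as how field lengths are encoded, so treat the numbers as illustrative only.

```python
import math

def bm25_idf(total_docs, docs_with_term):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

def bm25_term_score(freq, field_len, avg_field_len, total_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one term in one field using the textbook BM25 formula."""
    # Length normalization: fields longer than average are penalized.
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    # Term frequency saturates: each extra occurrence adds less.
    tf = freq * (k1 + 1) / (freq + norm)
    return bm25_idf(total_docs, docs_with_term) * tf

# Hypothetical corpus statistics: 80,000 docs, term appears in 400 of them.
def score(freq, field_len=10):
    return bm25_term_score(freq, field_len, avg_field_len=10,
                           total_docs=80_000, docs_with_term=400)

gain_first_repeat = score(2) - score(1)
gain_tenth_repeat = score(10) - score(9)
```

Here gain_first_repeat is larger than gain_tenth_repeat, which is the saturation effect. Raising k1 delays saturation, and setting b to 0 switches length normalization off.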

For typo tolerance, add a match query with fuzziness: AUTO or a parallel match_phrase_prefix on autocomplete fields. For semantic similarity beyond keywords, index dense vectors and combine with BM25 in a hybrid retrieval pipeline rather than forcing keyword queries to guess intent.
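
As a rough model of what fuzziness: AUTO permits, the sketch below applies the default length thresholds (terms of 1 to 2 characters must match exactly, 3 to 5 allow one edit, longer terms allow two) to an edit distance that counts an adjacent transposition as a single edit. It ignores options such as prefix_length and max_expansions and is not how Lucene evaluates fuzzy queries internally.

```python
def auto_fuzziness(term):
    """Maximum edits allowed for a term under the default AUTO thresholds."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2

def edit_distance(a, b):
    """Distance with insert, delete, substitute, and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]

def fuzzy_match(query_term, indexed_term):
    """True if the indexed term is within the edits AUTO allows for the query term."""
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)
```

For example, fuzzy_match("hikng", "hiking") is true because one insertion is allowed for a five-character term, while fuzzy_match("at", "cat") is false because two-character terms must match exactly.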

Aggregations, facets and analytics

Aggregations compute analytics without returning hit documents — category counts, price histograms, date histograms for time series. Bucket aggregations group documents; metric aggregations compute stats (avg, percentiles, cardinality) inside buckets.

Faceted navigation on e-commerce sites is typically a terms aggregation on category.keyword with a filter context mirroring the user’s active filters. Log analytics stacks (ELK) lean on date_histogram aggregations per minute alongside structured logs ingested via bulk API.

Aggregations are memory-heavy on high-cardinality fields. Use composite aggregations for deep pagination through bucket keys and set size limits consciously — requesting ten thousand unique user IDs per query can OOM a data node.
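
Paging through buckets with a composite aggregation might look like the following request builder. The function name, the brands aggregation name, and the example brand value are hypothetical; after_key is the value returned with the previous page of buckets.

```python
def composite_brand_page(after_key=None, page_size=100):
    """Build a request for one page of brand buckets, with no hits returned."""
    composite = {
        "size": page_size,  # buckets per page, kept deliberately small
        "sources": [{"brand": {"terms": {"field": "brand"}}}],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
    return {"size": 0, "aggs": {"brands": {"composite": composite}}}

first_page = composite_brand_page()
# In a real loop, after_key would be read from the previous response
# at aggregations.brands.after_key; "Northwind" is a made-up value.
next_page = composite_brand_page(after_key={"brand": "Northwind"})
```

Each request holds at most page_size buckets in memory, which is the point of using composite instead of one very large terms aggregation.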

Sharding, replicas and cluster topology

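Each index splits into primary shards (write path) and optional replica shards (read scaling and failover). The primary shard count is fixed at index creation; changing it later means using the split or shrink APIs (each with constraints) or reindexing, while the replica count can be changed at any time. For time-series data, rollover starts a new index with its own shard count. Too many shards waste heap; too few limit write throughput and make individual shards too large.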

Sizing heuristics

Target shard sizes roughly 10–50 GB for time-series; larger for static catalogs if heap allows.
Keep JVM heap at or below 50% of RAM and below ~32 GB (compressed OOPs threshold).
One replica minimum for production failover; two if you serve heavy read traffic.
Separate hot (SSD, recent data) and warm tiers for logs via index lifecycle management (ILM).
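
The first two heuristics reduce to simple arithmetic. In this sketch the 30 GB target shard size (a midpoint of the range above) and the 31 GB heap ceiling (a value just under the ~32 GB threshold) are assumptions, and both function names are hypothetical.

```python
import math

def primary_shard_count(expected_data_gb, target_shard_gb=30):
    """Number of primary shards needed to stay near the target shard size."""
    return max(1, math.ceil(expected_data_gb / target_shard_gb))

def recommended_heap_gb(node_ram_gb, heap_ceiling_gb=31):
    """Half of RAM, capped below the compressed-OOPs threshold."""
    return min(node_ram_gb / 2, heap_ceiling_gb)

shards_for_logs = primary_shard_count(600)   # 600 GB of expected data
heap_on_big_node = recommended_heap_gb(128)  # 128 GB RAM node
heap_on_small_node = recommended_heap_gb(16) # 16 GB RAM node
```

These are starting points for a benchmark, not a substitute for one: real shard sizes depend on mapping, query mix, and hardware.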

Document routing is, in simplified form, hash(_routing) % num_primary_shards, where _routing defaults to the document _id. Custom _routing co-locates related documents (all orders for one customer) but can create hot shards if routing keys skew. Cross-shard queries merge results on the coordinating node, so a query that fans out to many shards is only as fast as its slowest shard.
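
The hot-shard risk is easy to demonstrate with a simplified router. This sketch uses zlib.crc32 as a stand-in hash (Elasticsearch uses a Murmur3 hash and a slightly more involved formula), and the customer keys are made up.

```python
import zlib
from collections import Counter

def shard_for(routing_key, num_primary_shards):
    """Pick a shard from a hash of the routing key (simplified model)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

def shard_load(routing_keys, num_primary_shards):
    """Count how many documents land on each shard."""
    return Counter(shard_for(key, num_primary_shards) for key in routing_keys)

# Hypothetical skew: 90 of 100 orders belong to a single customer.
skewed_keys = ["customer-1"] * 90 + [f"customer-{i}" for i in range(2, 12)]
load = shard_load(skewed_keys, num_primary_shards=5)
hottest_shard, hottest_count = load.most_common(1)[0]
```

Every document with the same routing key lands on the same shard, so the shard that owns customer-1 receives at least 90 of the 100 documents no matter how many shards the index has.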

Indexing: bulk API, refresh and ingest pipelines

Single-document index requests are fine for low volume. Production backfills use the bulk API: newline-delimited action/metadata lines followed by source JSON. Batch 1,000–5,000 docs per request; tune until you saturate disk without triggering rejections.
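
A bulk request body can be assembled with the standard library alone. The sketch below assumes documents carry a sku field and reuses the product-{sku} ID convention from this guide; the helper names are hypothetical, and sending the payload (with Content-Type application/x-ndjson) is left to whichever HTTP client you use.

```python
import json

def bulk_lines(index_name, docs, id_field="sku"):
    """Serialize docs as NDJSON: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name,
                            "_id": f"product-{doc[id_field]}"}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n" if lines else ""

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical catalog rows.
catalog = [{"sku": f"TB-{n}", "title": f"Trail Boot {n}"} for n in range(2500)]
payloads = [bulk_lines("products", batch) for batch in batches(catalog, 1000)]
```

With 2,500 rows and a batch size of 1,000 this produces three payloads, the last one holding 500 documents.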

Refresh and consistency

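New documents become searchable after a refresh (by default every 1s), which writes a new Lucene segment and opens it for search. Near-real-time search tolerates this delay; bulk ETL jobs should set refresh_interval: -1 during the load and restore it afterwards. Note that wait_for_active_shards is only a pre-flight check: the write waits until the requested number of shard copies is active before it starts, which does not by itself prove that every replica applied the write. When a client must see its own write in search results, index with refresh=wait_for.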
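
Forgetting to restore the refresh interval after a failed load is an easy mistake, so one option is to wrap the load in a context manager. In this sketch put_settings is a hypothetical callable that would send a PUT to the index _settings endpoint; here it is replaced by a recorder so the example runs without a cluster.

```python
from contextlib import contextmanager

@contextmanager
def refresh_paused(put_settings, index_name, restore_to="1s"):
    """Disable refresh for the duration of a bulk load, then restore it."""
    put_settings(index_name, {"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        # Runs even if the load raises, so the index never stays unrefreshed.
        put_settings(index_name, {"index": {"refresh_interval": restore_to}})

calls = []

def record_settings(index_name, body):
    """Stand-in for a real HTTP call; records what would have been sent."""
    calls.append((index_name, body))

with refresh_paused(record_settings, "products"):
    pass  # the bulk load would run here
```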

Ingest pipelines

Ingest pipelines transform documents on ingest nodes before indexing: grok parsing for logs, geoip enrichment, renaming fields, dropping PII. Pipelines decouple producers from mapping details: application code ships raw events and Elasticsearch normalizes them. For change data capture from Postgres, tools such as the Logstash JDBC input or Debezium with a Kafka sink connector re-send a document on every row update, so design deterministic document IDs (product-{sku}) to keep those writes idempotent.
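
A pipeline definition is just JSON with a list of processors. The sketch below defines one with the rename and remove processors and previews its effect with a tiny local simulator. The field names are hypothetical, and the simulator covers only these two processors; for real pipelines the ingest simulate API is the better preview tool.

```python
pipeline = {
    "description": "Normalize raw product events before indexing",
    "processors": [
        {"rename": {"field": "productTitle", "target_field": "title"}},
        {"remove": {"field": "customer_email"}},  # drop PII
    ],
}

def simulate(pipeline, doc):
    """Apply rename/remove processors locally to preview the result."""
    result = dict(doc)  # leave the caller's document untouched
    for processor in pipeline["processors"]:
        (name, config), = processor.items()
        if name == "rename":
            result[config["target_field"]] = result.pop(config["field"])
        elif name == "remove":
            del result[config["field"]]
        else:
            raise ValueError(f"unsupported processor: {name}")
    return result

raw_event = {"productTitle": "Trail Boot",
             "customer_email": "a@example.com",
             "sku": "TB-1"}
clean_event = simulate(pipeline, raw_event)
```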

Worked example: Harbor Market product search

Harbor Market is a mid-size outdoor gear storefront. Catalog data lives in Postgres (orders, inventory). Search must handle 80k SKUs, faceted filters (category, brand, price), typo-tolerant title search, and boost in-stock items.

Mapping sketch

{
  "mappings": {
    "properties": {
      "sku": { "type": "keyword" },
      "title": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "description": { "type": "text" },
      "brand": { "type": "keyword" },
      "category": { "type": "keyword" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "in_stock": { "type": "boolean" },
      "updated_at": { "type": "date" }
    }
  }
}

Search request

POST /products/_search
{
  "query": {
    "bool": {
      "must": [{
        "multi_match": {
          "query": "waterproof hiking boots",
          "fields": ["title^3", "description", "brand^2"],
          "fuzziness": "AUTO"
        }
      }],
      "should": [
        { "term": { "in_stock": { "value": true, "boost": 2.0 } } }
      ],
      "filter": [
        { "term": { "category": "footwear" } },
        { "range": { "price": { "lte": 250 } } }
      ]
    }
  },
  "aggs": {
    "brands": { "terms": { "field": "brand", "size": 20 } }
  },
  "from": 0,
  "size": 24
}

Nightly CDC jobs bulk-index changed rows; a webhook on inventory updates issues single-document upserts by SKU. Category pages use cursor pagination on search_after with sort keys [_score, sku] for stable deep pages. Postgres remains source of truth for checkout; Elasticsearch serves read-only discovery.
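
The search_after cursor for category pages could be built as follows. The helper name is hypothetical; it assumes each hit in the previous response carries a sort array, which Elasticsearch returns whenever the request specifies a sort.

```python
def next_page_request(base_request, last_hit):
    """Build the request for the page that follows last_hit."""
    # Drop any offset: the cursor replaces from-based paging.
    request = {k: v for k, v in base_request.items() if k != "from"}
    # sku is the tiebreaker that keeps ordering stable when scores are equal.
    request["sort"] = [{"_score": "desc"}, {"sku": "asc"}]
    # Sort values of the last hit on the previous page act as the cursor.
    request["search_after"] = last_hit["sort"]
    return request

base_request = {"query": {"term": {"category": "footwear"}}, "from": 0, "size": 24}
# Hypothetical last hit of the first page.
last_hit = {"_id": "product-TB-1", "sort": [3.2, "TB-1"]}
page_two = next_page_request(base_request, last_hit)
```

The cursor is only stable while the index does not change between requests; a point-in-time search is the usual way to pin a consistent view for long paging sessions.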

Elasticsearch vs alternatives

Need	Elasticsearch	Alternative
Full-text search with ranking and facets	Strong fit	PostgreSQL `tsvector` (simpler, same-node OLTP)
Log/metrics time series at cluster scale	Elastic Stack, ILM tiers	ClickHouse, Loki, OpenSearch fork
Transactional CRUD and joins	Wrong tool	PostgreSQL, MySQL
Flexible schema documents, app JSON	Possible but secondary	MongoDB with Atlas Search
Managed vector + keyword hybrid	Dense vector fields + RRF	Pinecone, pgvector with extensions
Sub-millisecond key cache	Not designed for this	Redis

Common pitfalls

Dynamic mapping surprises — Elasticsearch guesses text where you needed keyword; disable dynamic mapping or use templates in production.
Too many shards — hundreds of tiny shards exhaust heap and slow cluster state; consolidate via rollover policies.
Heavy aggregations on high-cardinality fields — unique user IDs as terms agg buckets can crash nodes.
Scoring inside filter context — filters do not affect relevance; business boosts belong in should or function_score.
Ignoring reindex strategy — mapping changes require reindex; plan blue/green index aliases (products_v2 + alias swap).
Treating ES as primary store — replica loss and split-brain incidents happen; keep authoritative data in OLTP.
Unbounded from + size pagination — deep offset pagination is expensive and is capped by index.max_result_window (10,000 hits by default); use search_after for user-facing deep pages and reserve scroll for batch export.
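
The blue/green alias swap mentioned above comes down to one request to the aliases API with two actions, which Elasticsearch applies atomically. This sketch only builds the request body; products_v1 is an assumed name for the old index.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias in a single atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

swap = alias_swap_actions("products", "products_v1", "products_v2")
```

Because applications query the alias rather than a concrete index, the swap needs no client change and can be reversed by sending the same request with the index names exchanged.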

Production checklist

Define explicit index templates and mappings before first production document.
Set shard count from expected data volume; use ILM for time-series indices.
Monitor heap usage, GC pauses, thread pool rejections, and disk watermarks.
Snapshot to S3 (or equivalent) on a schedule; test restore quarterly.
Use index aliases for zero-downtime reindex and versioned mapping changes.
Keep bulk batch sizes and refresh intervals tuned for your ingest SLA.
Secure clusters with TLS, role-based access, and network isolation.
Log slow queries; capture query DSL in APM for relevance debugging.
Cap aggregation size and use composite for deep bucket paging.
Document synonym and analyzer updates with a rehearsed reindex runbook.

Key takeaways

Elasticsearch is a distributed Lucene layer for ranked full-text search and analytics, not OLTP.
Mappings and analyzers determine search quality; plan them before indexing at scale.
Bool Query DSL separates filters (fast, exact) from scored text clauses (BM25).
Shard and replica topology trades write parallelism, failover, and query fan-out.
Bulk indexing, ILM, and aliases are the operational backbone of production search.