Guide

MongoDB fundamentals explained

Your product catalog ships with twelve fields; six months later marketing wants per-locale descriptions, variant-level inventory, and embedded review snippets. Migrating a rigid relational schema under traffic is painful. MongoDB stores data as BSON documents in collections — JSON-like records that can evolve field-by-field without ALTER TABLE migrations. That flexibility powers content platforms, IoT telemetry, mobile backends, and catalogs at companies from EA to Forbes. It is not a free pass to skip data modeling: unindexed document scans, unbounded arrays, and wrong embedding choices fail at scale just as surely as bad SQL. This guide covers MongoDB’s document model, CRUD and the aggregation pipeline, indexes, replica sets and read preferences, sharding, multi-document transactions, schema design patterns (embedding vs referencing), consistency trade-offs in the CAP spectrum, an e-commerce catalog worked example, when MongoDB beats PostgreSQL, common pitfalls, and a production checklist.

What MongoDB is

MongoDB is a document-oriented database — the leading member of the NoSQL family alongside key-value, wide-column, and graph stores. Each record is a self-contained document (typically one BSON object) stored in a collection (analogous to a table). Documents in the same collection need not share identical fields; MongoDB is schema-flexible, not schema-less — application code and validation rules still define what “valid” means.

BSON and the document model

BSON (Binary JSON) is a binary encoding of JSON-like documents. It keeps JSON's arrays and nested sub-documents and adds types JSON lacks: ObjectId, Date, Decimal128, and binary data. Every document has an _id field (an auto-generated ObjectId if omitted) that is unique within its collection and serves as the primary key. Nested structures let you represent one-to-few relationships inline, for example a user document with an embedded array of addresses, without JOIN tables.

Hierarchy: deployment, database, collection

Deployment — a standalone mongod process, a replica set, or a sharded cluster.
Database — logical namespace (e.g. shop) holding collections.
Collection — named bucket of documents (e.g. products).
Document — one BSON record, up to 16 MB per document.

Applications connect via drivers (Node, Python, Go, etc.) or the mongosh shell using a connection URI. Atlas is MongoDB’s managed cloud; self-hosted replica sets on a VPS remain common for cost control.

CRUD and the query language

MongoDB queries use a JSON-style filter syntax rather than SQL strings:

Create — insertOne, insertMany; upserts via updateOne with upsert: true.
Read — find with filters, projections (field selection), sort, skip, limit.
Update — updateOne/updateMany with operators ($set, $inc, $push, $pull).
Delete — deleteOne, deleteMany.
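As a rough illustration of the four operation families, the sketch below writes out the argument documents that a driver or mongosh call would receive. The products collection, its field names, and all values are assumptions made for illustration; only the method and operator names are MongoDB's own.

```javascript
// Argument documents for basic CRUD calls, written as plain objects.
// The collection name (products) and every field name here are hypothetical.

// Create: an upsert keyed on sku. $setOnInsert applies only when a new
// document is created by the upsert.
const upsertFilter = { sku: "HOODIE-BLK-M" };
const upsertUpdate = {
  $set: { category: "apparel" },
  $setOnInsert: { createdAt: new Date() },
};
const upsertOptions = { upsert: true };
// mongosh: db.products.updateOne(upsertFilter, upsertUpdate, upsertOptions)

// Read: filter, projection (field selection), sort, and limit.
const findFilter = { category: "apparel", priceCents: { $gt: 2000 } };
const findProjection = { sku: 1, priceCents: 1, _id: 0 };
const findSort = { priceCents: -1 };
// mongosh: db.products.find(findFilter, findProjection).sort(findSort).limit(20)

// Update: operators change fields in place instead of replacing the document.
const restockUpdate = { $inc: { stock: 10 }, $push: { tags: "restocked" } };
// mongosh: db.products.updateMany({ category: "apparel" }, restockUpdate)

// Delete: the same filter syntax selects what to remove.
const deleteFilter = { discontinued: true };
// mongosh: db.products.deleteMany(deleteFilter)

console.log(JSON.stringify({ findFilter, findProjection, restockUpdate }, null, 2));
```

Because filters and updates are ordinary data structures, application code can build and unit-test them without a running server.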

Query operators include comparison ($gt, $in), logical ($and, $or), element ($exists), and array ($elemMatch). The aggregation pipeline is MongoDB’s analytics workhorse: a sequence of stages ($match, $group, $lookup for left-outer-join-like enrichment, $unwind, $project) that process documents server-side — often replacing multi-table SQL for reporting.

For ad-hoc relational-style queries, $lookup joins another collection at read time. It is convenient but not free: prefer embedding or denormalization when join patterns are hot-path.
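To make the stage sequence concrete, here is a hedged sketch of a reporting pipeline written as a plain array. It assumes a hypothetical orders collection whose documents carry a status field and an items array of { sku, qty, priceCents }, plus a products collection keyed by sku; none of these names come from a real schema.

```javascript
// Revenue per product category, expressed as an aggregation pipeline.
// Collection and field names (orders, products, items, status) are hypothetical.
const revenueByCategory = [
  // Filter first so later stages process fewer documents.
  { $match: { status: "paid" } },
  // One output document per line item.
  { $unwind: "$items" },
  // Left-outer-join-like enrichment: attach matching product documents.
  {
    $lookup: {
      from: "products",
      localField: "items.sku",
      foreignField: "sku",
      as: "product",
    },
  },
  // Flatten the joined array; line items with no matching product are dropped here.
  { $unwind: "$product" },
  {
    $group: {
      _id: "$product.category",
      revenueCents: { $sum: { $multiply: ["$items.qty", "$items.priceCents"] } },
      orders: { $addToSet: "$_id" },
    },
  },
  {
    $project: {
      _id: 0,
      category: "$_id",
      revenueCents: 1,
      orderCount: { $size: "$orders" },
    },
  },
  { $sort: { revenueCents: -1 } },
];

// mongosh: db.orders.aggregate(revenueByCategory)
console.log(revenueByCategory.map((stage) => Object.keys(stage)[0]).join(" -> "));
```

The $lookup stage runs once per unwound line item, which is exactly the read-time join cost that embedding or denormalization avoids on hot paths.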

Indexes: the non-negotiable performance layer

Without indexes, MongoDB performs a collection scan, reading every document. That is acceptable for tiny collections and catastrophic for millions of documents. Indexes are B-tree structures (in the default WiredTiger storage engine) keyed by field values, much like relational indexes.

Index types that matter

Single-field — { sku: 1 } for equality lookups.
Compound — { category: 1, price: -1 }; order and prefix rules apply (ESR: Equality, Sort, Range).
Multikey — automatically created when indexing array fields.
Text — full-text search on string fields (lighter than dedicated search engines).
Geospatial — 2dsphere for GeoJSON point and polygon queries.
Partial — index only documents matching a filter (e.g. active listings).
TTL — expire documents after a timestamp (sessions, cache rows).
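The index types above map onto key patterns and options roughly as follows. This is an illustrative plan only: the collections and fields (products, stores, listings, sessions, and so on) are assumptions, while the option names (unique, partialFilterExpression, expireAfterSeconds) are standard createIndex options.

```javascript
// One entry per index type from the list above.
// All collection and field names are hypothetical.
const indexPlan = [
  // Single-field, unique: equality lookups by sku.
  { collection: "products", keys: { sku: 1 }, options: { unique: true } },
  // Compound, ordered by ESR: equality (category), sort (priceCents), range (stock).
  { collection: "products", keys: { category: 1, priceCents: -1, stock: 1 }, options: {} },
  // Becomes a multikey index if tags holds arrays.
  { collection: "products", keys: { tags: 1 }, options: {} },
  // Text index for basic full-text search.
  { collection: "products", keys: { title: "text" }, options: {} },
  // Geospatial index over a GeoJSON field.
  { collection: "stores", keys: { location: "2dsphere" }, options: {} },
  // Partial index: only documents with active: true are indexed.
  {
    collection: "listings",
    keys: { sellerId: 1 },
    options: { partialFilterExpression: { active: true } },
  },
  // TTL index: documents expire about one hour after createdAt.
  { collection: "sessions", keys: { createdAt: 1 }, options: { expireAfterSeconds: 3600 } },
];

// Print the equivalent mongosh commands.
for (const { collection, keys, options } of indexPlan) {
  console.log(`db.${collection}.createIndex(${JSON.stringify(keys)}, ${JSON.stringify(options)})`);
}
```

Keeping the plan as data makes it easy to review in code review and to apply from a migration script.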

Use explain("executionStats") on queries in staging — watch totalDocsExamined vs nReturned. The ratio should be close to 1:1 for indexed point queries. Every index slows writes and consumes RAM; index sprawl is a real ops problem.
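The ratio check can be automated with a small helper. The sketch below reads the executionStats section of an explain result; the two sample results are made-up numbers for illustration, not output captured from a real deployment.

```javascript
// Documents examined per document returned, from an explain("executionStats") result.
function scanRatio(explainOutput) {
  const stats = explainOutput.executionStats;
  if (stats.nReturned === 0) {
    // Nothing returned: any examined document is wasted work.
    return stats.totalDocsExamined === 0 ? 1 : Infinity;
  }
  return stats.totalDocsExamined / stats.nReturned;
}

// Hypothetical results: an indexed point query and an unindexed collection scan.
const indexedQuery = {
  executionStats: { nReturned: 1, totalKeysExamined: 1, totalDocsExamined: 1 },
};
const collectionScan = {
  executionStats: { nReturned: 1, totalKeysExamined: 0, totalDocsExamined: 250000 },
};

console.log("indexed ratio:", scanRatio(indexedQuery));
console.log("scan ratio:", scanRatio(collectionScan));
```

A staging job could run this over the top queries and flag any ratio above a threshold the team chooses.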

Replica sets and consistency

Production MongoDB runs as a replica set: one primary node accepting writes and one or more secondaries replicating the oplog (operation log). If the primary fails, an election promotes a secondary — automatic failover in seconds when configured correctly.

Write concern and read concern

Write concern controls durability: { w: "majority" } waits until a majority of voting nodes acknowledge the write before returning — safer, slightly slower. Read concern and read preference control what you see:

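primary (default read preference) — reads go to the primary and see its latest state. The default read concern is "local", so truly linearizable reads also require read concern "linearizable".
secondary / secondaryPreferred — scale read traffic; accept replication lag.
majority read concern — return only majority-acknowledged data, which cannot be rolled back after a failover.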

MongoDB sits in the “configurable consistency” camp of the CAP theorem: you choose per-operation guarantees rather than inheriting one global mode. Financial ledgers that need strict cross-row invariants often still prefer PostgreSQL; MongoDB shines when document boundaries match your consistency unit.
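These per-operation choices can also be set as connection-level defaults through URI options. The sketch below assembles two connection strings; the host names, the rs0 replica set name, and the shop database are placeholders, while w, readConcernLevel, and readPreference are standard connection string options.

```javascript
// Build a mongodb:// URI from hosts, a database name, and query options.
function buildUri(hosts, database, options) {
  const query = new URLSearchParams(options).toString();
  return `mongodb://${hosts.join(",")}/${database}?${query}`;
}

// Placeholder hosts for a three-member replica set.
const hosts = ["db0.example.net:27017", "db1.example.net:27017", "db2.example.net:27017"];

// Durability-critical path: majority writes, majority reads, read from the primary.
const checkoutUri = buildUri(hosts, "shop", {
  replicaSet: "rs0",
  w: "majority",
  readConcernLevel: "majority",
  readPreference: "primary",
});

// Read-heavy path that tolerates replication lag: prefer secondaries.
const browseUri = buildUri(hosts, "shop", {
  replicaSet: "rs0",
  readPreference: "secondaryPreferred",
});

console.log(checkoutUri);
console.log(browseUri);
```

Drivers also accept the same settings per operation or per collection handle, which is usually the finer-grained place to apply them.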

Sharding: horizontal scale

When a single replica set exhausts disk, RAM, or write throughput, MongoDB shards data across multiple replica sets. A mongos router directs queries to the correct shard based on the shard key — an indexed field (or compound key) hashed or ranged across chunks.

Shard key choice is hard to undo: before MongoDB 5.0 it was effectively permanent, and live resharding since then remains a heavyweight operation. A monotonic key (auto-increment style) under ranged sharding creates hot shards; a high-cardinality key with good distribution (user ID, tenant ID) spreads load. See our database sharding guide for general chunk migration and resharding concepts; MongoDB’s balancer moves chunks between shards when their distribution becomes uneven.
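The hot-shard effect is easy to see in a toy simulation. The sketch below routes 1,000 new, ever-increasing keys to four shards, first by fixed range boundaries and then by a hash of the key. The boundaries are invented, and FNV-1a is only a stand-in hash for illustration; it is not the hash function MongoDB uses for hashed shard keys.

```javascript
const SHARDS = 4;

// Ranged sharding: split points taken from existing data (keys 0..9999).
// The last range is open-ended, so every larger key falls into it.
const boundaries = [2500, 5000, 7500];
function rangedShard(key) {
  let shard = 0;
  while (shard < boundaries.length && key >= boundaries[shard]) shard++;
  return shard;
}

// Stand-in 32-bit FNV-1a hash (illustrative only, not MongoDB's hash).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}
function hashedShard(key) {
  return fnv1a(String(key)) % SHARDS;
}

// Count how many keys each shard receives under a routing function.
function distribution(route, keys) {
  const counts = new Array(SHARDS).fill(0);
  for (const key of keys) counts[route(key)]++;
  return counts;
}

// New inserts with monotonically increasing keys.
const newKeys = Array.from({ length: 1000 }, (_, i) => 10000 + i);
const rangedCounts = distribution(rangedShard, newKeys);
const hashedCounts = distribution(hashedShard, newKeys);

console.log("ranged:", rangedCounts); // every insert lands on the last shard
console.log("hashed:", hashedCounts); // inserts are spread over all four shards
```

Hashing fixes write distribution but gives up efficient range queries on the key, which is why the choice depends on the query patterns as well as the write pattern.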

Transactions and schema design

Multi-document ACID transactions

Since MongoDB 4.0, multi-document transactions across collections in the same replica set behave like familiar BEGIN/COMMIT blocks (with a default 60-second lifetime limit); MongoDB 4.2 extended them to sharded clusters. They carry overhead, so use them for true invariants (debit/credit pairs), not as a crutch for bad document boundaries. Single-document updates are always atomic.
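A debit/credit pair might look like the following sketch, written against the session API of the official Node.js driver (startSession, withTransaction, endSession). The client is passed in rather than created here, and the bank database, accounts collection, and balanceCents field are hypothetical names chosen for illustration.

```javascript
// Move amountCents between two account documents inside one transaction.
// `client` is expected to be a connected MongoClient; database, collection,
// and field names are assumptions for this example.
async function transfer(client, fromId, toId, amountCents) {
  const accounts = client.db("bank").collection("accounts");
  const session = client.startSession();
  try {
    await session.withTransaction(
      async () => {
        // Guarded debit: matches only if the balance covers the amount.
        const debit = await accounts.updateOne(
          { _id: fromId, balanceCents: { $gte: amountCents } },
          { $inc: { balanceCents: -amountCents } },
          { session }
        );
        if (debit.matchedCount !== 1) {
          // Throwing inside the callback aborts the transaction.
          throw new Error("insufficient funds or unknown account");
        }
        // Credit runs in the same transaction, so both writes commit or neither does.
        await accounts.updateOne(
          { _id: toId },
          { $inc: { balanceCents: amountCents } },
          { session }
        );
      },
      { writeConcern: { w: "majority" } }
    );
  } finally {
    await session.endSession();
  }
}
```

Every operation inside the callback must receive the session option; a write issued without it runs outside the transaction.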

Embedding vs referencing

The central schema design question:

Embed when data is read together, changes together, and stays bounded (order line items, user preferences).
Reference when related data is large, shared across parents, or updated independently (author profile linked from many articles).
Denormalize selectively — duplicate a product name on order documents to avoid joins on historical reads.
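The three choices above can sit side by side in one small data model. The documents below are invented examples (IDs, names, and prices are placeholders): an order embeds its bounded line items with a denormalized title and price snapshot, while articles reference a shared author by ID.

```javascript
// Embed: line items are bounded, read with the order, and change with it.
// Denormalize: title and priceCents are copied at purchase time, so
// historical reads need no join and do not change when the product does.
const order = {
  _id: "order-1001",
  customerId: "cust-42", // reference: the customer lives in its own collection
  items: [
    { sku: "HOODIE-BLK-M", title: "Black Hoodie", priceCents: 4999, qty: 2 },
    { sku: "CAP-RED", title: "Red Cap", priceCents: 1500, qty: 1 },
  ],
};

// Reference: one author profile is shared by many articles and is
// updated independently of them.
const author = { _id: "author-7", name: "A. Writer" };
const articles = [
  { _id: "article-1", title: "First post", authorId: author._id },
  { _id: "article-2", title: "Second post", authorId: author._id },
];

// The order total comes from the embedded snapshot alone.
function orderTotalCents(orderDoc) {
  return orderDoc.items.reduce((sum, item) => sum + item.priceCents * item.qty, 0);
}

console.log(orderTotalCents(order)); // 11498
console.log(articles.filter((a) => a.authorId === author._id).length); // 2
```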

MongoDB’s flexibility does not remove the need for a schema contract: use JSON Schema validation in the database, Mongoose schemas in Node, or Pydantic models in Python so rogue fields do not accumulate.
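A database-level contract can be sketched as a $jsonSchema validator. The field list below is an assumption modeled on a product catalog, not a schema taken from a real system; bsonType, required, properties, and additionalProperties are standard $jsonSchema keywords.

```javascript
// Collection validator for a hypothetical products collection.
const productValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["sku", "category", "variants"],
    // Reject fields not listed below, so rogue fields cannot accumulate.
    // _id must be listed explicitly when additionalProperties is false.
    additionalProperties: false,
    properties: {
      _id: { bsonType: "objectId" },
      sku: { bsonType: "string" },
      category: { bsonType: "string" },
      tags: { bsonType: "array", items: { bsonType: "string" } },
      updatedAt: { bsonType: "date" },
      variants: {
        bsonType: "array",
        items: {
          bsonType: "object",
          required: ["size", "stock", "priceCents"],
          properties: {
            size: { bsonType: "string" },
            color: { bsonType: "string" },
            // Integer types only; negative stock or price is rejected.
            stock: { bsonType: ["int", "long"], minimum: 0 },
            priceCents: { bsonType: ["int", "long"], minimum: 0 },
          },
        },
      },
    },
  },
};

// mongosh: db.createCollection("products", { validator: productValidator, validationAction: "error" })
console.log(JSON.stringify(productValidator.$jsonSchema.required));
```

Setting additionalProperties to false is a strict choice; teams that add fields often may prefer to leave it out and validate only the fields they depend on.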

Worked example: e-commerce product catalog

An online marketplace stores products with variants (size/color), per-locale titles, and rolling inventory counts. Relational modeling needs product, variant, locale, and inventory tables with four-way joins on every listing page. A MongoDB approach:

{
  "_id": ObjectId("..."),
  "sku": "HOODIE-BLK-M",
  "category": "apparel",
  "locales": {
    "en-US": { "title": "Black Hoodie", "description": "..." },
    "de-DE": { "title": "Schwarzer Hoodie", "description": "..." }
  },
  "variants": [
    { "size": "M", "color": "black", "stock": 42, "priceCents": 4999 },
    { "size": "L", "color": "black", "stock": 17, "priceCents": 4999 }
  ],
  "tags": ["winter", "cotton"],
  "updatedAt": ISODate("2026-06-08T10:00:00Z")
}

Indexes: compound { category: 1, "variants.priceCents": 1 } for category browse sorted by price; text index on locales.en-US.title for search; unique sparse index on sku. Listing pages read one document per product, with no joins. Inventory decrements use updateOne({ sku, "variants.size": "M" }, { $inc: { "variants.$.stock": -1 } }), with the filter extended to ensure that the same variant has stock > 0; placing both conditions inside a single $elemMatch on variants keeps the size and stock checks on one array element. Order history lives in a separate orders collection, where each order embeds a snapshot of title and price at purchase time.
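One way to express the guarded decrement is with $elemMatch, so that the size and stock conditions must hold for the same array element. The sketch below builds the filter and update documents and then uses two plain-JavaScript predicates to emulate how the two filter styles match; the predicates are a simplified emulation for illustration, not MongoDB's query engine, and the sample document is invented.

```javascript
// Build the arguments for a stock-guarded decrement of one variant.
function buildDecrement(sku, size, qty) {
  return {
    // $elemMatch: one variant must satisfy both conditions.
    filter: { sku, variants: { $elemMatch: { size, stock: { $gte: qty } } } },
    // The positional $ refers to the array element matched by the filter.
    update: { $inc: { "variants.$.stock": -qty } },
  };
}
// mongosh: db.products.updateOne(filter, update); matchedCount 0 means out of stock.

// Emulation of separate dotted conditions, such as
// { "variants.size": size, "variants.stock": { $gte: qty } }:
// each condition may be satisfied by a different element.
function matchesIndependently(doc, size, qty) {
  return doc.variants.some((v) => v.size === size) && doc.variants.some((v) => v.stock >= qty);
}

// Emulation of $elemMatch: a single element must satisfy both conditions.
function matchesSameElement(doc, size, qty) {
  return doc.variants.some((v) => v.size === size && v.stock >= qty);
}

// Invented document: size M is sold out, size L is in stock.
const product = {
  sku: "HOODIE-BLK",
  variants: [
    { size: "M", stock: 0 },
    { size: "L", stock: 17 },
  ],
};

console.log(matchesIndependently(product, "M", 1)); // true: wrongly looks purchasable
console.log(matchesSameElement(product, "M", 1)); // false: correctly out of stock
console.log(JSON.stringify(buildDecrement("HOODIE-BLK", "M", 1)));
```

Checking matchedCount on the update result tells the application whether the decrement happened, which avoids a separate read-then-write race.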

When to choose MongoDB

Scenario	MongoDB	PostgreSQL / SQL
Evolving document shapes (CMS, catalogs)	Strong fit	JSONB helps; migrations still needed for constraints
Heavy multi-table joins, reporting	Weaker; aggregation or ETL to warehouse	Strong fit
Strict financial ledger, complex constraints	Possible with transactions; not default choice	Strong fit
High write throughput, time-series bursts	Strong with TTL and sharding	Good; TimescaleDB extension for time-series
Geospatial queries on documents	Native 2dsphere indexes	PostGIS extension
Team knows SQL only	Learning curve on aggregation	Lower friction

Many teams run both: Postgres for accounts, billing, and relational truth; MongoDB for product content and user-generated blobs. Polyglot persistence beats forcing one engine to do everything.

Common pitfalls

No indexes on hot queries — full collection scans under production load.
Unbounded arrays — embedding millions of log entries in one document eventually hits the 16 MB limit, and long before that each append has to rewrite an ever-larger document.
Monotonic shard keys — all writes hit one shard; throughput plateaus.
Reading stale secondaries for “read your writes” UX without causal consistency settings.
Schema anarchy — every service writes different field names; queries break silently.
Overusing $lookup — reintroducing join costs the document model was meant to avoid.
Ignoring connection pooling — opening a new connection per HTTP request exhausts file descriptors.
Missing backups and PITR — replica sets are not backups; test restore drills.

Production checklist

Replica set with at least three voting members (or Atlas M10+ with backups).
Write concern majority for durability-critical paths.
Indexes defined before launch; explain() reviewed on top ten queries.
Schema validation rules enforced at the collection level.
Connection pool sized per app instance (driver defaults are rarely optimal).
Monitoring: opcounters, replication lag, cache eviction, slow query log.
Backup policy with tested point-in-time recovery.
Shard key chosen with distribution analysis before sharding (not as first resort).
Secrets in env vars; TLS enabled for client connections.
Runbook for failover, step-down, and rolling upgrades documented.

Key takeaways

MongoDB stores flexible BSON documents in collections — strong for evolving schemas and document-shaped workloads.
Indexes and schema design matter as much as in SQL; flexibility is not a substitute for modeling.
Replica sets provide HA; write/read concern tune consistency vs latency.
Sharding scales horizontally when shard keys distribute load evenly.
Pair MongoDB with Postgres or a warehouse when relational integrity and analytics diverge.