MongoDB fundamentals explained
Your product catalog ships with twelve fields; six months later marketing wants per-locale descriptions, variant-level inventory, and embedded review snippets. Migrating a rigid relational schema under traffic is painful. MongoDB stores data as BSON documents in collections — JSON-like records that can evolve field-by-field without ALTER TABLE migrations. That flexibility powers content platforms, IoT telemetry, mobile backends, and catalogs at companies from EA to Forbes. It is not a free pass to skip data modeling: unindexed document scans, unbounded arrays, and wrong embedding choices fail at scale just as surely as bad SQL. This guide covers MongoDB’s document model, CRUD and the aggregation pipeline, indexes, replica sets and read preferences, sharding, multi-document transactions, schema design patterns (embedding vs referencing), consistency trade-offs in the CAP spectrum, an e-commerce catalog worked example, when MongoDB beats PostgreSQL, common pitfalls, and a production checklist.
What MongoDB is
MongoDB is a document-oriented database — the leading member of the NoSQL family alongside key-value, wide-column, and graph stores. Each record is a self-contained document (typically one BSON object) stored in a collection (analogous to a table). Documents in the same collection need not share identical fields; MongoDB is schema-flexible, not schema-less — application code and validation rules still define what “valid” means.
BSON and the document model
BSON (Binary JSON) is a binary encoding of JSON-like documents. It keeps JSON's
arrays and nested sub-documents and adds types JSON lacks, such as ObjectId,
Date, Decimal128, and binary data. Every document has an _id field (an
auto-generated ObjectId if omitted) that is unique within its collection and serves
as the primary key. Nested structures let you represent one-to-few relationships
inline (a user document with an embedded address array) without JOIN tables.
Hierarchy: deployment, database, collection
- Deployment: a standalone mongod process, a replica set, or a sharded cluster.
- Database: logical namespace (e.g. shop) holding collections.
- Collection: named bucket of documents (e.g. products).
- Document: one BSON record, up to 16 MB per document.
Applications connect via drivers (Node, Python, Go, etc.) or the
mongosh shell using a connection URI. Atlas is MongoDB’s managed
cloud; self-hosted replica sets on a VPS remain common for cost control.
CRUD and the query language
MongoDB queries use a JSON-style filter syntax rather than SQL strings:
- Create: insertOne, insertMany; upserts via updateOne with upsert: true.
- Read: find with filters, projections (field selection), sort, skip, limit.
- Update: updateOne / updateMany with operators ($set, $inc, $push, $pull).
- Delete: deleteOne, deleteMany.
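The four operations above can be sketched with the Node.js driver roughly as follows. This is a minimal illustration, not a reference implementation: `products` is assumed to be a driver `Collection`, and the flat `priceCents`, `tags`, and `views` fields are hypothetical.

```javascript
// Sketch only: `products` is assumed to be a Node.js driver Collection,
// e.g. client.db("shop").collection("products"). Field names are hypothetical.

// Filter: apparel that is either priced above 20.00 or tagged "winter".
const filter = {
  category: "apparel",
  $or: [{ priceCents: { $gt: 2000 } }, { tags: { $in: ["winter"] } }],
};

async function crudRoundTrip(products) {
  // Create
  await products.insertOne({
    sku: "HOODIE-BLK-M", category: "apparel", priceCents: 4999, tags: ["winter"],
  });

  // Read: filter + projection + sort + limit
  const found = await products
    .find(filter, { projection: { sku: 1, priceCents: 1, _id: 0 } })
    .sort({ priceCents: -1 })
    .limit(10)
    .toArray();

  // Update: $set a field, $inc a counter, $push onto an array; insert if missing
  await products.updateOne(
    { sku: "HOODIE-BLK-M" },
    { $set: { category: "outerwear" }, $inc: { views: 1 }, $push: { tags: "cotton" } },
    { upsert: true }
  );

  // Delete
  await products.deleteOne({ sku: "HOODIE-BLK-M" });
  return found;
}
```

Filters and update documents are plain objects, so they can be built, logged, and unit-tested without a running database.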
Query operators include comparison ($gt, $in), logical
($and, $or), element ($exists), and array
($elemMatch). The aggregation pipeline is MongoDB’s
analytics workhorse: a sequence of stages ($match, $group,
$lookup for left-outer-join-like enrichment, $unwind,
$project) that process documents server-side — often replacing
multi-table SQL for reporting.
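As a hedged illustration of such a pipeline, the stages below compute total stock and the cheapest variant per category. The `variants` array and its `stock` and `priceCents` fields are assumptions about the document shape, and the category values are made up.

```javascript
// Sketch: a reporting pipeline expressed as an array of stage documents.
const stockByCategory = [
  // Filter early so an index on `category` can be used.
  { $match: { category: { $in: ["apparel", "footwear"] } } },
  // Emit one document per element of the `variants` array.
  { $unwind: "$variants" },
  // Aggregate per category.
  {
    $group: {
      _id: "$category",
      totalStock: { $sum: "$variants.stock" },
      minPriceCents: { $min: "$variants.priceCents" },
    },
  },
  // Rename _id to a friendlier field and keep the computed totals.
  { $project: { _id: 0, category: "$_id", totalStock: 1, minPriceCents: 1 } },
  { $sort: { totalStock: -1 } },
];

// Assumed usage with a Node.js driver Collection named `products`.
async function runStockReport(products) {
  return products.aggregate(stockByCategory).toArray();
}
```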
For ad-hoc relational-style queries, $lookup joins another collection
at read time. It is convenient but not free: prefer embedding or denormalization
when join patterns are hot-path.
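A read-time join of that kind might look like the sketch below. The `orders` and `products` collections, the shared `sku` field, and the `status` value are hypothetical.

```javascript
// Sketch: enrich paid orders with their product document at read time.
const ordersWithProduct = [
  { $match: { status: "paid" } },
  {
    $lookup: {
      from: "products",     // collection to join against
      localField: "sku",    // field on the order
      foreignField: "sku",  // field on the product
      as: "product",        // output array of matching products
    },
  },
  // Flatten the one-element array; keep orders with no matching product
  // so the result stays left-outer.
  { $unwind: { path: "$product", preserveNullAndEmptyArrays: true } },
];

// Assumed usage: await orders.aggregate(ordersWithProduct).toArray();
```

An index on the foreign field (`sku` on `products` here) matters, because the lookup runs once per input document.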
Indexes: the non-negotiable performance layer
Without indexes, MongoDB performs a collection scan, reading every document. That is acceptable for tiny collections and catastrophic for millions of documents. Indexes are B-tree structures (in the WiredTiger storage engine) keyed by field values, much like relational indexes.
Index types that matter
- Single-field: { sku: 1 } for equality lookups.
- Compound: { category: 1, price: -1 }; field order and prefix rules apply (ESR: Equality, Sort, Range).
- Multikey: created automatically when an indexed field holds an array.
- Text: full-text search on string fields (lighter than dedicated search engines).
- Geospatial: 2dsphere for GeoJSON point and polygon queries.
- Partial: index only documents matching a filter (e.g. active listings).
- TTL: expire documents a set interval after a date field's value (sessions, cache entries).
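The index types above map onto `createIndex` calls roughly as in this sketch. The collection names (`products`, `stores`, `sessions`) and the fields are hypothetical, and `db` is assumed to be a Node.js driver `Db` handle.

```javascript
// Sketch: one spec per index type, as (collection, keys, options).
const indexSpecs = [
  { coll: "products", keys: { sku: 1 }, options: { unique: true } },   // single-field
  { coll: "products", keys: { category: 1, price: -1 }, options: {} }, // compound
  { coll: "products", keys: { tags: 1 }, options: {} },                // multikey if `tags` is an array
  { coll: "products", keys: { title: "text" }, options: {} },          // text
  { coll: "stores", keys: { location: "2dsphere" }, options: {} },     // geospatial
  {
    coll: "products",
    keys: { category: 1 },
    options: { partialFilterExpression: { active: true } },            // partial
  },
  { coll: "sessions", keys: { createdAt: 1 }, options: { expireAfterSeconds: 3600 } }, // TTL
];

async function ensureIndexes(db) {
  for (const { coll, keys, options } of indexSpecs) {
    await db.collection(coll).createIndex(keys, options);
  }
}
```

Keeping index definitions in one declarative list makes index sprawl easier to review.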
Use explain("executionStats") on queries in staging — watch
totalDocsExamined vs nReturned. The ratio should be
close to 1:1 for indexed point queries. Every index slows writes and consumes RAM;
index sprawl is a real ops problem.
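A small helper can turn that ratio into a number you can alert on. The sketch below assumes the explain output has the usual `executionStats` sub-document; the sample values are invented for illustration.

```javascript
// Sketch: documents examined per document returned, from explain("executionStats").
function scanRatio(explainOutput) {
  const { totalDocsExamined, nReturned } = explainOutput.executionStats;
  if (nReturned === 0) {
    // Nothing returned: any examined document is wasted work.
    return totalDocsExamined === 0 ? 0 : Infinity;
  }
  return totalDocsExamined / nReturned;
}

// Invented sample: 1,000 documents read to return 10, a sign of a missing index.
const sample = { executionStats: { totalDocsExamined: 1000, nReturned: 10 } };
console.log(scanRatio(sample)); // 100

// Assumed usage with the Node.js driver:
//   const plan = await products.find({ sku: "HOODIE-BLK-M" }).explain("executionStats");
//   if (scanRatio(plan) > 10) console.warn("query examines far more than it returns");
```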
Replica sets and consistency
Production MongoDB runs as a replica set: one primary node accepting writes and one or more secondaries replicating the oplog (operation log). If the primary fails, an election promotes a secondary — automatic failover in seconds when configured correctly.
Write concern and read concern
Write concern controls durability: { w: "majority" }
waits until a majority of voting nodes acknowledge the write before returning —
safer, slightly slower. Read concern and
read preference control what you see:
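- primary (default): reads go to the primary and see its latest data. Truly linearizable reads additionally require the linearizable read concern.
- secondary / secondaryPreferred: scale read traffic; accept replication lag.
- majority read concern: avoid reading uncommitted data that could be rolled back after a failover.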
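Because these guarantees are chosen per operation, they usually show up as option objects passed to individual calls. The sketch below assumes a Node.js driver `Collection` named `orders`; the timeout values are arbitrary.

```javascript
// Sketch: per-operation durability and visibility settings.
const durableWrite = { writeConcern: { w: "majority", wtimeoutMS: 5000 } };
const committedRead = { readConcern: { level: "majority" }, readPreference: "primary" };
const lagTolerantRead = { readPreference: "secondaryPreferred", maxTimeMS: 2000 };

// Wait for a majority of voting members before acknowledging the order.
async function placeOrder(orders, order) {
  return orders.insertOne(order, durableWrite);
}

// Only see data that a majority has acknowledged (cannot be rolled back).
async function readOrder(orders, orderId) {
  return orders.findOne({ _id: orderId }, committedRead);
}

// Dashboards can tolerate replication lag in exchange for offloading the primary.
async function listRecentOrders(orders) {
  return orders.find({}, lagTolerantRead).sort({ createdAt: -1 }).limit(50).toArray();
}
```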
MongoDB sits in the “configurable consistency” camp of the CAP theorem: you choose per-operation guarantees rather than inheriting one global mode. Financial ledgers that need strict cross-row invariants often still prefer PostgreSQL; MongoDB shines when document boundaries match your consistency unit.
Sharding: horizontal scale
When a single replica set exhausts disk, RAM, or write throughput, MongoDB
shards data across multiple replica sets. A mongos
router directs queries to the correct shard based on the shard key
— an indexed field (or compound key) hashed or ranged across chunks.
Shard key choice is painful to change: resharding an existing collection (available since MongoDB 5.0) is a heavyweight operation, so treat the key as close to permanent. A monotonic key (auto-increment style) creates hot shards; a high-cardinality key with good distribution (user ID, tenant ID) spreads load. See our database sharding guide for general chunk migration and resharding concepts; MongoDB’s balancer moves chunks when they grow uneven.
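The hot-shard effect is easy to see in a toy simulation. The sketch below is not MongoDB's chunk logic or its real hash function; it assumes four shards, fixed range boundaries chosen for older keys, and a stand-in multiplicative hash.

```javascript
// Toy model: where do 1,000 new inserts with increasing keys land?
const SHARDS = 4;

// Ranged placement: boundaries were chosen for the keys seen so far (0..999).
function rangedShard(key, boundaries = [250, 500, 750]) {
  return boundaries.filter((b) => key >= b).length;
}

// Hashed placement: a stand-in multiplicative hash, NOT MongoDB's hash function.
function hashedShard(key) {
  return Math.imul(key, 2654435761) >>> 30; // top 2 bits give a shard in 0..3
}

function distribute(keys, place) {
  const counts = new Array(SHARDS).fill(0);
  for (const k of keys) counts[place(k)] += 1;
  return counts;
}

// Monotonically increasing keys, like an auto-increment id or a timestamp.
const newKeys = Array.from({ length: 1000 }, (_, i) => 1000 + i);

const rangedCounts = distribute(newKeys, rangedShard); // every insert hits the last shard
const hashedCounts = distribute(newKeys, hashedShard); // inserts spread across all shards
console.log({ rangedCounts, hashedCounts });

// In mongosh, a hashed key is requested with, for example:
//   sh.shardCollection("shop.orders", { customerId: "hashed" })
```

Under the ranged scheme all new writes go to whichever shard owns the top of the key range, which is the throughput plateau described above.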
Transactions and schema design
Multi-document ACID transactions
Since MongoDB 4.0, multi-document transactions across collections in the same replica set behave like familiar BEGIN/COMMIT blocks (with a default 60-second timeout); MongoDB 4.2 extended them to sharded clusters. They carry overhead, so use them for true invariants (debit/credit pairs), not as a crutch for bad document boundaries. Single-document updates are always atomic.
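A debit/credit pair could be sketched with the Node.js driver's session API as below. The `bank.accounts` namespace and the `balanceCents` field are hypothetical, and `client` is assumed to be a connected `MongoClient` on a replica set.

```javascript
// Sketch: move money between two accounts atomically.
async function transfer(client, fromId, toId, amountCents) {
  const accounts = client.db("bank").collection("accounts");
  const session = client.startSession();
  try {
    // withTransaction commits on success and aborts if the callback throws.
    await session.withTransaction(
      async () => {
        // The filter guards against overdrafts: no match means insufficient funds.
        const debit = await accounts.updateOne(
          { _id: fromId, balanceCents: { $gte: amountCents } },
          { $inc: { balanceCents: -amountCents } },
          { session }
        );
        if (debit.modifiedCount !== 1) throw new Error("insufficient funds");

        const credit = await accounts.updateOne(
          { _id: toId },
          { $inc: { balanceCents: amountCents } },
          { session }
        );
        if (credit.matchedCount !== 1) throw new Error("unknown target account");
      },
      { readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }
    );
  } finally {
    await session.endSession();
  }
}
```

Every operation inside the callback must pass `{ session }`; an operation without it runs outside the transaction.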
Embedding vs referencing
The central schema design question:
- Embed when data is read together, changes together, and stays bounded (order line items, user preferences).
- Reference when related data is large, shared across parents, or updated independently (author profile linked from many articles).
- Denormalize selectively — duplicate a product name on order documents to avoid joins on historical reads.
MongoDB’s flexibility does not remove the need for a schema contract: use JSON Schema validation in the database, Mongoose schemas in Node, or Pydantic models in Python so rogue fields do not accumulate.
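Database-level validation might look like the following sketch, which uses a `$jsonSchema` validator. The required fields and the `maxItems` bound are illustrative assumptions, and `db` is assumed to be a Node.js driver `Db` handle.

```javascript
// Sketch: a schema contract enforced by the server on insert and update.
const productValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["sku", "category", "variants"],
    properties: {
      sku: { bsonType: "string", description: "stock keeping unit, required" },
      category: { bsonType: "string" },
      variants: {
        bsonType: "array",
        maxItems: 200, // keeps the embedded array bounded
        items: {
          bsonType: "object",
          required: ["size", "stock", "priceCents"],
          properties: {
            size: { bsonType: "string" },
            // Integer BSON types only; a double such as 4.5 would be rejected.
            stock: { bsonType: ["int", "long"], minimum: 0 },
            priceCents: { bsonType: ["int", "long"], minimum: 0 },
          },
        },
      },
    },
  },
};

async function createProductsCollection(db) {
  return db.createCollection("products", {
    validator: productValidator,
    validationAction: "error", // reject invalid writes instead of only logging them
  });
}
```

Fields not listed under `properties` are still allowed here, so the shape can keep evolving while the core contract is enforced.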
Worked example: e-commerce product catalog
An online marketplace stores products with variants (size/color), per-locale titles, and rolling inventory counts. Relational modeling needs product, variant, locale, and inventory tables with four-way joins on every listing page. A MongoDB approach:
```javascript
{
  "_id": ObjectId("..."),
  "sku": "HOODIE-BLK-M",
  "category": "apparel",
  "locales": {
    "en-US": { "title": "Black Hoodie", "description": "..." },
    "de-DE": { "title": "Schwarzer Hoodie", "description": "..." }
  },
  "variants": [
    { "size": "M", "color": "black", "stock": 42, "priceCents": 4999 },
    { "size": "L", "color": "black", "stock": 17, "priceCents": 4999 }
  ],
  "tags": ["winter", "cotton"],
  "updatedAt": ISODate("2026-06-08T10:00:00Z")
}
```
Indexes: compound { category: 1, "variants.priceCents": 1 } for
category browse sorted by price; text index on locales.en-US.title
for search; unique sparse index on sku. Listing pages read one
document per product, with no joins. Inventory decrements use
updateOne({ sku, variants: { $elemMatch: { size: "M", stock: { $gt: 0 } } } }, { $inc: { "variants.$.stock": -1 } }).
The $elemMatch makes the size and stock > 0 conditions apply to the same array
element, so the positional $ updates the variant that matched and stock never
drops below zero. Order history embeds a snapshot of title
and price at purchase time in a separate orders collection.
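The guarded inventory decrement can be sketched as a small helper. It assumes `products` is a Node.js driver `Collection` holding documents shaped like the example above; the function names are hypothetical.

```javascript
// Sketch: reserve one unit of a variant only if stock remains.
function reserveFilter(sku, size) {
  // $elemMatch requires ONE array element to satisfy both conditions,
  // so the positional `$` in the update refers to that same element.
  return { sku, variants: { $elemMatch: { size, stock: { $gt: 0 } } } };
}

async function reserveOne(products, sku, size) {
  const result = await products.updateOne(
    reserveFilter(sku, size),
    {
      $inc: { "variants.$.stock": -1 },
      $currentDate: { updatedAt: true }, // server-side timestamp
    }
  );
  // modifiedCount is 0 when the SKU, the size, or remaining stock did not match.
  return result.modifiedCount === 1;
}
```

Because the check and the decrement happen in one single-document update, two concurrent buyers cannot both take the last unit.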
When to choose MongoDB
| Scenario | MongoDB | PostgreSQL / SQL |
|---|---|---|
| Evolving document shapes (CMS, catalogs) | Strong fit | JSONB helps; migrations still needed for constraints |
| Heavy multi-table joins, reporting | Weaker; aggregation or ETL to warehouse | Strong fit |
| Strict financial ledger, complex constraints | Possible with transactions; not default choice | Strong fit |
| High write throughput, time-series bursts | Strong with TTL and sharding | Good; TimescaleDB extension for time-series |
| Geospatial queries on documents | Native 2dsphere indexes | PostGIS extension |
| Team knows SQL only | Learning curve on aggregation | Lower friction |
Many teams run both: Postgres for accounts, billing, and relational truth; MongoDB for product content and user-generated blobs. Polyglot persistence beats forcing one engine to do everything.
Common pitfalls
- No indexes on hot queries — full collection scans under production load.
- Unbounded arrays: embedding millions of log entries in one document runs into the 16 MB limit, and every update to that document gets more expensive as it grows.
- Monotonic shard keys — all writes hit one shard; throughput plateaus.
- Reading stale secondaries for “read your writes” UX without causal consistency settings.
- Schema anarchy — every service writes different field names; queries break silently.
- Overusing $lookup — reintroducing join costs the document model was meant to avoid.
- Ignoring connection pooling — opening a new connection per HTTP request exhausts file descriptors.
- Missing backups and PITR — replica sets are not backups; test restore drills.
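For the stale-secondary pitfall, one hedged approach is a causally consistent session combined with majority read and write concerns. In this sketch the `app.users` namespace and the `displayName` field are hypothetical, and `client` is assumed to be a connected `MongoClient`.

```javascript
// Sketch: "read your own write" even when the read is served by a secondary.
async function renameAndReload(client, userId, displayName) {
  const users = client.db("app").collection("users");
  // Operations in a causally consistent session are ordered relative to each other.
  const session = client.startSession({ causalConsistency: true });
  try {
    await users.updateOne(
      { _id: userId },
      { $set: { displayName } },
      { session, writeConcern: { w: "majority" } }
    );
    // The secondary waits until it has replicated the write above before answering.
    return await users.findOne(
      { _id: userId },
      { session, readPreference: "secondaryPreferred", readConcern: { level: "majority" } }
    );
  } finally {
    await session.endSession();
  }
}
```

Without the shared session, the follow-up read could hit a lagging secondary and show the old name.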
Production checklist
- Replica set with at least three voting members (or Atlas M10+ with backups).
- Write concern majority for durability-critical paths.
- Indexes defined before launch; explain() reviewed on top ten queries.
- Schema validation rules enforced at the collection level.
- Connection pool sized per app instance (driver defaults are rarely optimal).
- Monitoring: opcounters, replication lag, cache eviction, slow query log.
- Backup policy with tested point-in-time recovery.
- Shard key chosen with distribution analysis before sharding (not as first resort).
- Secrets in env vars; TLS enabled for client connections.
- Runbook for failover, step-down, and rolling upgrades documented.
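Several checklist items (pool sizing, majority writes, TLS, secrets in env vars) come together in client configuration. The sketch below is one possible starting point: the `MONGODB_URI` variable name and every numeric value are assumptions to tune per deployment.

```javascript
// Sketch: connection settings for one app instance.
const clientOptions = {
  maxPoolSize: 50,                    // upper bound on pooled connections per instance
  minPoolSize: 5,                     // keep a few connections warm
  maxIdleTimeMS: 60000,               // close connections idle for a minute
  serverSelectionTimeoutMS: 5000,     // fail fast when no suitable member is reachable
  tls: true,                          // encrypt client connections
  writeConcern: { w: "majority" },    // default durability for writes
  retryWrites: true,
};

// Read the connection string from the environment instead of hard-coding secrets.
function uriFromEnv(env = process.env) {
  const uri = env.MONGODB_URI; // hypothetical variable name
  if (!uri) throw new Error("MONGODB_URI is not set");
  return uri;
}

// Assumed usage, creating ONE client per process and reusing it across requests:
//   const { MongoClient } = require("mongodb");
//   const client = new MongoClient(uriFromEnv(), clientOptions);
```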
Key takeaways
- MongoDB stores flexible BSON documents in collections — strong for evolving schemas and document-shaped workloads.
- Indexes and schema design matter as much as in SQL; flexibility is not a substitute for modeling.
- Replica sets provide HA; write/read concern tune consistency vs latency.
- Sharding scales horizontally when shard keys distribute load evenly.
- Pair MongoDB with Postgres or a warehouse when relational integrity and analytics diverge.
Related reading
- PostgreSQL fundamentals explained — when relational ACID, joins, and JSONB are the better fit
- Database indexing explained — B-tree concepts shared across SQL and document stores
- Database sharding explained — shard keys, chunk migration, and hot-spot avoidance
- CAP theorem explained — tuning MongoDB read/write concern for your consistency needs