Explainer · 7 June 2026

How hash tables and hash maps work

You look up a user by email, a token price by mint address, or a session by cookie ID — and expect an answer in microseconds, not a linear scan through millions of rows. A hash table (often exposed as a hash map or dictionary) is the workhorse that makes that possible. It maps arbitrary keys to values by running each key through a hash function that produces an integer, then using that integer to index into an array of slots. Average-case lookup, insert, and delete are O(1) — constant time — which is why every major language runtime, database engine, and CDN cache layer builds on the same core idea, with different collision strategies and security hardening on top.

The core idea: keys to slots

Imagine a fixed array of buckets numbered 0 through n − 1. To store ("alice", 42), compute h("alice") mod n — say slot 7 — and place the pair there. To retrieve "alice", hash again, jump to slot 7, read 42. No search through unrelated keys.

The hash function's job is to spread keys uniformly across slots so no single bucket hoards most entries. Good functions (MurmurHash, xxHash, SipHash, FNV-1a) avalanche small input changes into large output changes: flipping one bit in the key should flip roughly half the output bits. Bad functions — like taking the first character's ASCII value — cluster similar strings into the same buckets and destroy performance.

Cryptographic hashes (SHA-256) are overkill for in-memory maps: they are slow and designed for unpredictability against adversaries, not speed against random keys. Non-crypto hashes trade some predictability for throughput. Modern runtimes pick per use case, and that choice matters when attackers can choose keys (see hash flooding below).

Collisions: when two keys want the same slot

The pigeonhole principle guarantees collisions: there are infinitely many possible keys but only finitely many slots. h("alice") mod 16 and h("bob") mod 16 can land on the same index. Implementations handle this in two families:

Separate chaining — each bucket holds a linked list (or small tree) of entries that hashed to that index. Insert appends to the list; lookup walks the list comparing keys with == (or a custom equality). Java's HashMap before Java 8 used chaining; when a bucket's list grows past a threshold it converts to a red-black tree for O(log n) worst case within that bucket.

Open addressing — every entry lives directly in the array. On collision, probe forward to the next empty slot using a fixed sequence: linear probing (slot + 1, + 2, …), quadratic probing (+1, +4, +9, …), or double hashing (a second hash function defines the step size). Python 3.7+ dicts and Rust's default HashMap use open addressing variants. Robin Hood hashing improves fairness by stealing slots from "rich" entries (those far from their ideal index) to keep maximum probe distance low.

Neither family is universally superior. Chaining tolerates high load factors but adds pointer overhead. Open addressing is cache-friendly (contiguous memory) but degrades sharply if the table fills up — which is why load factor caps matter.

Load factor and rehashing

Load factor α = entries / slots. As α approaches 1, collisions multiply. Most implementations trigger rehashing when α crosses a threshold (often 0.75): allocate a larger array (typically 2×), recompute every key's index modulo the new size, and copy entries. Amortized over many inserts, each operation stays O(1) average, but a single resize can pause the process for milliseconds — visible in latency spikes on hot paths.

You cannot always pre-size perfectly, but if you know you will store 10,000 entries, requesting an initial capacity of ~13,000 (α ≈ 0.75) avoids several resize rounds. Go maps and Java HashMap constructors accept hints; Python dicts grow heuristically as you insert.

Deletion in open-addressed tables is subtle: clearing a slot to "empty" breaks probe chains for keys that wrapped past it. Standard fix: mark slots as tombstones (deleted but probe-skippable) or run occasional compaction. Chaining simply unlinks the node.

What makes a key hashable

Keys must be immutable while in the table — if a key's hash changes after insertion, you lose it forever. That is why Python lists cannot be dict keys but tuples can (if their elements are hashable). Java requires consistent hashCode() and equals(). Rust demands Hash + Eq.

For composite keys (country + zip code), either hash the concatenation with a delimiter or use a tuple type the runtime already knows how to hash. Rolling your own string concatenation without a separator can collide distinct pairs: ("ab", "c") vs ("a", "bc").

Floating-point keys are legal but fragile: 0.1 + 0.2 != 0.3 in IEEE arithmetic can produce "missing" entries if you compute keys differently on insert vs lookup. Prefer integers or normalized decimal types for money.

Hash flooding and defensive hashing

If an attacker can submit HTTP form fields whose names hash to the same bucket, they turn your O(1) map into O(n) per request — a hash flooding denial-of-service. The 2011 disclosures against Python, Ruby, and other languages prompted widespread changes: per-process random hash seeds, switching to SipHash or similar keyed hashes for attacker-controlled strings, and limiting request parameter counts.

Rule of thumb: fast non-crypto hashes for internal IDs you generate; keyed or randomized hashes for user-supplied map keys in network-facing code. Never expose raw hash values as authentication tokens — preimage resistance is a property of crypto hashes, not MurmurHash.

Where hash tables show up beyond `dict`

Symbol tables — compilers map variable names to stack slots; interpreters map names to objects.
Memoization — dynamic programming caches keyed by argument tuples.
Deduplication — seen = set() is a hash table with dummy values; membership is O(1) average.
Database hash indexes — equality lookups on indexed columns without B-tree log factor (see our database indexing guide for when B-trees win on range queries).
Distributed sharding — consistent hashing places keys on a ring of servers; same hash intuition, different scale.
Probabilistic structures — Bloom filters use multiple hash functions over a bit array for space-efficient "maybe present" tests.
Content addressing — Merkle trees and Git object IDs are cryptographic hashes used as immutable names, not for speed — but the mental model of key → slot parallels content → address.

Runtime snapshots (2026)

Implementations differ in iteration order guarantees, memory overhead, and concurrency:

Python 3.7+ — dict preserves insertion order as a side effect of compact open addressing; sets share the same machinery.
Java HashMap — chaining with treeification; iteration order undefined; ConcurrentHashMap stripes locks per bucket for parallel reads/writes.
Rust HashMap — SwissTable-style open addressing in std since recent versions; HashMap::with_hasher for custom hashers (e.g. faster non-crypto for integer keys).
Go map — open addressing with overflow buckets; not safe for concurrent write without a mutex; iteration order randomized per map to prevent order dependence.
C++ unordered_map — bucket interface exposing load factor and hash policy template parameters.

Choosing among them is rarely about asymptotics — all are O(1) average — and more about iterator stability, memory per entry, and whether your keys are attacker-controlled strings on a hot RPC path.

Hash tables vs sorted trees

Balanced trees (red-black, B-tree) offer O(log n) lookup but support ordered traversal and range scans. Hash tables win pure point lookups and inserts when order does not matter. Hybrid systems use both: hash index for equality, B-tree for WHERE price BETWEEN …. In-memory, if you need the k smallest keys frequently, a tree or heap may beat repeated hashing.

Practical checklist

Pick immutable, well-distributed keys; avoid floats for money.
Pre-size tables when entry count is known to skip resize pauses.
Use keyed or randomized hashes for user-supplied keys on network paths.
Monitor load factor and max chain/probe length in production caches — spikes predict latency cliffs before they become outages.
Do not use hash tables when you need sorted order or prefix range queries — use a tree index instead.
Remember deletion semantics in open addressing if you implement your own.
Test equality, not just hash equality — collisions require full key compare.