Explainer · 7 June 2026

How lossless compression algorithms work

A 4 MB JSON API response crosses the wire on every page load. gzip shrinks it to 400 KB. The client decompresses, parses, and renders — bit-for-bit identical to the original. That is lossless compression: exploit redundancy in the data, encode it more efficiently, and guarantee perfect reconstruction. Lossy codecs (JPEG, MP3) throw information away; lossless ones do not. Understanding the two-stage pipeline most modern formats use — find repeated structure, then assign shorter codes to frequent symbols — explains why text compresses beautifully, why encrypted blobs do not, and why your CDN bills care which algorithm you pick.

Entropy: the theoretical floor

Claude Shannon's information theory sets a hard limit. Each symbol in a message carries a certain amount of unpredictability measured in bits of entropy. English prose averages roughly 1–2 bits per character even though each letter is stored as 8 bits in ASCII. The gap between storage size and true information content is redundancy — and redundancy is what compressors eat.

A perfectly random byte stream has ~8 bits of entropy per byte. No algorithm can compress it losslessly without growing the file (you need metadata to decompress). This is why you cannot gzip an already-compressed PNG or a properly encrypted wallet backup and expect savings — you are already near the entropy floor. Compressors that "work" on random data are either lossy, buggy, or adding expansion overhead.

Simple patterns: run-length and dictionary ideas

The earliest schemes are instructive even if you never use them directly. Run-length encoding (RLE) replaces runs of identical bytes with a count: AAAAABBB becomes 5A3B. It excels on fax lines and simple bitmaps; it fails on heterogeneous text.

LZ77 (Lempel–Ziv, 1977) generalizes the idea: instead of only repeating the same byte, look backward in a sliding window and replace upcoming bytes with a pointer — (distance, length) back to an earlier occurrence. The phrase "the quick brown fox" contains "the" twice; the second "the" becomes "go back 16 bytes, copy 3." No separate dictionary file is stored; the decompressor rebuilds matches from data it has already seen. LZ78 and derivatives build explicit dictionaries, but the sliding-window model powers most formats you touch daily.

Huffman coding: shorter codes for common symbols

After LZ-style stages emit a stream of literals and pointers, many symbols appear far more often than others. Huffman coding builds a binary tree where frequent symbols get short bit sequences and rare symbols get long ones — always prefix-free so the decoder never ambiguates where one code ends and the next begins.

Example intuition: if the byte e appears 12% of the time and z appears 0.1%, assigning e a 3-bit code and z a 17-bit code beats fixed 8-bit storage on average. Huffman is optimal for a given static frequency table. Arithmetic coding and range coding squeeze slightly closer to entropy by encoding entire messages as fractions of a number interval; Brotli and some video codecs use variants. For teaching purposes, Huffman trees are the clearest mental model of entropy coding.

DEFLATE: LZ77 + Huffman = gzip and zlib

DEFLATE (RFC 1951) is the workhorse behind .gz files and HTTP Content-Encoding: gzip. Stage one applies LZ77 (window size typically 32 KB). Stage two Huffman-encodes the resulting literals and (distance, length) pairs. zlib wraps DEFLATE with a small header and Adler-32 checksum; gzip adds a different header/footer and CRC32 — same core, different container.

PNG uses DEFLATE on filtered image rows. Git packs objects with zlib. Solana transaction data on RPC responses is often gzip-compressed by nginx when clients send Accept-Encoding: gzip. The format is decades old but still everywhere because decompress speed is fast and compression ratio is decent on text-heavy payloads.

Modern codecs: speed vs ratio trade-offs

Production systems now pick among several descendants, each tuning the LZ + entropy balance differently:

LZ4 — extremely fast compress and decompress, modest ratio. Used for real-time logs, database page compression, and kernel buffers where CPU matters more than wire savings.
zstd (Zstandard) — Facebook/Meta design; wide compression-level range from "faster than gzip" to "better ratio than gzip at similar speed." Dictionary training on sample corpora helps repetitive API schemas. Increasingly default in databases and backup tools.
Brotli — Google design tuned for web text and fonts; static dictionaries for common HTML/JS tokens. Better ratio than gzip on text at cost of slower compression (fine for pre-compressed assets at build time). Common as Content-Encoding: br on CDNs.
Snappy / LZO — older "favor speed" choices still embedded in Hadoop pipelines and some RPC frameworks.

The right choice depends on where compression runs. Pre-compress assets at build time with Brotli level 11; compress RPC responses on the fly with gzip or zstd level 3; compress write-ahead logs with LZ4 so fsync latency stays predictable.

Lossless vs lossy: when each wins

Photos and audio mask discarded detail with perceptual models — JPEG quantizes DCT coefficients, MP3 removes frequencies your ear barely hears. Text, code, databases, and blockchain state need exact bytes back. A single flipped bit in a Solana transaction signature invalidates the whole message; lossy "compression" is not an option there.

Some domains use both: video containers losslessly store metadata while lossy codecs handle frames. LLM inference stacks experiment with lossy KV-cache compression (quantizing attention key/value tensors) because a little numerical drift may be acceptable — that is a different problem from shipping a JSON config file intact.

When compression fails or backfires

Already compressed or encrypted data — PNG, ZIP, AES ciphertext. Attempting gzip adds header overhead and wastes CPU.
Tiny payloads — compression headers can exceed savings on sub-kilobyte responses. Many frameworks skip compression below 1–2 KB.
CRIME/BREACH-class attacks — compressing HTTPS responses that mix attacker-controlled input with secrets can leak information through ciphertext length. Disable compression on sensitive authenticated pages or separate secrets from user input.
CPU-bound servers — aggressive Brotli on every dynamic response under load can spike latency more than bandwidth savings help. Precompute or cache compressed blobs.
Zip bombs and decompression bombs — tiny compressed files that expand to gigabytes. Always cap decompress output size in untrusted pipelines (package uploads, user imports).

How the pieces fit a web stack

A typical static site path: build tools emit .br and .gz variants of style.css and JS bundles. nginx or Cloudflare serves the best encoding the browser advertises in Accept-Encoding, with long Cache-Control headers so repeat visitors hit edge caches. Dynamic JSON APIs gzip on the fly when responses exceed the minimum size threshold. Database pages compress with LZ4 or zstd inside the storage engine so disk I/O drops without the application knowing.

The through-line: identify repeated structure (LZ family), then encode the resulting symbol stream efficiently (Huffman or better). Every lossless format you meet is a variation on that theme with different window sizes, hash chains, entropy coders, and container metadata.

Practical checklist

Pre-compress static assets at build time; use on-the-fly compression only for dynamic responses large enough to matter.
Advertise gzip and br (or zstd if your CDN supports it) and verify with curl -H 'Accept-Encoding: br'.
Never compress secrets mixed with attacker-controlled fields on the same TLS connection without mitigations.
Cap decompress size and time when handling uploads from users.
Measure end-to-end latency, not just byte counts — a smaller payload that costs 50 ms of CPU may be slower on mobile.
Store already-compressed media natively; do not double-wrap.