Explainer · 7 June 2026

How data serialization formats work: JSON, Protobuf, MessagePack, and Avro

Your microservice has a user object in memory — name, email, subscription tier, last login timestamp. To send it over HTTP, enqueue it on Kafka, or write it to S3, you must serialize it: turn structured data into an ordered byte sequence that another process can deserialize back into structure. The choice of format shapes bandwidth bills, parse latency, schema flexibility, and how painful upgrades are six months later. JSON dominates public REST APIs because humans can read it in browser devtools. Protocol Buffers and gRPC dominate internal RPC because they are compact and fast. MessagePack offers a middle path. Apache Avro was built for data lakes and Kafka where schemas evolve weekly. Understanding what each format encodes on the wire — and what it refuses to guarantee — saves you from silent data corruption and 10× payload bloat.

What serialization must decide

Every format answers the same design questions:

Text vs binary — Can you curl the payload and eyeball it, or do you need a hex dump and a schema?
Schema — Is the shape of data implied by field names in the message (schemaless), declared in a separate .proto / Avro IDL file (schema-first), or embedded in every message (self-describing)?
Field identity — Does the wire format repeat keys like "email" on every object, or replace them with numeric field tags (Protobuf) that cost a byte or two regardless of how long the field name is?
Type system — Are integers always 64-bit? Are floats IEEE 754? Can you represent dates, decimals, or arbitrary-precision money safely?
Evolution — Can you add a field without breaking old readers? Rename a field? Change a type?

No format optimizes all axes. Pick based on who reads the bytes (humans vs machines), how often schemas change, and whether you control both ends of the connection.

JSON: the universal text interchange format

JSON (JavaScript Object Notation) encodes objects as { "key": value }, arrays as [ ... ], strings in double quotes with escape sequences, numbers as decimal literals, and booleans/null as bare tokens. It is schemaless: the consumer infers structure from the payload. Every HTTP JSON API, browser fetch response, and config file you have touched probably uses it.
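As a small illustration of the syntax just described, the sketch below round-trips a user record with Python's standard json module. The record and its field names are hypothetical, chosen to match the example in the introduction.

```python
import json

# Hypothetical user record; the field names are illustrative only.
user = {
    "name": "Ada",
    "email": "ada@example.com",
    "tier": "pro",
    "last_login": "2026-06-01T12:00:00Z",  # no native date type, so a string
    "beta_features": None,                 # becomes the bare token null
}

# Compact separators drop the optional whitespace after ',' and ':'.
wire = json.dumps(user, separators=(",", ":")).encode("utf-8")
print(wire.decode("utf-8"))

# The consumer infers structure from the payload alone: no schema involved.
decoded = json.loads(wire)
print(decoded == user)
```

Nothing in the payload says that last_login is a timestamp; that knowledge lives in documentation or in the consumer's code.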

Strengths: debuggability, universal library support, easy integration with JavaScript runtimes and document databases. Weaknesses: it is verbose, because field names repeat on every object; numbers are text, so parsing is slower than binary; there is no built-in date or binary blob type (ISO 8601 and Base64 strings stand in instead); and duplicate keys and number precision are left to implementations by the spec. Integers beyond JavaScript's safe integer range (magnitudes above 2⁵³ − 1) lose precision when parsed into a native Number, a common bug when blockchain apps return lamports or token amounts as JSON numbers instead of strings.
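The large-integer problem can be reproduced without a browser. The sketch below simulates a consumer that maps every JSON number to an IEEE 754 double (as JavaScript's JSON.parse does) by passing parse_int=float to Python's json module; the "amount" field is a made-up example.

```python
import json

big = 2**53 + 1  # 9007199254740993, not exactly representable as a double

# Python's json module writes arbitrary-precision ints, so the text is exact.
text = json.dumps({"amount": big})

# Simulate a double-only consumer: every integer literal becomes a float.
lossy = json.loads(text, parse_int=float)["amount"]
print(int(lossy) == big)  # the last digit has changed

# Sending the amount as a string keeps every digit intact.
safe_text = json.dumps({"amount": str(big)})
safe = int(json.loads(safe_text)["amount"])
print(safe == big)
```

The failure is silent: the lossy parse raises no error, it just returns a neighbouring value.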

JSON assumes valid UTF-8 text. Escaping and surrogate-pair handling matter when names or values carry emoji or non-Latin scripts.
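A quick way to see escaping and surrogate pairs in action, using only Python's standard library (the sample name is arbitrary):

```python
import json

name = "Zoë 😀"

# Default: every non-ASCII character is escaped. Characters outside the Basic
# Multilingual Plane, such as the emoji, become a UTF-16 surrogate pair.
escaped = json.dumps(name)
print(escaped)

# ensure_ascii=False writes the characters directly; the result is then
# encoded as UTF-8 when it goes on the wire.
raw = json.dumps(name, ensure_ascii=False)
print(len(escaped.encode("utf-8")), len(raw.encode("utf-8")))

# Both forms decode to the same string.
print(json.loads(escaped) == json.loads(raw) == name)
```

Both outputs are valid JSON; the escaped form is pure ASCII but roughly twice as long for this input.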

Protocol Buffers: schema-first binary with field tags

Google's Protocol Buffers (protobuf) start with a .proto schema that declares messages, field numbers, and types. On the wire, each field is a tag (field number + wire type) followed by a value. Field names are not transmitted, only the numeric tags fixed in the schema. That makes repeated messages much smaller than JSON: a million User records do not repeat the bytes for "email" a million times.
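To make the tag-plus-value layout concrete, here is a hand-rolled sketch of the encoding for two field kinds. It is not the protobuf library, only an illustration of the byte layout; the User schema in the comment is hypothetical, and the sketch handles non-negative integers and strings only.

```python
WIRE_VARINT = 0  # wire type for int32, int64, bool, enum
WIRE_LEN = 2     # wire type for strings, bytes, embedded messages


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    # The tag packs the field number and the wire type into one varint.
    return encode_varint((field_number << 3) | wire_type)


def encode_int_field(field_number: int, value: int) -> bytes:
    return encode_tag(field_number, WIRE_VARINT) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_tag(field_number, WIRE_LEN) + encode_varint(len(data)) + data


# Hypothetical schema: message User { int32 id = 1; string email = 2; }
message = encode_int_field(1, 150) + encode_string_field(2, "a@b.co")
print(message.hex(), len(message), "bytes")
```

The strings "id" and "email" never appear in the output; a reader without the schema sees only field 1 and field 2.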

Protobuf encodes integers with variable-length encoding (varint): small values use fewer bytes. Strings and embedded messages are length-prefixed. Unknown fields are preserved or skipped, enabling backward-compatible evolution: add new optional fields with new numbers; never reuse or renumber existing tags. Removing a field? Reserve its number in the schema so nobody accidentally recycles it.
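The skip-unknown-fields behaviour follows directly from the wire format: the wire type in each tag tells a reader how many bytes to step over even when it has never heard of the field number. The decoder below is a simplified sketch, not the protobuf library; the field-number-to-name mapping is hypothetical, and the sketch does no bounds checking on truncated input.

```python
def read_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_known(buf: bytes, known: dict):
    """Decode fields listed in `known` ({number: name}); skip everything else."""
    fields, skipped = {}, []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 1:          # fixed 64-bit
            value = buf[pos:pos + 8]
            pos += 8
        elif wire_type == 5:          # fixed 32-bit
            value = buf[pos:pos + 4]
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if number in known:
            fields[known[number]] = value
        else:
            skipped.append(number)    # an old reader steps over new fields
    return fields, skipped


# A newer writer added field 3 (a varint); this reader only knows fields 1 and 2.
wire = bytes.fromhex("08960112066140622e636f1801")
fields, skipped = decode_known(wire, {1: "id", 2: "email"})
print(fields, skipped)
```

If a later schema revision reused number 3 for a string, the old bytes would still parse but mean something else, which is why retired numbers are reserved rather than recycled.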

gRPC uses protobuf as its default payload on HTTP/2 multiplexed streams — see HTTP/2 and HTTP/3 for why one TCP connection carrying many RPCs beats opening a new HTTP/1.1 request per call. Protobuf is a poor fit for public APIs you want strangers to curl without generated stubs, and it does not self-describe: you need the schema file or compiled descriptors to decode arbitrary messages.

MessagePack: JSON's binary cousin

MessagePack serializes the JSON data model (maps, arrays, strings, numbers, booleans, null), plus a raw binary type, into a compact binary form. Small integers encode in a single byte; strings are length-prefixed without quote characters; map keys still appear on every object: unlike protobuf tags, the key strings ride along each time.
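The following sketch hand-encodes a small subset of MessagePack (nil, booleans, integers up to 32 bits, short strings, small arrays and maps) to show where the savings over JSON come from. It is an illustration of the byte layout under those restrictions, not a substitute for a real MessagePack library, and the sample record is made up.

```python
import json
import struct


def pack(obj) -> bytes:
    if obj is None:
        return b"\xc0"
    if isinstance(obj, bool):  # check before int: bool is an int subclass
        return b"\xc3" if obj else b"\xc2"
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                      # positive fixint, 1 byte
        if -32 <= obj < 0:
            return struct.pack("b", obj)             # negative fixint, 1 byte
        if 0 <= obj <= 0xFF:
            return b"\xcc" + struct.pack("B", obj)   # uint 8
        if 0 <= obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16, big-endian
        if 0 <= obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)  # uint 32, big-endian
        raise ValueError("integer out of range for this sketch")
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) < 32:
            return bytes([0xA0 | len(data)]) + data  # fixstr: length in the type byte
        if len(data) < 256:
            return b"\xd9" + bytes([len(data)]) + data  # str 8
        raise ValueError("string too long for this sketch")
    if isinstance(obj, list) and len(obj) < 16:
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)
    if isinstance(obj, dict) and len(obj) < 16:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body
    raise TypeError(f"unsupported value for this sketch: {obj!r}")


record = {"id": 150, "email": "a@b.co"}
packed = pack(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(packed.hex())
print(len(packed), "bytes packed vs", len(as_json), "bytes of JSON")
```

The key strings "id" and "email" are still in the packed bytes; the savings come from dropping quotes, colons, commas, and decimal digits, not from removing field names.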

MessagePack is schemaless like JSON: no .proto file required. That makes it attractive when you want smaller payloads and faster parsing than JSON but cannot commit to a rigid schema — Redis clients, game state snapshots, or internal services that already speak JSON-shaped objects. Trade-off: you do not get protobuf's strong evolution rules; adding fields is usually safe, but changing types or semantics is still a coordination problem between producers and consumers.

Apache Avro: schemas for data pipelines

Avro targets high-volume event logs and analytics. Schemas are JSON documents (or Avro IDL), typically registered in a schema registry. An Avro container file embeds the writer's schema in its header, while a Kafka message usually carries only a short schema ID or fingerprint that the reader looks up. Readers use a writer schema (what produced the bytes) and a reader schema (what the consumer expects), and Avro resolves the differences field by field.

That resolution model shines when you rename columns (through aliases), add fields with defaults, or promote types in a warehouse where petabytes of historical data must remain readable. Avro records are compact and row-oriented; Spark, Flink, and Kafka Connect treat Avro as a first-class citizen. Downside: tooling is heavier than JSON or MessagePack, and it is overkill for a simple mobile app API.
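The idea behind reader/writer resolution can be sketched over plain dictionaries. The function below is a deliberately simplified model, not the Avro library: real Avro resolves the two schemas while decoding binary data, and supports more rules than the three shown here (aliases, defaults, and int-to-double promotion). The schema and record are hypothetical.

```python
MISSING = object()  # sentinel, so that None can still be a legitimate value


def resolve(writer_record: dict, reader_schema: dict) -> dict:
    """Project a record written under an older schema onto the reader's schema."""
    out = {}
    for field in reader_schema["fields"]:
        # A reader field matches by its own name or by any declared alias.
        names = [field["name"], *field.get("aliases", [])]
        value = next((writer_record[n] for n in names if n in writer_record), MISSING)
        if value is MISSING:
            if "default" not in field:
                raise ValueError(f"no value or default for field {field['name']!r}")
            value = field["default"]
        # Type promotion: an int written by the producer may be read as a double.
        if field["type"] == "double" and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        out[field["name"]] = value
    return out  # writer fields the reader does not declare are dropped


reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "email", "type": "string"},
        {"name": "plan", "type": "string", "aliases": ["tier"]},    # renamed field
        {"name": "logins", "type": "double"},                       # promoted from int
        {"name": "region", "type": "string", "default": "unknown"}, # added later
    ],
}

old_record = {"email": "a@b.co", "tier": "pro", "logins": 3, "legacy_flag": True}
print(resolve(old_record, reader_schema))
```

A reader field with neither a matching writer field nor a default is the case that fails, which is why new fields are normally added with defaults.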

Pair Avro with message queues when consumers upgrade on different schedules and you cannot afford to replay or drop old topics every deploy.

Other formats worth knowing

CBOR (Concise Binary Object Representation) — IETF standard similar to MessagePack; used in COSE and CWT (the binary counterparts of JOSE and JWT) and some IoT stacks.
FlatBuffers / Cap'n Proto — zero-copy access: parse without deserializing into separate heap objects; popular in games and mobile when latency dominates.
BSON — MongoDB's binary JSON extension with explicit binary and date types.
XML — still entrenched in enterprise SOAP and document interchange; verbose but schema-validatable (XSD).
Native language pickling — Python pickle, Java serialization: convenient but unsafe across trust boundaries; never expose to untrusted input.

Comparison at a glance

Human readable — JSON yes; protobuf, MessagePack, Avro binary no (Avro JSON encoding exists but is rare on the wire).
Typical size vs JSON — protobuf and Avro often 3–10× smaller on nested structs; MessagePack modestly smaller; depends on field names and numeric density.
Parse speed — binary formats generally faster; protobuf and FlatBuffers lead benchmarks; JSON pays UTF-8 parsing and decimal conversion.
Schema required — protobuf and Avro yes (Avro can embed); JSON and MessagePack no.
Best default use — JSON public REST/GraphQL; protobuf internal gRPC; MessagePack cache/session blobs; Avro Kafka/data lake.

Evolution and compatibility rules

Breaking changes usually come from renumbering protobuf fields, removing required Avro fields without defaults, or changing JSON field types from string to number while mobile apps still expect strings. Safe patterns:

Additive changes only — new optional fields with defaults; old code ignores unknown fields (protobuf, Avro) or extra JSON keys.
Never reuse field numbers or names with different semantics.
Version your API at the URL or package level when you must make breaking changes — serialization alone cannot save you from deleting password_hash and hoping old clients cope.
Contract tests — golden-file round-trip tests that serialize sample messages and assert bytes or canonical JSON match expectations.
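A golden-file contract test can be very small. The sketch below uses canonical JSON (sorted keys, no optional whitespace) so the comparison is byte-exact; the golden bytes are inlined here for brevity, whereas a real suite would keep them in a checked-in file. The sample record and function names are hypothetical.

```python
import json


def canonical_json(obj) -> bytes:
    """Deterministic encoding: sorted keys, no optional whitespace, UTF-8."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


# Stand-in for the contents of a checked-in golden file.
GOLDEN = b'{"email":"a@b.co","id":150,"tier":"pro"}'


def check_contract(sample: dict, golden: bytes) -> None:
    encoded = canonical_json(sample)
    if encoded != golden:
        raise AssertionError(f"wire format drifted: {encoded!r} != {golden!r}")
    if json.loads(encoded) != sample:
        raise AssertionError("round trip changed the data")


sample = {"id": 150, "email": "a@b.co", "tier": "pro"}
check_contract(sample, GOLDEN)
print("contract holds")
```

Changing id from a number to a string, or renaming tier, makes the check fail in CI instead of in a consumer.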

In event-driven systems, consumers can lag producers by hours; a schema registry rejects incompatible Avro schemas at registration time, before poison messages hit the topic.

Numeric and money pitfalls

JSON numbers inherit the float problem documented in IEEE 754 floating point whenever a parser maps them to doubles, as JavaScript does. Financial amounts and token balances should serialize as integers in the smallest unit (lamports, wei, satoshis), quoted as strings once they can exceed 2⁵³, or as decimal strings, never as JSON floats. Protobuf offers int64, sint64, and third-party decimal extensions; Avro has a decimal logical type stored in bytes or fixed. Picking the wrong type here is a production incident waiting for a large transfer.
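A short sketch of the two safe patterns in Python, using the standard decimal module. The payload shapes and field names are invented for illustration.

```python
import json
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
print(0.1 + 0.2 == 0.3)  # False

# Pattern 1: decimal string on the wire, Decimal in memory.
payload = json.dumps({"amount": "19.99", "currency": "USD"})
amount = Decimal(json.loads(payload)["amount"])
total = amount * 3
print(total)  # exact decimal arithmetic

# Pattern 2: integer count of the smallest unit (here, cents).
cents_payload = json.dumps({"amount_minor": 1999, "currency": "USD"})
cents = json.loads(cents_payload)["amount_minor"]
print(cents * 3)

# If a producer already emits bare JSON numbers, parse_float=Decimal at least
# avoids a detour through binary floating point on the consuming side.
legacy = json.loads('{"amount": 19.99}', parse_float=Decimal)["amount"]
print(legacy)
```

The minor-unit pattern still needs the string treatment once values can exceed 2⁵³, since many consumers parse JSON numbers as doubles.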

How to choose in practice

Public HTTP API, browser clients, quick iteration — JSON (optionally compressed with gzip or Brotli at the transport layer).
Internal service mesh, strict contracts, high QPS — protobuf + gRPC; generate stubs in Go, Rust, Java, TypeScript.
Redis/cache/session storage of JSON-shaped blobs — MessagePack if size matters and both sides use compatible libraries.
Kafka, S3 data lake, Spark jobs, schema registry — Avro or protobuf with registry; Avro when reader/writer schema resolution is central.
Blockchain RPC and on-chain programs — custom binary layouts (for example Borsh, which the Anchor framework uses) optimized for deterministic size and parse cost; JSON only at the wallet/UI boundary.

Implementation checklist

Measure payload size and p99 parse time on realistic messages — not toy structs.
Document whether unknown fields are ignored or rejected.
Store money and large integers as strings in JSON; use int64 or decimal types in binary schemas.
Register schemas before publishing to shared topics; block incompatible changes in CI.
Log content-type and schema version on deserialization failures for fast debugging.
Avoid language-native pickling across service boundaries.
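For the logging item in the checklist, one possible shape is a thin wrapper around the decoder. The sketch below assumes the caller already knows the content type and schema version (for example from message headers); the logger name, function name, and exception class are hypothetical.

```python
import json
import logging

log = logging.getLogger("codec")  # hypothetical logger name


class DecodeError(Exception):
    """Raised when a payload cannot be deserialized."""


def decode_json(body: bytes, content_type: str, schema_version: str):
    try:
        return json.loads(body)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        # Log what an on-call engineer needs first: which format and schema
        # version the producer claimed, and how large the payload was.
        log.error(
            "deserialization failed content_type=%s schema_version=%s bytes=%d error=%s",
            content_type, schema_version, len(body), exc,
        )
        raise DecodeError(
            f"cannot decode {content_type} (schema {schema_version})"
        ) from exc


print(decode_json(b'{"id": 150}', "application/json", "v3"))
```

Logging the payload itself is deliberately left out of this sketch, since bodies may contain personal data.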