Explainer · 7 June 2026
How Unicode, UTF-8, and character encoding work
Every API that accepts a username, every database that stores a product title, and every wallet that signs a human-readable memo eventually hits the same question: what is a character, and how do you store it as bytes? Unicode is the global catalog that assigns a number — a code point — to nearly every written symbol in use. UTF-8 is the dominant encoding that turns those numbers into a byte stream computers can move over HTTP, write to disk, and hash. Confusing the catalog with the encoding, or treating one user-visible glyph as one array slot, is how production systems ship subtle bugs that only appear when someone types an accent, an emoji, or a combined Korean syllable.
From ASCII to a universal character set
Early computers standardized on ASCII: 128 code points mapping
bytes 0–127 to English letters, digits, and punctuation. Extended 8-bit
encodings (Latin-1, Windows-1252, Shift JIS) reused the high byte range for
different national scripts, so the byte 0xE9 means
é in Latin-1 but is only the first half of a two-byte character in Shift JIS.
Opening a file with the wrong encoding produces mojibake: garbled sequences
such as Ã© where é was intended, or the diamond-shaped replacement character
(U+FFFD) you still see in mislabeled email.
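To make the 0xE9 ambiguity concrete, here is a minimal Python sketch. The choice of Windows-1251 (Cyrillic) as the second code page is an assumption made for illustration, because a lone 0xE9 is not a complete character in Shift JIS and would simply fail to decode there.

```python
# One byte, two legacy code pages, two different letters.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")   # Latin small e with acute
as_cp1251 = raw.decode("cp1251")    # Cyrillic small short i

# Classic mojibake: the UTF-8 bytes for U+00E9 misread as Windows-1252.
utf8_bytes = "\u00e9".encode("utf-8")      # b'\xc3\xa9'
mojibake = utf8_bytes.decode("cp1252")     # two characters: A-tilde, copyright sign

# The same lone byte is not valid UTF-8; a lenient decoder substitutes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

print(repr(as_latin1), repr(as_cp1251), repr(mojibake), repr(replaced))
```

The bytes never change in any of these cases; only the table used to interpret them does.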
Unicode (whose repertoire is kept in sync with ISO/IEC 10646) replaces that tower of incompatible tables
with one namespace. Code point U+0041 is always Latin capital A;
U+20AC is always the euro sign; U+1F600 is always the
grinning-face emoji. The standard also defines properties — letter vs number,
uppercase mapping, bidirectional class — that power search, sorting, and
case-insensitive comparison. Unicode does not dictate how many bytes
store U+0041; that is the job of an encoding scheme.
Code points, scalars, and what users actually see
Developers often say "character" when they mean one of three different things:
- Code point: an integer in the range U+0000..U+10FFFF. The surrogate range U+D800..U+DFFF is reserved for UTF-16; every other code point is a Unicode scalar value. The string "e\u0301" is two code points: Latin small e plus combining acute accent.
- UTF-8 code unit sequence: the actual bytes on the wire. That same logical é might be one code point (U+00E9, precomposed) encoded as two bytes, or two code points encoded as three bytes: same appearance, different bytes.
- Grapheme cluster: what a reader perceives as one character on screen. Family emoji with skin-tone modifiers, flags built from regional-indicator pairs, and Indic conjuncts can span multiple code points but count as one grapheme for cursor movement and backspace.
Calling strlen in C or len(s) in Go on UTF-8 text returns
byte length, not grapheme count; Python 3's len(s) on a str counts code
points. JavaScript's s.length counts UTF-16 code units, so astral symbols
such as most emoji consume two "length" units each. Swift's String.count
counts extended grapheme clusters but can still surprise around flags. For
user-facing limits ("username max 32 characters"), you need an explicit
extended grapheme cluster algorithm (Unicode UAX #29), not a raw byte or
code-unit cap.
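The gap between these units is easy to see by counting one string three ways. The sketch below uses only the Python standard library; `length_report` is a hypothetical helper name. It deliberately leaves out grapheme counting, which needs a UAX #29 implementation from a third-party library.

```python
def length_report(s: str) -> dict:
    """Count the same string in three different units."""
    return {
        # A Python str is a sequence of code points.
        "code_points": len(s),
        # What a byte-oriented strlen would report for UTF-8 text.
        "utf8_bytes": len(s.encode("utf-8")),
        # What JavaScript's s.length would report (two bytes per UTF-16 unit).
        "utf16_units": len(s.encode("utf-16-le")) // 2,
    }

precomposed = "\u00e9"            # e-acute as one code point
decomposed = "e\u0301"            # e plus combining acute accent
grinning = "\U0001F600"           # astral-plane emoji
flag = "\U0001F1F8\U0001F1EA"     # two regional indicators, one flag on screen

for label, text in [("precomposed", precomposed), ("decomposed", decomposed),
                    ("grinning", grinning), ("flag", flag)]:
    print(label, length_report(text))
```

A reader sees one symbol in each of the four cases, yet none of the three counts is 1 for all of them.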
How UTF-8 encodes code points
UTF-8 is a variable-width, self-synchronizing encoding: ASCII bytes 0x00–0x7F encode themselves unchanged, which is why valid UTF-8 that contains only ASCII is also valid ASCII. Bytes with the high bit set start multibyte sequences whose leading byte tells you the total length:
- 0xxxxxxx: 1 byte, code points U+0000..U+007F
- 110xxxxx 10xxxxxx: 2 bytes, up to U+07FF
- 1110xxxx 10xxxxxx 10xxxxxx: 3 bytes, up to U+FFFF
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 bytes, up to U+10FFFF
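The bit patterns above translate directly into shifts and masks. The following is a teaching sketch of an encoder, with the hypothetical name `encode_utf8`; in real code you would call `str.encode("utf-8")`, which the loop at the end uses as a cross-check.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value by hand, following the four bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check the hand-rolled encoder against Python's built-in codec.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")
```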
Continuation bytes always begin with 10, so you can find character
boundaries by scanning backward from any offset — useful for log tailers and
network parsers that slice mid-packet. Invalid sequences (lone continuation
bytes, overlong encodings such as an ASCII character spelled with two or more
bytes, encoded surrogates, code points above U+10FFFF) must be rejected or
replaced; permissive decoders that accept malformed UTF-8 become security
footguns when combined with hash tables or path normalization, because two
different byte strings can decode to the same text.
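Both ideas, backward scanning and strict rejection, fit in a few lines. This is a sketch under two assumptions: `char_start` expects a buffer that is already valid UTF-8, and `is_valid_utf8` relies on Python's built-in decoder, which is strict by default and rejects overlong forms and encoded surrogates. Both function names are hypothetical.

```python
def char_start(buf: bytes, offset: int) -> int:
    """Return the index of the first byte of the sequence that covers offset.

    Assumes buf is valid UTF-8. Continuation bytes look like 10xxxxxx and a
    sequence has at most three of them, so the walk is bounded.
    """
    i = offset
    while i > 0 and offset - i < 3 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

def is_valid_utf8(buf: bytes) -> bool:
    """Strict validation: any malformed sequence makes the whole input invalid."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

data = "a\u20acb".encode("utf-8")       # 61 | e2 82 ac | 62
print(char_start(data, 3))              # offset 3 is inside the euro sign
print(is_valid_utf8(b"\xc0\xaf"))       # overlong two-byte spelling of "/"
```

Rejecting at ingress, rather than repairing, keeps a single byte representation per string downstream.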
UTF-16 still appears inside Windows APIs, Java strings, and
JavaScript engines as 16-bit code units with surrogate pairs for astral
planes. UTF-32 stores fixed four-byte scalars — simple to
index but wasteful for English-heavy text. On the public web (HTTP, JSON, HTML)
and in Rust's String type, UTF-8 won: no endianness debates on the
wire, compact storage for Latin scripts, and compatibility with C-style
null-terminated tools, because a zero byte appears in UTF-8 only as the
encoding of U+0000 and never inside a multibyte sequence.
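For comparison, here is a sketch of how UTF-16 splits an astral code point into a surrogate pair, plus a size comparison across the three encodings. The helper names `to_surrogate_pair` and `encoded_sizes` are hypothetical; the arithmetic follows the standard UTF-16 definition.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only astral code points need a surrogate pair")
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits
    return high, low

def encoded_sizes(s: str) -> dict:
    """Byte length of the same text in each encoding (no BOM)."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(encoded_sizes("hello"))
print(encoded_sizes("\U0001F600"))
```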
Normalization: when the same text has different bytes
Unicode allows multiple code-point sequences to render identically. The letter
é can be stored as a single precomposed code point
(U+00E9) or as e plus a combining acute
(U+0065 U+0301). Both render as the same glyph but
produce different UTF-8 byte strings, so naive string equality fails.
The standard defines normalization forms:
- NFC (canonical composition): prefer precomposed characters where defined; typical for web output and for most text as users type it.
- NFD (canonical decomposition): split combining marks; macOS HFS+ stores filenames in a variant of NFD (APFS instead preserves the form it is given), and some search tokenizers decompose before stripping accents.
- NFKC / NFKD: compatibility decomposition that also folds typographic variants (full-width digits, the ﬁ ligature); aggressive and lossy, so use it carefully before cryptographic hashing.
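Python's standard `unicodedata` module exposes all four forms, which makes the differences easy to check. A small sketch:

```python
import unicodedata

precomposed = "caf\u00e9"      # 4 code points
decomposed = "cafe\u0301"      # 5 code points, same appearance

# Naive equality fails even though both render identically.
print(precomposed == decomposed)

# Canonical forms convert between the two spellings.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)

# Compatibility forms also fold typographic variants.
fullwidth_digits = "\uff11\uff12\uff13"     # full-width 1, 2, 3
ligature = "\ufb01"                         # the fi ligature
print(unicodedata.normalize("NFKC", fullwidth_digits))
print(unicodedata.normalize("NFKC", ligature))

# NFC leaves the ligature alone: it is a compatibility variant, not a canonical one.
print(unicodedata.normalize("NFC", ligature) == ligature)
```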
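Production rule: pick one form at system boundaries. Normalize to NFC before indexing in full-text search, before comparing usernames for uniqueness, and before computing HMACs over "user-visible" strings. Store the normalized form plus the original bytes if you need lossless round-trip for legal evidence. Passwords need a deliberate policy rather than a default: hashing the raw bytes respects exotic compositions a user chose on purpose, but the same password typed on a different keyboard can arrive in a different form and fail to verify, which is why NIST SP 800-63B recommends applying NFKC or NFKD before hashing. Whichever policy you choose, apply it identically at registration and login, and enforce a byte-length cap to prevent denial-of-service via megabyte combining-mark stacks.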
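One way the boundary rule could look in code is sketched below. Everything named here (`canonical_text`, `UsernameRegistry`, `sign_memo`, `password_within_cap`, and the 1024-byte cap) is a hypothetical illustration, not a prescribed design; the in-memory dict stands in for whatever store holds usernames.

```python
import hashlib
import hmac
import unicodedata

MAX_PASSWORD_BYTES = 1024   # assumed cap; pick a value that suits your system

def canonical_text(s: str) -> str:
    """Normalize once, at the boundary, to NFC."""
    return unicodedata.normalize("NFC", s)

class UsernameRegistry:
    """Uniqueness is checked on the NFC form; the original input is kept too."""
    def __init__(self):
        self._taken = {}

    def register(self, name: str) -> bool:
        key = canonical_text(name)
        if key in self._taken:
            return False
        self._taken[key] = name     # original spelling kept for round-trip
        return True

def sign_memo(key: bytes, memo: str) -> str:
    """HMAC over the normalized UTF-8 bytes so equal-looking memos sign equally."""
    return hmac.new(key, canonical_text(memo).encode("utf-8"), hashlib.sha256).hexdigest()

def password_within_cap(password: str) -> bool:
    """Cap by encoded byte length to bound hashing cost."""
    return len(password.encode("utf-8")) <= MAX_PASSWORD_BYTES

registry = UsernameRegistry()
print(registry.register("Jos\u00e9"))     # precomposed spelling is accepted
print(registry.register("Jose\u0301"))    # decomposed spelling collides with it
```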
BOM, endianness, and transport metadata
UTF-8 technically does not need a byte-order mark (BOM), but Windows Notepad
historically prefixed files with EF BB BF. If your importer treats
the BOM as part of the first field, CSV headers break silently. HTTP solves
encoding at the protocol layer: Content-Type: text/html; charset=utf-8
and the HTML <meta charset="UTF-8"> declaration tell the
browser how to decode bytes before JavaScript runs. JSON (RFC 8259) requires
UTF-8 for text exchanged between systems that are not part of a closed
ecosystem, yet APIs still return wrong charset headers, so client libraries
should validate the bytes they receive rather than blindly trust labels.
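The BOM problem in CSV imports is reproducible in a few lines. This sketch uses Python's `utf-8-sig` codec, which strips a leading BOM if present and otherwise behaves like plain UTF-8; the sample file contents are made up for illustration.

```python
import csv
import io

# A file as an editor that prefixes a UTF-8 BOM might have saved it.
raw = b"\xef\xbb\xbfname,age\nAda,36\n"

# Decoding as plain UTF-8 keeps the BOM as U+FEFF inside the first header.
naive_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# The utf-8-sig codec drops a leading BOM and leaves BOM-less input untouched.
clean_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(naive_header)
print(clean_header)
```

A lookup such as `row["name"]` silently fails against the naive header because the key is really `"\ufeffname"`.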
When binary formats embed text — SQLite, Protocol Buffers, PDF — the schema
must state encoding explicitly. Blockchains often store arbitrary byte blobs
on-chain; off-chain indexers that display memos must not assume Latin-1.
Wallets showing "sign this message" should render UTF-8 decoded text and warn
when homoglyphs (Cyrillic а vs Latin a) appear in
URLs or recipient names.
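A rough mixed-script warning can be sketched with the standard library alone. The heuristic below takes the first word of each letter's Unicode character name as a stand-in for its script, which is an approximation and not the real Script property; a production check would more likely follow the Unicode security guidance (UTS #39) through a dedicated library. The function names are hypothetical.

```python
import unicodedata

def scripts_used(text: str) -> set:
    """Approximate the set of scripts in text from character names.

    Heuristic only: 'CYRILLIC SMALL LETTER A' -> 'CYRILLIC'. Digits,
    punctuation, and unnamed characters are skipped.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def looks_mixed_script(text: str) -> bool:
    """Flag strings whose letters come from more than one script."""
    return len(scripts_used(text)) > 1

genuine = "paypal"
spoofed = "p\u0430ypal"     # second letter is Cyrillic small a (U+0430)

print(scripts_used(genuine), looks_mixed_script(genuine))
print(scripts_used(spoofed), looks_mixed_script(spoofed))
```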
Collation, case mapping, and locale
Sorting code points by numeric value is not alphabetical order in any human
language. Swedish sorts z before å; German phonebook
order treats ö as oe. Unicode's default collation
(UCA) plus CLDR locale data powers Intl.Collator in browsers and
ICU-based collation on servers. Case mapping for case-insensitive
comparison can be locale-sensitive too: Turkish dotted and dotless I break naive
toLowerCase() rules.
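A short Python sketch shows both problems without depending on installed system locales: code-point sorting, and the locale-independent case mapping that Turkish text trips over. Real locale-aware ordering would need ICU or a similar collation library, which this example deliberately leaves out.

```python
words = ["zebra", "apple", "\u00e9clair", "Zulu"]

# Plain sort compares code points: uppercase first, accented letters last.
by_code_point = sorted(words)

# Case folding fixes the uppercase problem but e-acute still sorts after z.
by_casefold = sorted(words, key=str.casefold)

print(by_code_point)
print(by_casefold)

# Default case mapping ignores locale.
dotted_capital_i = "\u0130"                   # Turkish capital dotted I
lowered = dotted_capital_i.lower()            # i plus combining dot above
print(len(lowered), [f"U+{ord(c):04X}" for c in lowered])
print("I".lower() == "\u0131")                # Turkish would expect dotless i

# casefold() is the tool for caseless matching, not for display.
print("Stra\u00dfe".casefold())
```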
For identifiers (slugs, package names), stick to ASCII subsets like
[a-z0-9_-] and reject everything else at validation time — simpler
than full Unicode case folding. For display names and search, invest in proper
locale-aware collation and NFC normalization upstream.
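For the identifier case, validation can be a single allow-list pattern. A minimal sketch, assuming a hypothetical `is_valid_slug` helper and a 64-character cap chosen only for illustration:

```python
import re

# Allow-list: lowercase ASCII letters, digits, underscore, hyphen.
SLUG_RE = re.compile(r"[a-z0-9_-]+")

def is_valid_slug(s: str, max_len: int = 64) -> bool:
    """Accept only non-empty strings made entirely of allowed ASCII characters.

    fullmatch avoids the trailing-newline loophole that '$' has with match().
    """
    return len(s) <= max_len and SLUG_RE.fullmatch(s) is not None

for candidate in ["my-package_2", "My-Package", "caf\u00e9", "p\u0430ypal", ""]:
    print(repr(candidate), is_valid_slug(candidate))
```

Because every accepted character is one byte in UTF-8, byte length, code-point length, and grapheme length all agree for valid slugs.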
Common production bugs (and fixes)
- Truncating by byte index: slicing UTF-8 at 255 bytes can split a multibyte character; decode to code points or graphemes first, then truncate, then re-encode.
- Regex with byte semantics: . may match half an emoji; use Unicode-aware flags (/u in JavaScript, (?u) in some engines) or dedicated libraries.
- Database charset mismatches: tables created as latin1 with a UTF-8 connection corrupt stored emoji; migrate to utf8mb4 in MySQL or native UTF-8 in Postgres.
- Filename NFC vs NFD: zip archives built on macOS (NFD) vs Linux servers (often NFC) break deduplication; normalize on ingest.
- Substring search without normalization: users search for café typed as a precomposed character but the document contains café as e plus a combining mark; normalize both sides or use a search engine with an analyzer.
- Emoji in passwords: valid Unicode, but mobile keyboards change emoji presentation; prefer passphrases of words if emoji cause support burden.
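Two of the fixes above, safe truncation and normalized search, can be sketched with the standard library. The helper names are hypothetical. Note that `truncate_utf8` cuts on code-point boundaries only, so it can still separate a base letter from its combining mark; grapheme-safe truncation needs a UAX #29 library.

```python
import unicodedata

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes without a torn sequence."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end mid-sequence; errors="ignore" drops that partial tail.
    # The input came from str.encode, so no other bytes can be invalid.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

def contains_normalized(haystack: str, needle: str) -> bool:
    """Substring test after normalizing both sides to NFC."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

print(repr(truncate_utf8("ab\u20ac", 3)))           # euro sign needs 3 bytes, only 1 left
document = "un cafe\u0301 noir"                      # decomposed e-acute
query = "caf\u00e9"                                  # precomposed e-acute
print(query in document, contains_normalized(document, query))
```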
Practical checklist
- Declare UTF-8 at every boundary: HTTP headers, HTML meta, database connection, file writes.
- Normalize to NFC for equality, uniqueness checks, and search indexes unless you have a specific reason for NFD.
- Measure limits in graphemes or code points for UX; measure storage in bytes for capacity planning.
- Reject invalid UTF-8 at ingress; do not round-trip through Latin-1 "fixes."
- Use locale-aware collation for sorted lists shown to humans; use raw byte comparison only for cryptographic identifiers.
- Test with combining accents, ZWJ family emoji, RTL Arabic, and mixed scripts — not just ASCII fixtures.
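As a starting point for the last checklist item, a fixture table might look like the sketch below. The specific strings and helper names are illustrative assumptions; extend the table with whatever scripts your users actually write in.

```python
import unicodedata

# Hypothetical fixture set covering the cases named in the checklist.
FIXTURES = {
    "combining_accent": "cafe\u0301",
    "zwj_family": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    "rtl_arabic": "\u0645\u0631\u062D\u0628\u0627",
    "mixed_script": "p\u0430ypal",
    "flag": "\U0001F1F8\U0001F1EA",
}

def survives_round_trip(text: str) -> bool:
    """Encoding to UTF-8 and back must return the identical string."""
    return text.encode("utf-8").decode("utf-8") == text

def is_nfc(text: str) -> bool:
    """True when the text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text

for label, text in FIXTURES.items():
    print(label, len(text), len(text.encode("utf-8")),
          survives_round_trip(text), is_nfc(text))
```

Running the same fixtures through your real storage and API layers, instead of a bare encode and decode, is where the mismatches described above tend to surface.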