Explainer · 7 June 2026

How Unicode, UTF-8, and character encoding work

Every API that accepts a username, every database that stores a product title, and every wallet that signs a human-readable memo eventually hits the same question: what is a character, and how do you store it as bytes? Unicode is the global catalog that assigns a number — a code point — to nearly every written symbol in use. UTF-8 is the dominant encoding that turns those numbers into a byte stream computers can move over HTTP, write to disk, and hash. Confusing the catalog with the encoding, or treating one user-visible glyph as one array slot, is how production systems ship subtle bugs that only appear when someone types an accent, an emoji, or a combined Korean syllable.

From ASCII to a universal character set

Early computers standardized on ASCII: 128 code points mapping bytes 0–127 to English letters, digits, and punctuation. Extended encodings (Latin-1, Windows-1252, Shift JIS) reused the byte values 0x80–0xFF for different national scripts, so the byte 0xE9 means é in Latin-1 but is the first byte of a two-byte kanji in Shift JIS. Opening a file without knowing its encoding produced mojibake: garbled text such as Ã© in place of é, or the diamond question marks you still see in mislabeled email.

Unicode (kept in sync, code point for code point, with ISO/IEC 10646) replaces that tower of incompatible tables with one namespace. Code point U+0041 is always Latin capital A; U+20AC is always the euro sign; U+1F600 is always the grinning-face emoji. The standard also defines properties (letter vs number, uppercase mapping, bidirectional class) that power search, sorting, and case-insensitive comparison. Unicode does not dictate how many bytes store U+0041; that is the job of an encoding scheme.

Code points, scalars, and what users actually see

Developers often say "character" when they mean one of three different things:

  • Code point — an integer in the range U+0000..U+10FFFF (with surrogate blocks reserved for UTF-16). The string "e\u0301" is two code points: Latin small e plus combining acute accent.
  • UTF-8 code unit sequence — the actual bytes on the wire. That same logical é might be one code point (U+00E9, precomposed) encoded as two bytes, or two code points encoded as three bytes — same appearance, different bytes.
  • Grapheme cluster — what a reader perceives as one character on screen. Family emoji with skin-tone modifiers, flags built from regional-indicator pairs, and Indic conjuncts can span multiple code points but count as one grapheme for cursor movement and backspace.

What a length function returns depends on the language. C's strlen and Go's len(s) return the UTF-8 byte length; Python 3's len(s) on a str counts code points; JavaScript's s.length counts UTF-16 code units, so astral symbols such as most emoji consume two "length" units. Swift's String.count counts extended grapheme clusters, although results for flags and newer emoji sequences can vary with the Unicode version the runtime ships. For user-facing limits ("username max 32 characters"), you need an explicit extended grapheme cluster algorithm (Unicode UAX #29), not a raw byte or code-unit cap.
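
The gap between these counts is easy to see in a few lines of Python. This is a minimal sketch using only the standard library; the helper name length_report is made up for illustration. It does not count graphemes, because that needs a third-party library implementing UAX #29.

```python
def length_report(text: str) -> dict:
    """Return three different 'lengths' of the same string."""
    return {
        # Python 3 str is a sequence of code points.
        "code_points": len(text),
        # Bytes that would travel on the wire as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # 16-bit code units, which is what JavaScript's .length reports.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


precomposed = "\u00e9"   # é as one precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent
emoji = "\U0001F600"     # grinning face, outside the Basic Multilingual Plane

for sample in (precomposed, decomposed, emoji):
    print(repr(sample), length_report(sample))
```

The two spellings of é look identical on screen but report 1 vs 2 code points and 2 vs 3 bytes, and the emoji is one code point, two UTF-16 units, and four UTF-8 bytes.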

How UTF-8 encodes code points

UTF-8 is a variable-width, self-synchronizing encoding: ASCII bytes 0x00–0x7F encode themselves unchanged, which is why valid UTF-8 that contains only ASCII is also valid ASCII. Bytes with the high bit set start multibyte sequences whose leading byte tells you the total length:

  • 0xxxxxxx — 1 byte, code points U+0000..U+007F
  • 110xxxxx 10xxxxxx — 2 bytes, up to U+07FF
  • 1110xxxx 10xxxxxx 10xxxxxx — 3 bytes, up to U+FFFF
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx — 4 bytes, up to U+10FFFF
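
The bit layouts above can be turned into a hand-written encoder. The sketch below is for illustration only (real code should call str.encode); the function name encode_utf8 is hypothetical.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value using the four UTF-8 bit layouts."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")
    if code_point <= 0x7F:          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Cross-check against Python's built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```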

Continuation bytes always begin with the bits 10, so you can find a character boundary from any offset by scanning backward at most three bytes, which is useful for log tailers and network parsers that slice mid-packet. Invalid sequences (lone continuation bytes, overlong encodings that spend more bytes than a code point needs, encoded surrogates, out-of-range values) must be rejected or replaced. Permissive decoders that accept malformed UTF-8 become security footguns when combined with hash tables or path normalization, because two different byte strings can then decode to the same text.
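
Both ideas fit in a short sketch: a backward scan that relies on the 10xxxxxx continuation pattern, and a validity check that leans on Python's strict decoder. The function names are hypothetical, and start_of_char assumes the buffer is valid UTF-8 and that offset is a valid index into it.

```python
def start_of_char(data: bytes, offset: int) -> int:
    """Step backward from offset to the first byte of the sequence containing it."""
    # Continuation bytes match 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


def is_valid_utf8(data: bytes) -> bool:
    """True only if the bytes decode under strict UTF-8 rules."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


data = "a\u20acb".encode("utf-8")   # b"a\xe2\x82\xacb": the euro sign takes 3 bytes
print(start_of_char(data, 3))       # lands on index 1, the lead byte 0xE2
print(is_valid_utf8(b"\xc0\xaf"))   # overlong encoding of "/" is rejected
```

Rejecting at ingress, rather than decoding with a replacement character and carrying on, keeps byte-level and text-level comparisons in agreement downstream.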

UTF-16 still appears inside Windows APIs, Java strings, and JavaScript engines as 16-bit code units with surrogate pairs for astral planes. UTF-32 stores fixed four-byte scalars, which is simple to index but wasteful for English-heavy text. UTF-8 won on the public web and in HTTP, JSON, HTML, and Rust's default String: there are no endianness debates on the wire, Latin scripts stay compact, and C-style null-terminated tools keep working because a zero byte appears in UTF-8 only as the encoding of U+0000, never inside a multibyte sequence.

Normalization: when the same text has different bytes

Unicode allows multiple code-point sequences to render identically. The letter é can be stored as a single precomposed code point (U+00E9) or as e plus a combining acute (U+0065 U+0301). Both render as the same glyph but produce different UTF-8 byte strings, so naive string equality fails.

The standard defines normalization forms:

  • NFC (canonical composition): prefer precomposed characters where defined; typical for web output and most keyboard input.
  • NFD (canonical decomposition): split precomposed characters into base letters plus combining marks; macOS HFS+ stores filenames in a variant of NFD (APFS preserves the form it is given but compares names normalization-insensitively), and some search tokenizers decompose so they can strip accents.
  • NFKC / NFKD (compatibility forms): also fold typographic variants such as full-width digits and the ﬁ ligature; lossy and aggressive, so apply deliberately before cryptographic hashing.
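
The forms above are available in Python's standard unicodedata module. The sketch below shows the é example and the compatibility folding; canonical_equal is a hypothetical helper name, and NFC as the comparison form is an assumption taken from the advice in this article.

```python
import unicodedata

precomposed = "caf\u00e9"    # é as U+00E9
decomposed = "cafe\u0301"    # e + U+0301 combining acute


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)                 # False: different code points
print(canonical_equal(precomposed, decomposed))  # True: same text after NFC

# NFD goes the other way and splits the precomposed letter.
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Compatibility forms also fold typographic variants; NFC leaves them alone.
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"             # ﬁ ligature
assert unicodedata.normalize("NFKC", "\uff11\uff12\uff13") == "123"  # full-width digits
assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
```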

Production rule: pick one form at system boundaries. Normalize to NFC before indexing in full-text search, before comparing usernames for uniqueness, and before computing HMACs over "user-visible" strings. Store the normalized form plus the original bytes if you need lossless round-trip for legal evidence. Passwords need extra care: the same typed password can arrive precomposed from one device and decomposed from another, so hashing raw bytes can lock users out. Apply one fixed normalization consistently before hashing (NIST SP 800-63B suggests NFKC or NFKD; RFC 8265 uses NFC), and enforce a byte-length cap to prevent denial-of-service via megabyte combining-mark stacks.

BOM, endianness, and transport metadata

UTF-8 has no byte-order ambiguity and does not need a byte-order mark (BOM), but Windows Notepad historically prefixed files with EF BB BF. If your importer treats the BOM as part of the first field, CSV headers break silently. HTTP solves encoding at the protocol layer: Content-Type: text/html; charset=utf-8 and the HTML <meta charset="UTF-8"> declaration tell the browser how to decode bytes before JavaScript runs. JSON (RFC 8259) requires UTF-8, without a BOM, for text exchanged between systems outside a closed ecosystem; earlier JSON RFCs also allowed UTF-16 and UTF-32. APIs still return wrong charset headers, so client libraries should validate the bytes they receive rather than trust the label alone.
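
The CSV header failure is easy to reproduce. In the sketch below, the sample bytes and the helper name read_csv_header are invented for illustration; the utf-8-sig codec is part of Python's standard library and drops a leading BOM if one is present.

```python
import csv
import io


def read_csv_header(raw: bytes, encoding: str = "utf-8-sig") -> list:
    """Decode raw CSV bytes and return the first row as a list of field names."""
    text = raw.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


# A file saved with a UTF-8 BOM (EF BB BF) in front of the header row.
raw = b"\xef\xbb\xbfid,name\r\n1,Ana\r\n"

print(read_csv_header(raw, encoding="utf-8"))  # first field is "\ufeffid"
print(read_csv_header(raw))                    # ["id", "name"]
```

With plain utf-8 the BOM survives as U+FEFF glued to the first column name, so a lookup for "id" fails even though the header looks right when printed.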

When binary formats embed text — SQLite, Protocol Buffers, PDF — the schema must state encoding explicitly. Blockchains often store arbitrary byte blobs on-chain; off-chain indexers that display memos must not assume Latin-1. Wallets showing "sign this message" should render UTF-8 decoded text and warn when homoglyphs (Cyrillic а vs Latin a) appear in URLs or recipient names.
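
A homoglyph warning can start from a mixed-script check. Python's standard library does not expose the Unicode Script property, so the sketch below uses a rough heuristic, the first word of each letter's Unicode name, as a stand-in. That heuristic and both function names are assumptions for illustration; it will flag legitimate mixed-script text (Japanese, for example), so treat a hit as a reason to warn, not to reject.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Guess the script of each letter from the first word of its Unicode name."""
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # "" for unnamed code points
            if name:
                found.add(name.split()[0])   # e.g. "LATIN", "CYRILLIC"
    return found


def looks_mixed_script(text: str) -> bool:
    """True when letters from more than one guessed script appear together."""
    return len(scripts_in(text)) > 1


print(looks_mixed_script("paypal"))        # False: all Latin
print(looks_mixed_script("p\u0430ypal"))   # True: U+0430 is Cyrillic а
```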

Collation, case mapping, and locale

Sorting code points by numeric value does not produce the alphabetical order readers expect: uppercase Z sorts before lowercase a, and accented letters land after z. The correct order also depends on the language. Swedish sorts å after z; German phonebook order treats ö as oe. The Unicode Collation Algorithm (UCA) plus CLDR locale data powers Intl.Collator in browsers and ICU collators (or strcoll with a suitable locale) on servers. Case mapping for case-insensitive comparison is locale-sensitive too: Turkish dotted and dotless I break naive toLowerCase() rules.
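
The sketch below shows the problem rather than the fix, using only Python built-ins: a plain sort compares code points, and the default case mappings know nothing about Turkish. Proper locale-aware ordering needs ICU bindings or a configured system locale, which are outside this example. The word list is invented for illustration.

```python
words = ["zebra", "Apple", "\u00e4pfel"]   # "äpfel" with a precomposed ä

# sorted() compares code points: "A" (U+0041) < "z" (U+007A) < "ä" (U+00E4).
print(sorted(words))   # ['Apple', 'zebra', 'äpfel']

# casefold() is the locale-independent tool for caseless matching.
assert "Stra\u00dfe".casefold() == "STRASSE".casefold() == "strasse"

# Default lowercasing is not Turkish-aware:
# capital dotted İ (U+0130) becomes "i" plus a combining dot above (U+0307),
assert "\u0130".lower() == "i\u0307"
# and capital I becomes dotted "i", where Turkish expects dotless ı (U+0131).
assert "I".lower() == "i"
```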

For identifiers (slugs, package names), stick to ASCII subsets like [a-z0-9_-] and reject everything else at validation time — simpler than full Unicode case folding. For display names and search, invest in proper locale-aware collation and NFC normalization upstream.
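
An ASCII-only identifier check can be as small as one regular expression. The sketch below assumes the [a-z0-9_-] alphabet mentioned above and a non-empty identifier; the function name is hypothetical. It uses re.fullmatch because a pattern ending in $ would also accept a trailing newline.

```python
import re

# Lowercase ASCII letters, digits, underscore, and hyphen only.
_SLUG = re.compile(r"[a-z0-9_-]+")


def is_valid_slug(candidate: str) -> bool:
    """Accept only non-empty identifiers made entirely of the allowed ASCII set."""
    return _SLUG.fullmatch(candidate) is not None


print(is_valid_slug("my-package_2"))   # True
print(is_valid_slug("caf\u00e9"))      # False: é is outside the ASCII subset
print(is_valid_slug("abc\n"))          # False: fullmatch rejects the newline
```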

Common production bugs (and fixes)

  • Truncating by byte index — slicing UTF-8 at 255 bytes can split a multibyte character; decode to code points or graphemes first, then truncate, then re-encode.
  • Regex with byte semantics: the dot (.) may match half an emoji; use Unicode-aware flags (/u in JavaScript, (?u) in some engines) or dedicated libraries.
  • Database charset mismatches — tables created as latin1 with connection UTF-8 corrupt stored emoji; migrate to utf8mb4 in MySQL or native UTF-8 in Postgres.
  • Filename NFC vs NFD — zip archives built on macOS (NFD) vs Linux servers (often NFC) break deduplication; normalize on ingest.
  • Substring search without normalization: users type café with a precomposed é but the document stores it as e plus a combining accent; normalize both sides or use a search engine with an analyzer.
  • Emoji in passwords — valid Unicode, but mobile keyboards change emoji presentation; prefer passphrases of words if emoji cause support burden.
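
For the byte-truncation bug, one alternative to decoding first is to cut at the byte limit and then back up to the nearest sequence boundary. The sketch below takes that approach; truncate_utf8 is a hypothetical name and max_bytes is assumed to be non-negative. It never splits a code point, but it can still split a grapheme cluster (for example, by dropping a combining accent from its base letter), so user-facing limits still need a grapheme-aware step.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 form fits in max_bytes."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a sequence, so move
    # left until the cut sits just before a lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


print(truncate_utf8("a\u00e9", 2))                  # "a": é needs 2 bytes and does not fit
print(truncate_utf8("\U0001F600\U0001F600", 5))     # one emoji (4 bytes), not 1.25 of them
```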

Practical checklist

  • Declare UTF-8 at every boundary: HTTP headers, HTML meta, database connection, file writes.
  • Normalize to NFC for equality, uniqueness checks, and search indexes unless you have a specific reason for NFD.
  • Measure limits in graphemes or code points for UX; measure storage in bytes for capacity planning.
  • Reject invalid UTF-8 at ingress; do not round-trip through Latin-1 "fixes."
  • Use locale-aware collation for sorted lists shown to humans; use raw byte comparison only for cryptographic identifiers.
  • Test with combining accents, ZWJ family emoji, RTL Arabic, and mixed scripts — not just ASCII fixtures.
