LLM tokenization explained: BPE, tokens, and why context costs what it costs

Before a large language model reads your prompt, a tokenizer splits the text into integer IDs called tokens. API pricing, context-window limits, latency, and even model behavior all run on those token counts — not on words, characters, or pages. Two sentences that look equally short to a human can differ by 30% in tokens. Code blocks, JSON, and non-English text often balloon the count. This guide explains how modern tokenizers work (byte-pair encoding and relatives), why token boundaries surprise people, how to estimate counts before you send a request, and what tokenization means for context windows, RAG chunking, and inference cost.

What tokens are — the bridge between text and numbers

Neural networks operate on fixed-size vectors of numbers. A tokenizer is a deterministic function that maps text to a sequence of integers, each indexing an entry in a vocabulary (often 32,000 to 256,000 symbols depending on the model family). The model never sees the raw string hello; it sees something like token ID 15339, which the embedding layer converts into a high-dimensional vector.
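To make the ID-to-vector step concrete, here is a minimal sketch. The four-entry vocabulary, the 3-dimensional embedding rows, and the function name lookup are all made up for illustration; real models learn far larger tables.

```python
# Hypothetical four-entry vocabulary; real ones hold tens of thousands of pieces.
vocab = {"hello": 0, " world": 1, "!": 2, "<unk>": 3}

# Toy embedding table: one 3-dimensional row per vocabulary entry.
# Real models learn these values and use hundreds or thousands of dimensions.
embeddings = [
    [0.1, 0.3, -0.2],
    [0.7, -0.5, 0.0],
    [-0.4, 0.2, 0.9],
    [0.0, 0.0, 0.0],
]


def lookup(pieces: list[str]) -> list[list[float]]:
    """Map text pieces to IDs, then IDs to embedding rows."""
    ids = [vocab.get(p, vocab["<unk>"]) for p in pieces]
    return [embeddings[i] for i in ids]


vectors = lookup(["hello", " world", "!"])
print(vectors[0])  # the row stored for ID 0
```

The model only ever receives the rows; the strings on the left of the vocabulary exist solely inside the tokenizer.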

Decoding reverses the process: integer sequence back to UTF-8 text. Well-designed schemes are lossless, so arbitrary Unicode round-trips exactly, but the boundaries are arbitrary from a human perspective. The word “tokenization” might be one token in a mature English-heavy vocabulary or two tokens (token + ization) in another. That split changes how many positions the transformer must attend over, which directly affects speed and memory.

Different model families ship different tokenizers. GPT-style models use byte-level byte-pair encoding (BPE); OpenAI distributes its encodings through the tiktoken library. Llama 1 and 2 and many other open-weight models use SentencePiece vocabularies (BPE or unigram) trained on broader multilingual corpora, while Llama 3 moved to a tiktoken-style BPE vocabulary. You cannot mix tokenizers: a prompt tokenized with the wrong vocabulary yields IDs that point at the wrong embedding rows, so the model receives garbage even if the character string is identical.

Byte-pair encoding (BPE) in plain language

Byte-pair encoding starts from bytes (or characters) and iteratively merges the most frequent adjacent pair into a new vocabulary entry. Early merges capture common letter pairs like “th” and “er”; later merges capture whole words like “the” and subwords like “ization”. Training stops when the vocabulary reaches a target size, for example the 50,257 entries of the GPT-2 and GPT-3 tokenizer.
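The training loop described above can be sketched in a few lines. This is a toy, character-level version under simplifying assumptions (whitespace pre-splitting, an arbitrary tie-break rule, a hypothetical function name train_bpe); production trainers work on bytes and much larger corpora.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merges from text, starting from single characters."""
    # Each word is a tuple of symbols; count how often each word occurs.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by pair order (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


print(train_bpe("the then there other", num_merges=3))
# [('t', 'h'), ('th', 'e'), ('the', 'r')]
```

Even on four words, the letter pair comes first and the whole word "the" emerges on the second merge, which mirrors the early-pairs, later-words progression.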

At inference time, the tokenizer replays the merge table deterministically: it splits the input into base symbols, then repeatedly merges the adjacent pair that was learned earliest (the highest-priority merge) until no learned pair remains, and finally emits the token IDs. Rare words decompose into several subword pieces; frequent words become single tokens. This balances vocabulary size against coverage: you do not need a dictionary entry for every possible English word, yet common terms stay compact.
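Applying a learned merge table at inference time might look like the sketch below. The merge list is a hypothetical three-entry table and apply_merges is an illustrative name; real implementations add caching, byte-level handling, and a final lookup from pieces to IDs.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split word into characters, then apply merges in priority order."""
    rank = {pair: i for i, pair in enumerate(merges)}  # earlier merge = higher priority
    symbols = list(word)
    while len(symbols) > 1:
        # Collect every adjacent pair that appears in the merge table.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best-ranked pair, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


merges = [("t", "h"), ("th", "e"), ("e", "r")]  # hypothetical learned table
print(apply_merges("the", merges))      # ['the']
print(apply_merges("thermal", merges))  # ['the', 'r', 'm', 'a', 'l']
```

The frequent word collapses to one piece, while the rarer word falls back to smaller pieces and single characters.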

Byte-level BPE (used in GPT-2 and descendants) operates on UTF-8 bytes with a base vocabulary of 256 byte values plus special tokens. Any Unicode character — emoji, CJK ideographs, rare symbols — can always be represented, sometimes as multiple byte tokens. That avoids the “unknown token” problem older word-piece models had, at the cost of more tokens for scripts underrepresented in training data.
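The byte-level fallback is easy to observe with the standard library alone. The sketch below shows only the base layer (one ID per UTF-8 byte, before any merges), so the counts are an upper bound on what a trained byte-level vocabulary would emit; utf8_byte_ids is a hypothetical helper name.

```python
def utf8_byte_ids(text: str) -> list[int]:
    """Base byte-level tokens: one ID (0-255) per UTF-8 byte, before any merges."""
    return list(text.encode("utf-8"))


# Latin letter, accented letter, CJK ideograph, emoji.
for ch in ["a", "é", "日", "👋"]:
    ids = utf8_byte_ids(ch)
    print(f"U+{ord(ch):04X}", len(ids), ids)

# Decoding the bytes restores the original string exactly, so nothing is "unknown".
sample = "日本語 👋"
assert bytes(utf8_byte_ids(sample)).decode("utf-8") == sample
```

The four characters need 1, 2, 3, and 4 bytes respectively, which is why scripts with few learned merges cost several tokens per character.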

SentencePiece and unigram language models

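Google’s SentencePiece trains directly on raw text without pre-tokenizing on spaces, which helps languages without clear word boundaries (Japanese, Chinese). It supports both BPE and a unigram language model; the unigram variant starts with a large candidate subword set and prunes the pieces that contribute least to corpus likelihood. T5 and XLNet ship unigram vocabularies, whereas Llama 1 and 2 use SentencePiece in BPE mode and Llama 3 switched to a tiktoken-style BPE vocabulary. Behavior differs from GPT BPE: the same English sentence may tokenize into a slightly different number of IDs, and vocabularies trained on multilingual corpora often need fewer tokens per character for non-English text than English-centric BPE tables do.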
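Once a unigram vocabulary is trained, segmenting text means finding the split with the highest total piece probability. The sketch below shows that search with a small dynamic program; the piece scores are invented, the function name is hypothetical, and it covers only segmentation, not the pruning-based training.

```python
import math


def unigram_segment(text: str, logprob: dict[str, float]) -> list[str]:
    """Return the segmentation of text with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprob and best[start] + logprob[piece] > best[end]:
                best[end] = best[start] + logprob[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Hypothetical piece scores (log-probabilities); a trained model would supply these.
scores = {"token": -2.0, "ization": -3.0, "tok": -4.0, "en": -3.5,
          "iz": -5.0, "ation": -3.0, "t": -6.0, "o": -6.0, "k": -6.0}
print(unigram_segment("tokenization", scores))  # ['token', 'ization']
```

Unlike BPE, nothing here depends on merge order: the split is chosen globally, so changing one piece's score can change the whole segmentation.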

Why token counts are not word counts

Rule of thumb for English prose with GPT-family tokenizers: ~4 characters per token, or roughly 0.75 words per token — but rules of thumb fail often. Whitespace, punctuation, and capitalization change splits. Compare:

ChatGPT: a single token only if the string was frequent enough in the tokenizer’s training data; otherwise a few subword pieces.
Chat GPT: typically two or three tokens, because the space blocks the merge.
12345678901234567890: many tokens, since long digit strings split into short groups of digits.
Indented JSON with long keys: token-heavy; minified JSON is cheaper.
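The JSON item in the list above can be checked without any tokenizer by comparing serialized sizes. The sketch below uses the 4-characters-per-token rule of thumb purely as a rough proxy; the payload and the estimate_tokens helper are illustrative, and the estimate is especially loose for structured data.

```python
import json
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate from the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


# Made-up payload with long keys and nesting.
payload = {
    "customer_identifier": 1842,
    "shipping_address": {"city": "Oslo", "postal_code": "0150"},
}

pretty = json.dumps(payload, indent=2)
minified = json.dumps(payload, separators=(",", ":"))

print(len(pretty), estimate_tokens(pretty))
print(len(minified), estimate_tokens(minified))
```

Both strings parse to the same object, so the difference is pure formatting overhead. For a production budget, replace the heuristic with the target model's own tokenizer.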

Code is notoriously expensive: operators, brackets, and camelCase identifiers rarely appear as single merges unless the training corpus included lots of similar code. A 200-line Python file can consume thousands of tokens, often more than prose of the same character count. When budgeting context, measure the actual file; do not assume parity with natural language.

Multilingual text illustrates training-data bias. English and major European languages often compress well. Languages with different scripts may require several tokens per character if the tokenizer saw them rarely during training. Developers building global products should benchmark token counts per locale, not extrapolate from English-only tests.

Special tokens and chat templates

Vocabularies reserve IDs for special tokens that never appear in ordinary text: end-of-sequence markers, padding, mask tokens for training, and — in chat models — role delimiters like <|im_start|>user. Instruction-tuned models wrap your message in a chat template before tokenization. Those wrapper strings consume tokens you did not type; ignoring them causes off-by-dozens errors when filling a context window to the brim.
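To see how much a wrapper adds, the sketch below renders two short messages in a ChatML-style layout. The template is illustrative only and does not reproduce any specific model's canonical format; render_chat is a hypothetical helper, and character counts stand in for token counts.

```python
def render_chat(messages: list[dict[str, str]]) -> str:
    """Wrap messages in a ChatML-style template (illustrative, not a model's exact format)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi"},
]
typed = sum(len(m["content"]) for m in messages)
rendered = render_chat(messages)
print(rendered)
print(f"typed characters: {typed}, rendered characters: {len(rendered)}")
```

In a real tokenizer each delimiter is usually a single special token, so the overhead is a handful of tokens per message rather than dozens of characters, but it still has to be counted.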

Fine-tuning pipelines must use the same template and special tokens as inference. A mismatch between training formatting and deployment formatting is a common source of degraded fine-tuned model quality even when loss curves looked healthy. Hugging Face model cards usually document the canonical template — copy it verbatim into production.

Counting tokens before you pay

Hosted APIs bill per input and output token. On self-hosted GPUs, KV-cache memory grows with sequence length times batch size. Practical counting approaches:

Provider SDKs — OpenAI, Anthropic, and others expose token-count endpoints or client helpers that run the official tokenizer.
tiktoken — Python library with encoding tables matching GPT-3.5/4 families; call encoding.encode(text) for an exact list of IDs.
Hugging Face AutoTokenizer — load the model’s tokenizer by name; essential for Llama, Mistral, and open weights.
Online counters — quick sanity checks; verify they use the right model encoding before trusting a production budget.

Count both prompt and expected completion when estimating cost. Output tokens are often pricier than input on commercial APIs. For streaming UIs, partial output still accumulates billable tokens. Batch jobs should log token usage per request — it is the ground truth for unit economics, more reliable than character heuristics.
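The cost arithmetic is simple enough to keep in a helper. The prices below are hypothetical placeholders, not any provider's actual rates, and request_cost is an illustrative name.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD for one request, given per-million-token prices."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Hypothetical prices: $2.00 per million input tokens, $8.00 per million output tokens.
cost = request_cost(input_tokens=3_000, output_tokens=500,
                    usd_per_m_input=2.00, usd_per_m_output=8.00)
print(f"${cost:.4f} per request")                   # $0.0100 per request
print(f"${cost * 100_000:,.2f} per 100k requests")  # $1,000.00 per 100k requests
```

With these example rates, the 500 output tokens account for 40% of the bill despite being a seventh of the total volume, which is why completion length deserves its own budget.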

Tokenization and the context window

A model advertised with a 128K context window means 128,000 tokenizer positions — not 128,000 words. Long documents, tool-return payloads, and retrieved RAG passages all compete for the same budget. Once you approach the limit, you must truncate, summarize, or retrieve more selectively; see our context windows guide for sliding-window attention, prompt caching, and summarization strategies.
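One way to reason about the shared budget is to subtract the fixed parts first and fill the rest with retrieved passages in ranked order. This is a minimal sketch with invented numbers; fit_passages and its greedy policy are assumptions, not a recommended retrieval strategy.

```python
def fit_passages(passage_tokens: list[int], context_limit: int,
                 fixed_tokens: int, reserved_output: int) -> list[int]:
    """Return indices of passages (in ranked order) that fit the remaining budget."""
    remaining = context_limit - fixed_tokens - reserved_output
    chosen = []
    for i, n in enumerate(passage_tokens):
        if n <= remaining:  # skip passages that no longer fit
            chosen.append(i)
            remaining -= n
    return chosen


# Hypothetical numbers: 8,000-token window, 1,200 tokens of system prompt and
# tool schemas, 1,000 tokens reserved for the answer.
print(fit_passages([2500, 2600, 900, 700], context_limit=8000,
                   fixed_tokens=1200, reserved_output=1000))  # [0, 1, 3]
```

Reserving output tokens up front matters because prompt and completion share the same window on most models.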

Attention compute scales quadratically with sequence length in the full-attention layers of a vanilla transformer. More tokens mean slower prefill and a larger KV cache during decode, which becomes the dominant memory consumer at long contexts. That is why quantized KV caches and architectural shortcuts (grouped-query attention, sliding windows) matter as much as raw parameter count.
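A back-of-the-envelope KV-cache estimate shows why grouped-query attention helps. The model shape below (32 layers, head dimension 128, 16-bit cache values) is a hypothetical configuration, and the formula ignores framework overhead and paging.

```python
def kv_cache_bytes(seq_len: int, batch: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values (factor 2) stored per layer, per KV head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical shape; only the number of KV heads differs between the two calls.
full_mha = kv_cache_bytes(seq_len=32_768, batch=1, layers=32, kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(seq_len=32_768, batch=1, layers=32, kv_heads=8, head_dim=128)
print(full_mha / 2**30, "GiB with 32 KV heads")  # 16.0 GiB with 32 KV heads
print(gqa / 2**30, "GiB with 8 KV heads")        # 4.0 GiB with 8 KV heads

# Full attention scores every position against every other, so doubling the
# sequence length quadruples the number of score entries per head.
print((2 * 4096) ** 2 // 4096 ** 2)  # 4
```

The cache grows linearly with tokens while attention scores grow quadratically, so long prompts hit both memory and prefill time.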

Implications for RAG chunking

Retrieval systems split documents into chunks before embedding. Chunk size is usually specified in tokens (256–512 is common) because the downstream LLM consumes tokens, not characters. Too-small chunks lose semantic context; too-large chunks reduce retrieval precision and burn context on irrelevant paragraphs. Token-aware splitters cut only at token boundaries, so no chunk starts or ends mid-subword with a fragment that embeds poorly. Align chunk size with your embedding model’s input limit and with the share of the reader LLM’s context you reserve for retrieved passages.
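A token-aware splitter can be as simple as windowing over the ID sequence. The sketch below assumes the IDs already come from the right tokenizer (plain integers stand in here) and that each window is decoded back to text with that same tokenizer; chunk_token_ids and the overlap parameter are illustrative choices.

```python
def chunk_token_ids(ids: list[int], chunk_size: int, overlap: int = 0) -> list[list[int]]:
    """Split a token ID sequence into windows of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + chunk_size])
        if start + chunk_size >= len(ids):
            break  # the last window already reaches the end
    return chunks


# Stand-in IDs; in practice these come from the reader model's tokenizer.
ids = list(range(10))
print(chunk_token_ids(ids, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A small overlap keeps sentences that straddle a boundary retrievable from either side, at the price of indexing some tokens twice.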

Tokenizer quirks that affect product behavior

Leading/trailing spaces: Many tokenizers attach a leading space to the token that follows it, so “ hello” (with a leading space) and “hello” map to different token IDs. Prompt libraries that trim aggressively can unintentionally change token IDs and model behavior.

Reversible vs display form: Raw vocabulary pieces carry spacing markers rather than literal spaces. GPT-2-style byte-level BPE renders a leading space as “Ġ”, and SentencePiece marks word starts with “▁”. Never show raw token strings to users; pass the IDs through the tokenizer’s own decode step, and have debugging UIs render the merged string.

Security: Tokenization is not encryption. Splitting a secret string across tokens does not hide it from the model or from log retention. Treat prompts with the same secrecy policy regardless of how they tokenize; see prompt injection for adversarial patterns that exploit tool access, not tokenizer edge cases.

Evaluation fairness: Benchmarks that cap “length” in words disadvantage token-heavy inputs. When comparing models, normalize by tokenizer output length or use the same tokenizer for all candidates; our LLM evaluation guide covers dataset design in more depth.

Practical checklist for builders

Match the tokenizer to the model weights — never assume GPT encodings for Llama weights.
Measure real prompts — include system messages, tool schemas, and chat wrappers in the count.
Tokenize code and JSON separately — structured payloads often dominate cost; compress keys and strip whitespace where safe.
Size RAG chunks in tokens — use tokenizer-aware splitters; validate retrieval quality when you change chunk boundaries.
Log usage in production — input/output token totals per request for billing reconciliation and anomaly detection.
Benchmark non-English locales — do not ship global features sized only on English token averages.

Tokenization sits below most application code, but it is the meter on every LLM invoice and the ruler on every context limit. Understanding BPE and its cousins turns mysterious “context exceeded” errors into predictable engineering trade-offs you can design around.