Guide

LLM tokenizer explained

Harbor Analytics routed support tickets through a three-model cascade — a small classifier, a mid-size resolver, and a 70B escalation path — and assumed prompts were comparable because every stage received the same UTF-8 string. In production logs, the classifier averaged 412 tokens per ticket while the escalation model averaged 318 tokens on identical text. Monthly spend overshot the budget by 19% before anyone noticed the classifier used a GPT-style byte-level BPE tokenizer with aggressive punctuation splits, while the escalation model used a Llama SentencePiece vocabulary that merged common support phrases into single tokens.

A tokenizer is the deterministic front door of every LLM pipeline: it maps raw text to integer token IDs the model embeds and attends over. Tokenization choices affect context-window headroom, inference cost, multilingual quality, and whether two models can share cached prefixes. This guide covers major algorithms (BPE, WordPiece, SentencePiece), vocabulary sizing tradeoffs, special tokens, multilingual fragmentation, a Harbor Analytics cost-router refactor, a technique decision table versus naive character counting, pitfalls, and a production checklist.

What tokenization does in the LLM stack

Neural language models do not read characters or words directly. The tokenizer performs a reversible (in practice, mostly reversible) encoding:

  1. Normalize text — Unicode NFC, lowercasing (sometimes), whitespace cleanup, optional byte fallback for unknown characters.
  2. Segment into subword units using a fixed vocabulary learned during tokenizer training.
  3. Map each subword to an integer ID; prepend/append special tokens (<|begin_of_text|>, <|end_of_text|>, role markers, tool-call delimiters).

The embedding layer looks up each ID; attention runs over the resulting sequence. Every downstream concern — token billing, KV cache size, truncation, and distillation logit alignment — depends on this encoding. Swapping models without swapping tokenizers is safe only when you re-count tokens and re-validate prompt templates.

Major tokenizer families

Byte Pair Encoding (BPE)

BPE starts from bytes or characters and iteratively merges the most frequent adjacent pairs until the vocabulary reaches a target size (often 32k–100k). GPT-2, GPT-3, GPT-4, and Codex families use byte-level BPE: the base alphabet is 256 bytes, so any Unicode string encodes without an unknown-token hole. Rare characters may consume several tokens (e.g. emoji or CJK in English-centric vocabularies), which inflates non-Latin prompts.

WordPiece

WordPiece (BERT, early T5) merges pairs that maximize language-model likelihood gain rather than raw frequency. Tokens often carry a ## prefix for continuations inside a word. WordPiece vocabularies are efficient for the training corpus language but fragment aggressively on out-of-domain technical strings (JSON keys, base64, wallet addresses).

SentencePiece (Unigram and BPE variants)

SentencePiece trains directly on raw UTF-8 without pre-tokenizing on spaces. Llama, Mistral, Gemma, and many open-weight models use it. The unigram variant starts with a large candidate set and prunes tokens by loss; the BPE variant inside SentencePiece behaves like standard BPE but with consistent handling of whitespace (often marks word boundaries). Because there is no “unknown” byte hole, code and mixed-language prompts tend to be more stable than WordPiece for the same vocabulary size.

Character- and word-level (legacy)

Character-level models exist in research but are rare in production LLMs because sequence length explodes. Word-level models cannot generalize to new words. Modern stacks overwhelmingly use subword tokenization.

Vocabulary size and compression tradeoffs

Vocabulary size |V| trades off sequence length against embedding and softmax cost:

  • Larger vocab — fewer tokens per document, shorter sequences, lower attention cost per character, but bigger embedding tables and more expensive final softmax during training.
  • Smaller vocab — longer sequences, more attention steps, but cheaper output layers and sometimes better morphological generalization on rare words.

Production models cluster around 32k (early GPT), 50k–65k (Llama 2/3), and 100k+ (GPT-4 class). For RAG pipelines, a 20% difference in tokens per chunk changes how many documents fit under a context cap and directly shifts embedding index storage when you store token counts for budgeting.

Compression ratio (characters per token) on English prose is often 3.5–4.5 for modern subword tokenizers. Code, JSON, and URLs compress worse — sometimes under 2 characters per token — because rare symbols split into byte-level pieces.

Special tokens and chat templates

Beyond subwords, vocabularies reserve IDs for control symbols:

  • Boundary tokens — BOS, EOS, PAD; define sequence start, end, and batch padding.
  • Mask / fill tokens — used in BERT-style denoising and some infilling models.
  • Role and tool tokens<|user|>, <|assistant|>, <|tool_call|> in instruction-tuned models; must be emitted exactly as the fine-tuning template expects.
  • Reserved / unused slots — some checkpoints leave gaps for later vocabulary expansion.

Chat templates (Jinja strings in Hugging Face tokenizer_config.json) interleave these special tokens with user content. A template mismatch — wrong newline, missing role header — does not change the UTF-8 string you log but can change token IDs and model behavior. Always tokenize through the model’s official template, not a hand-built string.

Multilingual and domain-specific fragmentation

Tokenizers trained primarily on English allocate short encodings to common English morphemes. The same byte length in Japanese, Arabic, or Hindi often maps to 2–4x more tokens. Effects:

  • Context shrink — a “32k context” window holds less actual content in CJK than in English.
  • Cost skew — per-token billing penalizes languages the vocabulary under-represents.
  • Quality skew — more tokens per morpheme means more autoregressive steps and higher error rates on equal byte budgets.

Mitigations include language-specific models, vocabularies trained on balanced multilingual corpora, and query routing that detects script and picks a tokenizer-friendly model path. For cross-lingual RAG, see multilingual RAG patterns that account for per-language token inflation in chunk sizing.

Counting tokens in production

Never estimate LLM cost from word count or character count alone. Use the model’s official tokenizer:

  • OpenAI / tiktokentiktoken.encoding_for_model() for GPT-family models; encoding names like cl100k_base and o200k_base are not interchangeable.
  • Hugging FaceAutoTokenizer.from_pretrained() then tokenizer.encode(text, add_special_tokens=True).
  • API providers — many expose a /tokenize debug endpoint; use it when building pre-flight budget gates.

Pre-flight counting should run on the fully rendered prompt including system message, retrieved RAG chunks, JSON schema instructions, and tool definitions. Harbor Analytics found that tool schema blocks alone added 800–1,200 tokens — invisible if you count only the user question string.

Harbor Analytics cost-router refactor (worked example)

Problem: The support cascade billed per input token per model tier, but routing logic used character-length heuristics. Escalation thresholds were wrong for JSON-heavy tickets and Japanese tickets.

  1. Instrument — log token_count, tokenizer_name, and chars_per_token on every request for two weeks.
  2. Discover skew — classifier path averaged 3.1 chars/token on English but 1.4 on Japanese; escalation model held 3.8 and 2.6 respectively.
  3. Normalize budgets — replace char limits with per-model token budgets computed via each model’s tokenizer at route time.
  4. Align templates — unify ticket serialization to a compact JSON line format that tokenizes efficiently across all three vocabularies (removed pretty-print indentation).
  5. Cache keys — prefix cache hashes now use tokenizer-specific byte sequences, not raw string hashes, so prompt cache hits stay valid per model.

Outcome: 17% reduction in total input tokens, escalation rate unchanged, no accuracy regression on the holdout set. The fix was tokenizer awareness, not a smarter classifier.

Technique decision table

Goal Prefer Why not the alternative
Estimate API cost before sending Model-specific tokenizer encode() Word/char heuristics drift 30%+ on code and CJK
Maximize English prose throughput Large-vocab BPE / SentencePiece model Character-level sequences are too long
Handle arbitrary Unicode without UNK Byte-level BPE WordPiece may map rare chars to UNK
Share tokenizer between teacher and student Same model family / merged vocab Cross-vocab distillation breaks logit alignment
Balanced multilingual SaaS Multilingual-trained vocab or per-locale routing English-centric 32k vocab inflates non-Latin bills
Stable encoding for log diffs Pin tokenizer version in model manifest Provider silent vocab updates change token counts
Debug prompt bloat Token-level visualization (color by token) Reading raw JSON hides merge boundaries

Common pitfalls

  • Wrong encoding namecl100k_base vs o200k_base off by dozens of tokens on the same string.
  • Forgetting special tokensadd_special_tokens=False undercounts chat prompts by 10–40 tokens.
  • Tokenizer mismatch in distillation — teacher and student vocabularies must align for logit-level training.
  • Pretty-printed JSON in prompts — whitespace and newlines each consume tokens; minify machine-readable blocks.
  • Assuming equal context across languages — a 8k Japanese document may exceed a 32k window after tokenization.
  • Hardcoded string splits — splitting on spaces fails for languages without whitespace delimiters.
  • Ignoring vocabulary updates — fine-tunes that add tokens require embedding resizing and template updates.
  • Reusing cache across models — different token IDs for the same UTF-8 string invalidate KV reuse.

Production checklist

  • Pin tokenizer name, revision, and chat template in the model deployment manifest.
  • Pre-flight count tokens with the target model’s tokenizer on the full rendered prompt.
  • Log tokens in, tokens out, and chars-per-token by language/script for cost dashboards.
  • Minify JSON, XML, and tool schemas embedded in prompts; avoid decorative whitespace.
  • Test escalation thresholds on non-English and code-heavy samples, not English prose only.
  • Validate chat templates against golden token ID sequences after every model upgrade.
  • Size RAG chunks by tokenizer count, not character count, per target model.
  • When distilling or merging models, confirm vocabulary compatibility first.
  • Expose a debug tokenize endpoint for support engineers investigating bloat.
  • Alert when chars-per-token drops below 2.0 on production traffic (likely binary or CJK skew).

Key takeaways

  • Tokenizers define the real input to an LLM; identical UTF-8 strings can produce very different token counts across models.
  • Byte-level BPE, WordPiece, and SentencePiece each trade vocabulary size, multilingual coverage, and sequence length differently.
  • Production cost and context budgeting require model-specific token counting on fully rendered prompts.
  • English-centric vocabularies inflate tokens and cost for CJK, Arabic, and dense structured data.
  • Harbor Analytics cut input tokens 17% by routing on per-model tokenizer counts instead of character heuristics.

Related reading