LLM grammar-constrained decoding explained
Harbor Commerce's voice-order assistant transcribed “two large oat lattes and a blueberry muffin” into a cart JSON blob the fulfillment API could ingest. Prompt instructions demanded valid JSON; a repair loop re-prompted on parse failure. In production, 6.8% of orders still failed validation: trailing commas after the last line item, quantity emitted as the string "two" instead of 2, and enum values like "large" when the schema required "L". Each repair added 1.2 seconds and doubled token cost on the failure tail. The refactor moved constraint enforcement from post-generation parsing to grammar-constrained decoding: at every sampling step, illegal tokens are masked before the model can choose them. Parse failure rate dropped to 0.04%; p95 latency on the order path fell 41%. Grammar constraints guarantee syntax, not factual correctness, but syntax errors are what crash typed pipelines before business logic runs.
Constrained decoding sits one layer below structured outputs APIs and function calling: instead of hoping the model follows a schema, you compile the schema into a grammar (finite-state machine, context-free grammar, or JSON automaton) and filter the vocabulary at each token. This guide covers how logit masking works, grammar formats (JSON Schema, GBNF, regex), serving integrations (Outlines, llguidance, XGrammar), a comparison with prompt-only JSON and repair loops, the Harbor Commerce refactor, tokenizer alignment, a technique decision table, pitfalls, and a production checklist.
What grammar-constrained decoding is
Grammar-constrained decoding (also called constrained generation or guided decoding) restricts which tokens an LLM may emit at each step of autoregressive sampling. After the model produces logits (unnormalized scores for every vocabulary token), a constraint engine computes the set of legal next tokens given the partial output so far and a declared grammar. All illegal logits are set to negative infinity (or masked out); sampling or greedy argmax proceeds only over legal tokens.
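The masking step itself is small. The sketch below is illustrative only: the four-token vocabulary, the logit values, and the `masked_softmax` helper are made up for the example and do not come from any particular library.

```python
import math

def masked_softmax(logits, allowed):
    """Return sampling probabilities after masking every token not in `allowed`."""
    if not allowed:
        raise ValueError("at least one token must be legal")
    # Illegal tokens get -inf, so exp(-inf) == 0.0 and they can never be sampled.
    masked = [x if i in allowed else float("-inf") for i, x in enumerate(logits)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary; the model's favourite token (" two") is illegal here.
vocab = ["2", " two", '"', "}"]
logits = [1.0, 3.0, 0.5, -1.0]
probs = masked_softmax(logits, allowed={0, 3})
print(dict(zip(vocab, (round(p, 2) for p in probs))))
# Only "2" and "}" keep probability mass (roughly 0.88 and 0.12).
```

Because the remaining probabilities are renormalized, temperature and top-p still behave as usual over the legal subset.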
The result is output that is guaranteed to match the grammar by construction — valid JSON, a specific regex, a SQL subset, or a custom DSL. This differs from:
- Prompt instructions — probabilistic compliance; no hard guarantee.
- Post-hoc validation + repair — detects errors after generation; costs extra round-trips.
- Provider structured-output modes — often implement constrained decoding server-side; you may not see the masking layer.
Constrained decoding does not prevent hallucinated field values inside valid syntax. A JSON object with a fabricated SKU is still syntactically valid. Pair constraints with retrieval, tool calls, and verification where facts matter.
How logit masking works step by step
- Compile grammar: JSON Schema, GBNF, or regex is transformed into an automaton (FSM for regex; pushdown or specialized JSON automaton for nested structures).
- Track parser state: after each emitted token, update the automaton, e.g., inside a string, after a colon, expecting a number.
- Compute allowed token set: for the current state, which vocabulary tokens continue a legal parse? Tokenizers complicate this: a single token can cover several characters and straddle a grammar boundary, and with byte-level BPE one character may span multiple tokens, so engines precompute token-prefix trees.
- Mask and sample: set illegal logits to negative infinity (zeroing them would leave those tokens sampleable); apply temperature, top-p, or greedy selection on the remainder.
- Repeat until the automaton reaches an accepting state (complete JSON object, end of regex match, etc.).
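The steps above can be sketched as a toy decoding loop. Everything here is an assumption made for illustration: the grammar is the date pattern YYYY-MM-DD, the automaton state is simply the number of characters accepted so far, and `fake_logits` stands in for a real model forward pass with fixed scores that deliberately favour illegal tokens.

```python
import re

DATE_LEN = 10            # length of YYYY-MM-DD
DASH_POSITIONS = {4, 7}  # character offsets that must hold "-"

def advance(state, token):
    """Feed `token` character by character; return the new state, or None if illegal."""
    if not token:
        return None                      # an empty token would never make progress
    for ch in token:
        if state >= DATE_LEN:
            return None                  # token overshoots the pattern
        ok = (ch == "-") if state in DASH_POSITIONS else (ch in "0123456789")
        if not ok:
            return None
        state += 1
    return state

def fake_logits(vocab):
    """Hypothetical stand-in for a model: fixed scores, highest for illegal tokens."""
    scores = {"two": 9.0, " ": 8.0, "2024": 5.0, "-0": 4.0, "-": 3.0,
              "5": 2.5, "17": 2.0, "1": 1.0, "0": 0.5, "2": 0.2}
    return [scores[t] for t in vocab]

def constrained_greedy_decode(vocab):
    state, text = 0, ""
    while state < DATE_LEN:              # DATE_LEN is the accepting state
        logits = fake_logits(vocab)
        # Mask: illegal tokens get -inf so argmax can never select them.
        masked = [
            score if advance(state, tok) is not None else float("-inf")
            for tok, score in zip(vocab, logits)
        ]
        best = max(range(len(vocab)), key=masked.__getitem__)
        if masked[best] == float("-inf"):
            raise RuntimeError("no legal token continues the grammar")
        text += vocab[best]
        state = advance(state, vocab[best])
    return text

VOCAB = ["two", " ", "2024", "-0", "-", "5", "17", "1", "0", "2"]
result = constrained_greedy_decode(VOCAB)
print(result, bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", result)))
```

Note that multi-character tokens such as "2024" and "-0" are checked character by character, which is the same reason real engines need token-level lookahead rather than a per-character mask.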
Overhead is usually modest for JSON and regex (microseconds to low milliseconds per step on CPU). Deeply nested grammars or very large vocabularies can add measurable latency; profile on your target hardware and model tokenizer.
Grammar formats and tools
JSON Schema and typed objects
The most common production use case: force output to match a JSON Schema with required keys, enum values, numeric types, and array bounds. Libraries compile the schema into token masks so the model cannot emit "two" where an integer is required or invent enum strings outside the allowed set.
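To make the failure modes concrete, here is a sketch of a cart schema and a hand-rolled check for the three errors a schema-derived mask rules out. The field names, the SKU value, and the `classify` helper are hypothetical; this is not Harbor Commerce's actual schema, and `classify` is not a full JSON Schema validator.

```python
import json

# Hypothetical cart schema; field names are illustrative only.
CART_SCHEMA = {
    "type": "object",
    "required": ["items"],
    "additionalProperties": False,
    "properties": {
        "items": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["sku", "quantity", "size"],
                "additionalProperties": False,
                "properties": {
                    "sku": {"type": "string"},
                    "quantity": {"type": "integer", "minimum": 1},
                    "size": {"enum": ["S", "M", "L"]},
                },
            },
        }
    },
}

def classify(raw):
    """Label one raw model output with the first of three rules it breaks."""
    try:
        cart = json.loads(raw)
    except json.JSONDecodeError:
        return "syntax error"
    item_props = CART_SCHEMA["properties"]["items"]["items"]["properties"]
    for item in cart["items"]:
        qty = item["quantity"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(qty, int) or isinstance(qty, bool):
            return "quantity is not an integer"
        if item["size"] not in item_props["size"]["enum"]:
            return "size outside enum"
    return "ok"

samples = [
    '{"items": [{"sku": "LATTE-OAT", "quantity": 2, "size": "L"},]}',   # trailing comma
    '{"items": [{"sku": "LATTE-OAT", "quantity": "two", "size": "L"}]}',
    '{"items": [{"sku": "LATTE-OAT", "quantity": 2, "size": "large"}]}',
    '{"items": [{"sku": "LATTE-OAT", "quantity": 2, "size": "L"}]}',
]
labels = [classify(s) for s in samples]
print(labels)
```

With constrained decoding the first three samples cannot be generated at all, so a check like this becomes a safety net rather than the primary defence.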
GBNF (GGML BNF)
A BNF-like notation used by llama.cpp and compatible servers. Good for custom mini-languages, SQL subsets, and configuration DSLs. Example rules (pair is defined elsewhere in the grammar; GBNF puts one rule per line, with no terminating semicolon):
root ::= object
object ::= "{" pair ("," pair)* "}"
Hand-authored grammars give maximum control but require maintenance when schemas change.
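One way to reduce that maintenance is to generate the repetitive rules from the schema. The sketch below builds a GBNF alternation for an enum of JSON strings; `gbnf_enum_rule` is a hypothetical helper, and it assumes plain ASCII values, for which JSON string escaping and GBNF string-literal escaping coincide.

```python
import json

def gbnf_enum_rule(rule_name, values):
    """Build a GBNF alternation matching each value as a JSON string literal."""
    # json.dumps(v) yields the JSON text including its quotes; dumping that
    # text again yields a quoted literal whose inner quotes are backslash-escaped.
    alternatives = [json.dumps(json.dumps(v)) for v in values]
    return f"{rule_name} ::= " + " | ".join(alternatives)

rule = gbnf_enum_rule("size", ["S", "M", "L"])
print(rule)
# prints: size ::= "\"S\"" | "\"M\"" | "\"L\""
```

Regenerating such rules at build time keeps the grammar in step with the enum list instead of relying on someone remembering to edit both.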
Regular expressions
Constrain output to match a regex: dates (\d{4}-\d{2}-\d{2}), UUIDs, ISO country codes. Simpler than full JSON when the output is a single field or line.
Ecosystem libraries
- Outlines — Python; JSON Schema, regex, and CFG constraints with Hugging Face and vLLM integrations.
- llguidance — fast Rust core; used by vLLM, SGLang, and others for low-latency masking.
- XGrammar — GPU-friendly JSON and schema masking for high-throughput serving.
- Guidance — templated generation with interleaved fixed text and constrained regions.
Hosted APIs (OpenAI structured outputs, Anthropic tool schemas, Google response schemas) increasingly expose the same guarantee without self-hosting the mask layer. See LLM inference serving for where constraints plug into vLLM, TGI, and custom stacks.
Constrained decoding vs alternatives
| Approach | Guarantee | Latency / cost | Best when |
|---|---|---|---|
| Prompt-only JSON | None (probabilistic) | Lowest single pass | Prototypes, human-reviewed output |
| Repair loop | Eventually valid if model cooperates | 1–3× tokens on failures | Low-volume, rare failures acceptable |
| Grammar-constrained decoding | Syntactic validity by construction | Small per-step CPU overhead | High-volume typed pipelines, agents |
| Provider structured outputs | Schema compliance (vendor-dependent) | Billed as normal completion | No self-hosted inference |
| Function / tool calling | Argument shape + routing to tools | Extra schema tokens in prompt | Model must invoke external APIs |
Harbor Commerce order parser refactor
The voice-order pipeline before refactor:
- Whisper transcription → GPT-class model with JSON prompt → JSON.parse → on failure, repair prompt with error message → retry up to 2 times → Pydantic validation → cart API.
After refactor:
- Schema as single source of truth: Pydantic model exported to JSON Schema; same schema drives API validation and decoding grammar.
- vLLM + llguidance: constraint mask applied on every decode step; enums for size (S|M|L) and milk type enforced at token level.
- Streaming disabled for this path: batch JSON is small; partial-stream parsing was removed in favor of atomic objects.
- Semantic validation retained: SKU lookup and price check still run after parse; constraints do not prove the item exists in catalog.
- Fallback tier: if ASR confidence < 0.72, route to human confirmation UI instead of loosening grammar.
Parse failure rate 6.8% → 0.04%; repair-related token spend eliminated on 94% of traffic; p95 end-to-end latency 3.8 s → 2.2 s. Catalog mismatch errors (wrong SKU) unchanged — those are retrieval problems, not syntax problems.
Tokenizer alignment challenges
The hardest engineering problem in constrained decoding is not the grammar; it is the tokenizer. LLMs emit subword tokens, not characters. A legal continuation might require token "2" but the model might otherwise prefer " two" (leading space + word). Good constraint engines:
- Precompute a map from automaton states to allowed token IDs.
- Handle byte-level BPE edge cases (incomplete UTF-8 sequences).
- Backtrack or reject partial tokens that cannot complete a legal string.
When evaluating libraries, test on your exact model and tokenizer: token IDs and merges differ between vocabularies, so a mask table built for Llama must be rebuilt, and its edge cases re-tested, for Qwen or Mistral.
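The first bullet, a precomputed map from automaton states to allowed token IDs, might look like the following sketch. The grammar (a JSON integer with no leading zeros), the state names, and the six-token vocabulary are assumptions for illustration; real engines work over the full tokenizer vocabulary and also handle end-of-sequence and byte-level tokens, which this toy omits.

```python
# Character-level DFA for a non-negative JSON integer: "0" or [1-9][0-9]*.
START, ZERO, DIGITS = "start", "zero", "digits"

def step(state, ch):
    """Return the next DFA state for one character, or None if it is illegal."""
    if state == START:
        if ch == "0":
            return ZERO
        if ch in "123456789":
            return DIGITS
    elif state == DIGITS and ch in "0123456789":
        return DIGITS
    return None  # ZERO accepts no further digits (no leading zeros)

def run_token(state, token):
    """Run a whole (possibly multi-character) token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def build_mask_table(vocab):
    """Precompute, once per (grammar, vocabulary), the legal token IDs per state."""
    return {
        state: frozenset(
            i for i, tok in enumerate(vocab)
            if tok and run_token(state, tok) is not None
        )
        for state in (START, ZERO, DIGITS)
    }

# Hypothetical vocabulary slice; a different tokenizer yields a different table.
VOCAB = ["2", " two", "12", "0", "07", "20"]
TABLE = build_mask_table(VOCAB)
print({state: sorted(ids) for state, ids in TABLE.items()})
```

At decode time the engine only looks up `TABLE[state]`, so the per-step cost no longer depends on re-running every token through the automaton. The table is tied to the token IDs of one vocabulary, which is why it cannot be reused across model families.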
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Grammar-constrained JSON | Agent plans, order forms, config generation at scale | Free-form creative writing, markdown articles |
| Regex constraint | Single-field formats (date, ID, phone) | Nested objects with optional keys |
| Custom CFG / GBNF | SQL subsets, DSLs, log query languages | Team lacks grammar maintenance discipline |
| Provider structured outputs | Managed API, no self-hosted GPU | Need on-prem, custom fine-tunes, or sub-10 ms mask control |
| Repair loop only | <1% failure rate, low volume | SLO-sensitive agents with cascading tool calls |
Common pitfalls
- Schema drift — API schema updates but grammar cache does not; version schemas and invalidate compiled grammars on deploy.
- Over-constraining creativity — forcing JSON for user-facing prose produces stiff answers; constrain only machine-facing stages.
- Assuming semantic correctness — valid JSON with hallucinated IDs still fails downstream; keep business validation.
- Ignoring tokenizer edge cases — rare invalid UTF-8 or split numbers; fuzz-test constraints with property-based generators.
- Streaming + constraints mismatch — some stacks buffer until a complete legal fragment exists; document client behavior.
- Latency surprise on huge enums — thousands of allowed SKU strings inflate mask computation; use ID indirection or retrieval first.
- Duplicate schema definitions — Pydantic, OpenAPI, and grammar compiled separately; generate all from one source file.
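For the schema drift pitfall, one option is to key compiled grammars by a hash of the schema content, so a changed schema can never hit a stale entry. This is a sketch under assumptions: `GrammarCache` and `fake_compile` are hypothetical, and `compile_fn` stands in for whatever compile call your constraint engine provides.

```python
import hashlib
import json

def schema_fingerprint(schema):
    """Stable hash of a JSON Schema dict, independent of key order."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class GrammarCache:
    """Cache compiled grammars by schema content rather than by name or version tag."""

    def __init__(self, compile_fn):
        self._compile = compile_fn   # hypothetical: the engine's schema compiler
        self._cache = {}

    def get(self, schema):
        key = schema_fingerprint(schema)
        if key not in self._cache:
            self._cache[key] = self._compile(schema)
        return self._cache[key]

compile_calls = []

def fake_compile(schema):
    """Stand-in compiler that records how often it is invoked."""
    compile_calls.append(schema)
    return ("compiled", schema_fingerprint(schema))

cache = GrammarCache(fake_compile)
v1 = {"properties": {"size": {"enum": ["S", "M", "L"]}}, "type": "object"}
v1_reordered = {"type": "object", "properties": {"size": {"enum": ["S", "M", "L"]}}}
v2 = {"type": "object", "properties": {"size": {"enum": ["S", "M", "L", "XL"]}}}

first = cache.get(v1)
again = cache.get(v1_reordered)   # same content, different key order: cache hit
changed = cache.get(v2)           # enum changed: recompiled under a new key
print(len(compile_calls))
```

Content hashing complements explicit schema versioning: the version tells humans what changed, while the hash guarantees the mask matches the schema actually deployed.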
Production checklist
- Export JSON Schema from the same types your API validates.
- Integrate constraint engine with your inference server (vLLM, llama.cpp, API).
- Measure parse success rate, mask overhead ms/token, and end-to-end latency.
- Fuzz-test grammars with random valid and near-valid inputs.
- Keep a repair fallback for model/provider outages, not routine failures.
- Log constraint violations separately from semantic business errors.
- Version and cache compiled grammars; rebuild on schema change.
- Document which fields are syntax-guaranteed vs fact-checked post-parse.
- For agents, constrain planner output; allow natural language only in user channel.
- Load-test enum-heavy schemas; consider short IDs resolved after parse.
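The fuzz-testing item can start as small as the sketch below, which compares a hand-written acceptor against a regex oracle on random valid and near-valid strings. The date grammar, the `accepts` function, and the mutation alphabet are illustrative assumptions; in practice the acceptor under test would be your constraint engine's matcher.

```python
import random
import re

# Oracle: re.ASCII keeps \d to 0-9 (by default \d also matches other Unicode digits).
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)

def accepts(text):
    """Hand-written acceptor under test: YYYY-MM-DD with ASCII digits only."""
    if len(text) != 10:
        return False
    return all(
        (ch == "-") if i in (4, 7) else (ch in "0123456789")
        for i, ch in enumerate(text)
    )

def random_valid(rng):
    """Random string that matches the pattern (not necessarily a real calendar date)."""
    return f"{rng.randrange(10000):04d}-{rng.randrange(100):02d}-{rng.randrange(100):02d}"

def mutate(rng, text):
    """Apply one random edit; the result is usually, but not always, invalid."""
    i = rng.randrange(len(text))
    ch = rng.choice("0123456789-/ tT")
    op = rng.choice(["replace", "delete", "insert"])
    if op == "replace":
        return text[:i] + ch + text[i + 1:]
    if op == "delete":
        return text[:i] + text[i + 1:]
    return text[:i] + ch + text[i:]

def fuzz(trials=2000, seed=0):
    """Return every input on which the acceptor and the oracle disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        valid = random_valid(rng)
        for candidate in (valid, mutate(rng, valid)):
            if accepts(candidate) != bool(DATE_RE.fullmatch(candidate)):
                mismatches.append(candidate)
    return mismatches

mismatches = fuzz()
print(len(mismatches))
```

A fixed seed keeps failures reproducible; any string that ends up in `mismatches` is a ready-made regression test.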
Key takeaways
- Grammar-constrained decoding masks illegal tokens at each step so output matches a declared grammar by construction.
- Harbor Commerce cut voice-order parse failures from 6.8% to 0.04% and removed repair-loop token waste on most traffic.
- Constraints guarantee syntax, not truth — keep semantic validation and retrieval for catalog facts.
- Tokenizer alignment is the hard part — test masks on your exact model vocabulary.
- Generate grammars from a single schema source to avoid API and mask drift.
Related reading
- LLM structured outputs explained — JSON Schema APIs and validation pipelines
- LLM function calling explained — tool routing and argument schemas
- LLM sampling and decoding strategies explained — temperature, top-p, and greedy search under masks
- LLM inference serving explained — where constrained decoding plugs into production stacks