Guide

LLM grammar-constrained decoding explained

Harbor Commerce's voice-order assistant transcribed “two large oat lattes and a blueberry muffin” into a cart JSON blob the fulfillment API could ingest. Prompt instructions demanded valid JSON; a repair loop re-prompted on parse failure. In production, 6.8% of orders still failed validation — trailing commas after the last line item, quantity emitted as the string "two" instead of 2, and enum values like "large" when the schema required "L". Each repair added 1.2 seconds and doubled token cost on the failure tail. The refactor moved constraint enforcement from post-generation parsing to grammar-constrained decoding: at every sampling step, illegal tokens are masked before the model can choose them. Parse failure rate dropped to 0.04%; p95 latency on the order path fell 41%. Grammar constraints guarantee syntax — not factual correctness — but syntax errors are what crash typed pipelines before business logic runs.

Constrained decoding sits one layer below structured outputs APIs and function calling: instead of hoping the model follows a schema, you compile the schema into a grammar (finite-state machine, context-free grammar, or JSON automaton) and filter the vocabulary at each token. This guide covers how logit masking works, grammar formats (GBNF, regex, JSON Schema), serving integrations (Outlines, llguidance, XGrammar), the Harbor Commerce refactor, a technique decision table vs prompt-only JSON mode and repair loops, pitfalls, and a production checklist.

What grammar-constrained decoding is

Grammar-constrained decoding (also called constrained generation or guided decoding) restricts which tokens an LLM may emit at each step of autoregressive sampling. After the model produces logits (unnormalized scores for every vocabulary token), a constraint engine computes the set of legal next tokens given the partial output so far and a declared grammar. All illegal logits are set to negative infinity (or masked out); sampling or greedy argmax proceeds only over legal tokens.

The result is output that is guaranteed to match the grammar by construction — valid JSON, a specific regex, a SQL subset, or a custom DSL. This differs from:

Prompt instructions — probabilistic compliance; no hard guarantee.
Post-hoc validation + repair — detects errors after generation; costs extra round-trips.
Provider structured-output modes — often implement constrained decoding server-side; you may not see the masking layer.

Constrained decoding does not prevent hallucinated field values inside valid syntax. A JSON object with a fabricated SKU is still syntactically valid. Pair constraints with retrieval, tool calls, and verification where facts matter.

How logit masking works step by step

Compile grammar — JSON Schema, GBNF, or regex is transformed into an automaton (FSM for regex; pushdown or specialized JSON automaton for nested structures).
Track parser state — after each emitted token, update the automaton: e.g., inside a string, after a colon, expecting a number.
Compute allowed token set — for the current state, which vocabulary tokens continue a legal parse? Tokenizers complicate this: one logical character may span multiple byte-level tokens, so engines precompute token-prefix trees.
Mask and sample — zero illegal logits; apply temperature, top-p, or greedy selection on the remainder.
Repeat until the automaton reaches an accepting state (complete JSON object, end of regex match, etc.).

Overhead is usually modest for JSON and regex (microseconds to low milliseconds per step on CPU). Deeply nested grammars or very large vocabularies can add measurable latency; profile on your target hardware and model tokenizer.

Grammar formats and tools

JSON Schema and typed objects

The most common production use case: force output to match a JSON Schema with required keys, enum values, numeric types, and array bounds. Libraries compile schema into token masks so the model cannot emit "two" where an integer is required or invent enum strings outside the allowed set.

GBNF (GGML BNF)

A BNF-like notation used by llama.cpp and compatible servers. Good for custom mini-languages, SQL subsets, and configuration DSLs. Example rule: root ::= object; object ::= "{" pair ("," pair)* "}"; Hand-authored grammars give maximum control but require maintenance when schemas change.

Regular expressions

Constrain output to match a regex — dates (\d{4}-\d{2}-\d{2}), UUIDs, ISO country codes. Simpler than full JSON when the output is a single token or line.

Ecosystem libraries

Outlines — Python; JSON Schema, regex, and CFG constraints with Hugging Face and vLLM integrations.
llguidance — fast Rust core; used by vLLM, SGLang, and others for low-latency masking.
XGrammar — GPU-friendly JSON and schema masking for high-throughput serving.
Guidance — templated generation with interleaved fixed text and constrained regions.

Hosted APIs (OpenAI structured outputs, Anthropic tool schemas, Google response schemas) increasingly expose the same guarantee without self-hosting the mask layer. See LLM inference serving for where constraints plug into vLLM, TGI, and custom stacks.

Constrained decoding vs alternatives

Approach	Guarantee	Latency / cost	Best when
Prompt-only JSON	None (probabilistic)	Lowest single pass	Prototypes, human-reviewed output
Repair loop	Eventually valid if model cooperates	1–3× tokens on failures	Low-volume, rare failures acceptable
Grammar-constrained decoding	Syntactic validity by construction	Small per-step CPU overhead	High-volume typed pipelines, agents
Provider structured outputs	Schema compliance (vendor-dependent)	Billed as normal completion	No self-hosted inference
Function / tool calling	Argument shape + routing to tools	Extra schema tokens in prompt	Model must invoke external APIs

Harbor Commerce order parser refactor

The voice-order pipeline before refactor:

Whisper transcription → GPT-class model with JSON prompt → JSON.parse → on failure, repair prompt with error message → retry up to 2 times → Pydantic validation → cart API.

After refactor:

Schema as single source of truth — Pydantic model exported to JSON Schema; same schema drives API validation and decoding grammar.
vLLM + llguidance — constraint mask applied on every decode step; enums for size (S|M|L) and milk type enforced at token level.
Streaming disabled for this path — batch JSON is small; partial-stream parsing was removed in favor of atomic objects.
Semantic validation retained — SKU lookup and price check still run after parse; constraints do not prove the item exists in catalog.
Fallback tier — if ASR confidence < 0.72, route to human confirmation UI instead of loosening grammar.

Parse failure rate 6.8% → 0.04%; repair-related token spend eliminated on 94% of traffic; p95 end-to-end latency 3.8 s → 2.2 s. Catalog mismatch errors (wrong SKU) unchanged — those are retrieval problems, not syntax problems.

Tokenizer alignment challenges

The hardest engineering problem in constrained decoding is not the grammar — it is the tokenizer. LLMs emit subword tokens, not characters. A legal continuation might require token "2" but the model might otherwise prefer " two" (leading space + word). Good constraint engines:

Precompute a map from automaton states to allowed token IDs.
Handle byte-level BPE edge cases (incomplete UTF-8 sequences).
Backtrack or reject partial tokens that cannot complete a legal string.

When evaluating libraries, test on your exact model and tokenizer — masks that work for Llama may need retuning for Qwen or Mistral vocabularies.

Technique decision table

Approach	Best when	Skip when
Grammar-constrained JSON	Agent plans, order forms, config generation at scale	Free-form creative writing, markdown articles
Regex constraint	Single-field formats (date, ID, phone)	Nested objects with optional keys
Custom CFG / GBNF	SQL subsets, DSLs, log query languages	Team lacks grammar maintenance discipline
Provider structured outputs	Managed API, no self-hosted GPU	Need on-prem, custom fine-tunes, or sub-10 ms mask control
Repair loop only	<1% failure rate, low volume	SLO-sensitive agents with cascading tool calls

Common pitfalls

Schema drift — API schema updates but grammar cache does not; version schemas and invalidate compiled grammars on deploy.
Over-constraining creativity — forcing JSON for user-facing prose produces stiff answers; constrain only machine-facing stages.
Assuming semantic correctness — valid JSON with hallucinated IDs still fails downstream; keep business validation.
Ignoring tokenizer edge cases — rare invalid UTF-8 or split numbers; fuzz-test constraints with property-based generators.
Streaming + constraints mismatch — some stacks buffer until a complete legal fragment exists; document client behavior.
Latency surprise on huge enums — thousands of allowed SKU strings inflate mask computation; use ID indirection or retrieval first.
Duplicate schema definitions — Pydantic, OpenAPI, and grammar compiled separately; generate all from one source file.

Production checklist

Export JSON Schema from the same types your API validates.
Integrate constraint engine with your inference server (vLLM, llama.cpp, API).
Measure parse success rate, mask overhead ms/token, and end-to-end latency.
Fuzz-test grammars with random valid and near-valid inputs.
Keep a repair fallback for model/provider outages, not routine failures.
Log constraint violations separately from semantic business errors.
Version and cache compiled grammars; rebuild on schema change.
Document which fields are syntax-guaranteed vs fact-checked post-parse.
For agents, constrain planner output; allow natural language only in user channel.
Load-test enum-heavy schemas; consider short IDs resolved after parse.

Key takeaways

Grammar-constrained decoding masks illegal tokens at each step so output matches a declared grammar by construction.
Harbor Commerce cut voice-order parse failures from 6.8% to 0.04% and removed repair-loop token waste on most traffic.
Constraints guarantee syntax, not truth — keep semantic validation and retrieval for catalog facts.
Tokenizer alignment is the hard part — test masks on your exact model vocabulary.
Generate grammars from a single schema source to avoid API and mask drift.