Guide
LLM fill-in-the-middle explained
Harbor Analytics' internal VS Code extension logged a 38% “reject” rate on tab completions: developers placed the cursor inside an existing function to add a null-check, but the model — trained only for left-to-right continuation — kept appending code after the closing brace instead of filling the gap. Switching the backend to a fill-in-the-middle (FIM) checkpoint (StarCoder-style PSM layout) and sending prefix + suffix around the cursor cut rejects to 11% and lifted accepted-line volume 2.4×. FIM is the training and inference pattern that teaches a transformer to predict missing spans between known prefix and suffix context — the default mode for modern IDE copilots, diff-aware edits, and structured hole-filling.
Standard causal language modeling sees only tokens to the left of the prediction point. That works for chat and greenfield generation but breaks when the user's cursor sits mid-file with closing brackets, imports, and tests already written below. FIM rearranges the input sequence during training and serving so the model conditions on both sides of the hole. This guide covers span-masking pretraining, PSM vs SPM token layouts, special FIM tokens, serving infill requests in vLLM and vendor APIs, the Harbor Analytics IDE refactor, a technique decision table against prefix-only completion and LoRA fine-tunes, pitfalls, and a production checklist.
What FIM solves
Prefix-only models assume everything unknown lies to the right of the cursor. Real editing workflows violate that constantly:
- Mid-function inserts — add error handling between an opening try and an existing except block.
- Bracket-aware completion — the suffix already contains ); or }}; the model must not duplicate closers.
- Multi-cursor and diff hunks — only a middle slice is missing; prefix and suffix are both authoritative.
- Template holes — SQL, HTML, or config files with fixed scaffolding and one slot to fill (overlaps with structured output patterns for JSON fields).
FIM does not replace chat or agent loops; it optimizes a narrow, high-frequency path: single-location infill with tight latency budgets (<200 ms perceived in IDEs). Chat models can approximate infill by stuffing suffix into the prompt (“complete only the middle; here is what comes after”), but dedicated FIM training is more reliable and token-efficient at scale.
How FIM training works
During pretraining or continued pretraining on code corpora, a random contiguous span is selected and removed from each sample. The model learns to predict that span given surrounding context. Two layout conventions dominate:
PSM (prefix – suffix – middle)
Input order: [prefix tokens] [suffix tokens] [middle tokens to predict].
StarCoder, CodeLlama, and many Hugging Face code models wrap segments with special tokens such as <fim_prefix>, <fim_suffix>, <fim_middle> (exact strings vary by tokenizer). No special attention mask is needed: the model applies ordinary causal attention to the rearranged sequence, and because the middle comes last, each predicted middle token can attend to both flanks.
PSM is the de facto standard for IDE serving because the suffix is presented before
generation begins, biasing the model toward compatible closers.
SPM (suffix – prefix – middle)
Reorders to [suffix] [prefix] [middle]. Some research checkpoints report
slightly better perplexity on certain languages; production stacks rarely expose SPM
unless the base weights were trained for it. Always match inference layout to training.
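The two layouts can be sketched as plain string assembly. This is a minimal illustration that assumes StarCoder-style token strings and takes the SPM order exactly as described above; the names FimTokens and build_fim_prompt are hypothetical, and a real checkpoint's tokenizer config is the authority on the exact strings and ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FimTokens:
    # Assumed StarCoder-style strings; other tokenizers use different ones.
    prefix: str = "<fim_prefix>"
    suffix: str = "<fim_suffix>"
    middle: str = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str, layout: str = "psm",
                     tok: FimTokens = FimTokens()) -> str:
    """Return the text the model sees before it starts generating the middle."""
    if layout == "psm":
        # prefix, then suffix, then the marker that opens the middle
        return f"{tok.prefix}{prefix}{tok.suffix}{suffix}{tok.middle}"
    if layout == "spm":
        # suffix first, then prefix, then the middle marker
        return f"{tok.suffix}{suffix}{tok.prefix}{prefix}{tok.middle}"
    raise ValueError(f"unknown FIM layout: {layout!r}")


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
```

Whatever the model generates after the final marker is the middle; the layout argument must match how the weights were trained.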
Span selection policy
- Random span length — uniform or log-uniform between a few tokens and ~50% of file length; prevents the model from only learning tiny holes.
- AST-aware spans — mask whole statements or function bodies so infill respects syntax boundaries (common in code-specific curricula).
- FIM rate — typically 50% FIM / 50% standard left-to-right in the same run so the model retains chat-style continuation ability.
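A rough sketch of the data transform behind this policy, under simplifying assumptions: it cuts at character positions rather than token or AST boundaries, picks the span uniformly without the length bounds described above, and uses StarCoder-style marker strings. The function name fim_transform is hypothetical.

```python
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Rewrite a training document into PSM order with probability fim_rate."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # keep as an ordinary left-to-right sample
    # Two distinct cut points; sorted, they delimit a non-empty masked span.
    lo, hi = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:lo], doc[lo:hi], doc[hi:]
    # The middle goes last so the training loss teaches the model to produce it
    # after having seen both flanks.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


rng = random.Random(0)
sample = fim_transform("def f(x):\n    return x + 1\n", rng, fim_rate=1.0)
```

With fim_rate=0.5, roughly half of the samples stay in their original order, which is how the same run keeps left-to-right continuation ability.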
FIM is usually applied during base pretraining or a code-heavy continued-pretrain phase, not as a small SFT patch on a chat-only model. Adding FIM to an instruction-tuned model without code re-pretrain yields weak infill.
Inference: formatting an infill request
At serving time the client sends three strings:
- Prefix — all text from file start through the cursor (or hole start).
- Suffix — all text from cursor (or hole end) through a sensible horizon (rest of function, rest of file, or capped token budget).
- Max middle tokens — stop condition for the generated span.
The server concatenates these with the FIM special tokens, runs a single decode pass (or speculative decode with a draft model), and returns only the middle segment; clients must not append the suffix again. Prefix and suffix lengths trade off against each other under the context window; a common IDE policy caps the suffix at 30–40% of context and the prefix at 50–60%, reserving the remainder for the special tokens and the generated middle.
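One way to apply such a budget is sketched below. It assumes the inputs are already token lists, that three special tokens are spent on the layout, and that shares are given as integer percentages; fit_to_budget and its parameter names are illustrative, not part of any serving stack.

```python
def fit_to_budget(prefix_toks, suffix_toks, context_len, max_middle,
                  prefix_pct=55, suffix_pct=35, n_special=3):
    """Truncate prefix and suffix token lists to their shares of the context."""
    p_cap = context_len * prefix_pct // 100
    s_cap = context_len * suffix_pct // 100
    if p_cap + s_cap + max_middle + n_special > context_len:
        raise ValueError("shares leave no room for the generated middle")
    # Keep the END of the prefix and the START of the suffix: the tokens
    # closest to the cursor carry the most relevant context.
    kept_prefix = prefix_toks[-p_cap:] if p_cap else []
    kept_suffix = suffix_toks[:s_cap]
    return kept_prefix, kept_suffix


kept_prefix, kept_suffix = fit_to_budget(list(range(200)), list(range(200)),
                                         context_len=100, max_middle=7)
```

A production client would usually cut at line or block boundaries rather than at an arbitrary token, but the arithmetic is the same.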
| Mode | Input shape | Typical use |
|---|---|---|
| PSM infill | fim_prefix + prefix + fim_suffix + suffix + fim_middle | IDE tab, inline ghost text |
| Prefix-only | prefix (causal) | New file bootstrap, terminal shells |
| Prompt-wrapped infill | Natural-language instructions + prefix + suffix in chat template | General chat models without native FIM tokens |
vLLM, TGI, and several cloud APIs expose an extra_body or suffix field on completion endpoints when the loaded checkpoint was trained with FIM. If the endpoint ignores the suffix, you are on a prefix-only path regardless of marketing copy.
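The client side of one round trip might look like the sketch below. It assumes an OpenAI-style completions body with prompt, suffix, and max_tokens fields; whether a given server honours suffix has to be checked against its documentation. The model name and the helper names are placeholders.

```python
def split_at_cursor(buffer: str, cursor: int) -> tuple[str, str]:
    """Everything before the cursor is the prefix, everything after is the suffix."""
    return buffer[:cursor], buffer[cursor:]


def build_request(buffer: str, cursor: int, max_tokens: int = 64,
                  model: str = "example-fim-model") -> dict:
    prefix, suffix = split_at_cursor(buffer, cursor)
    # Field names follow the OpenAI-style completions shape (an assumption
    # about the target server); "model" is a placeholder value.
    return {"model": model, "prompt": prefix, "suffix": suffix,
            "max_tokens": max_tokens}


def apply_middle(buffer: str, cursor: int, middle: str) -> str:
    # Insert only the returned middle. The suffix is already in the buffer,
    # so appending it again would duplicate closers.
    return buffer[:cursor] + middle + buffer[cursor:]


buffer = "total = sum(\n)\n"
request = build_request(buffer, cursor=12)
edited = apply_middle(buffer, 12, "values")
```

A quick check for a silently prefix-only endpoint is to send the same prefix with two different suffixes and compare the completions.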
Latency, batching, and quality controls
IDE completion is latency-sensitive. Production patterns:
- Debounce and cancel — fire infill 80–120 ms after keystroke pause; abort in-flight requests on the next key.
- Suffix truncation — send only from cursor to next blank line or closing brace at same indent; reduces noise from distant imports.
- Stop sequences — stop on suffix overlap detection (if the tail of the generated text matches the start of the suffix, trim it) to prevent duplicate closers.
- Logprob threshold — hide low-confidence ghosts; empty suggestions beat wrong inserts.
- Repo context optional — RAG over related files is usually a separate, slower path; keep hot-path FIM to single-buffer prefix/suffix.
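The debounce-and-cancel pattern from the list above can be sketched with asyncio. This assumes the editor integration can call on_keystroke from inside a running event loop and that fetch is an async callable wrapping the infill request; InfillDebouncer and fake_fetch are hypothetical names.

```python
import asyncio


class InfillDebouncer:
    def __init__(self, fetch, delay_s: float = 0.1):
        self._fetch = fetch        # async callable(prefix, suffix) -> str
        self._delay_s = delay_s    # pause to wait for after the last key
        self._task = None

    def on_keystroke(self, prefix: str, suffix: str) -> asyncio.Task:
        # A new key makes any pending or in-flight request stale.
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = asyncio.create_task(self._run(prefix, suffix))
        return self._task

    async def _run(self, prefix: str, suffix: str) -> str:
        await asyncio.sleep(self._delay_s)   # wait for a typing pause
        return await self._fetch(prefix, suffix)


async def demo() -> list:
    calls = []

    async def fake_fetch(prefix, suffix):
        calls.append(prefix)                 # record which request went out
        return "middle"

    debouncer = InfillDebouncer(fake_fetch, delay_s=0.01)
    debouncer.on_keystroke("a", "")
    debouncer.on_keystroke("ab", "")
    last = debouncer.on_keystroke("abc", "")
    await last
    return calls
```

Running asyncio.run(demo()) issues a single fetch for the final buffer state; the two earlier keystrokes are cancelled before their delay elapses.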
Throughput-oriented inference serving batches infill requests like any decode job; PSM layouts are compatible with continuous batching when prefix/suffix lengths vary per seat.
Harbor Analytics IDE refactor (worked example)
- Baseline — 7B instruction model, prefix-only API; 38% user rejects on mid-function triggers (telemetry on Tab key).
- Model swap — 7B code checkpoint with 50% FIM pretrain (PSM tokens in tokenizer); same GPU pool on vLLM.
- Client protocol — extension sends prefix, suffix (to end of scope block), max_tokens=64; server returns middle only.
- Suffix policy — tree-sitter finds enclosing block end; fallback cap 512 tokens.
- Stop logic — trim the generation where its tail matches the start of the suffix with Levenshtein similarity above 80%.
- Results — reject rate 38% → 11%; median latency 142 ms → 158 ms (+16 ms acceptable); accepted lines per session +140%.
- Rollback — feature flag per workspace routes to legacy prefix endpoint.
They did not fine-tune on private repos in v1; FIM capability came entirely from the base checkpoint. A later LoRA on internal APIs improved domain-specific infill another 9% on accept rate.
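The stop logic in this refactor could be approximated as follows. This is a sketch, not Harbor's implementation: it works on characters rather than tokens, defines similarity as 1 minus the edit distance divided by the longer length, and adds a min_overlap guard (an assumption) so that a single matching character does not trigger a trim.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def trim_suffix_overlap(generated: str, suffix: str,
                        threshold: float = 0.8, min_overlap: int = 8) -> str:
    """Cut the generation where its tail best matches the start of the suffix."""
    best_cut, best_sim = None, threshold
    for cut in range(len(generated) - min_overlap + 1):
        tail = generated[cut:]
        head = suffix[:len(tail)]
        sim = 1 - levenshtein(tail, head) / max(len(tail), len(head))
        if sim > best_sim:          # keep the best-scoring cut above threshold
            best_cut, best_sim = cut, sim
    return generated if best_cut is None else generated[:best_cut]


suffix = "    except ValueError:\n        return None\n"
generated = "    value = int(raw)\n    except ValueError:\n        return None"
trimmed = trim_suffix_overlap(generated, suffix)
```

The scan is quadratic-to-cubic in the generation length, which is tolerable for a 64-token middle but would need a cheaper pre-filter (for example, checking only line starts) on longer outputs.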
Technique decision table
| Approach | Best when | Trade-off |
|---|---|---|
| Prefix-only completion | New files, shells, chat, streaming drafts | Poor mid-buffer inserts; duplicates suffix tokens |
| Native FIM (PSM) | IDE tab, inline edit, hole in known scaffold | Needs FIM-trained weights and correct special tokens |
| Chat prompt infill | Quick prototype without code-specific model | Higher token cost, weaker bracket discipline |
| LoRA on FIM base | Private API style, internal DSL, org conventions | Data curation; risk of breaking FIM if SFT overwrites layout |
| Diff/patch models | Multi-hunk edits, whole-file refactors | Higher latency; different UX than single-cursor infill |
| Retrieval + prefix | Cross-file copy patterns, boilerplate from monorepo | RAG latency; does not fix suffix awareness alone |
Common pitfalls
- Wrong token order — PSM weights with SPM serving (or missing fim_middle) produces garbage or empty middles.
- Oversized suffix — entire file below cursor blows context and dilutes attention; truncate intelligently.
- Chat template on code FIM — wrapping FIM inputs in <|user|> chat markers not seen during FIM pretrain degrades quality.
- No FIM in base model — instruction-tuned chat models lack infill unless explicitly re-trained; suffix fields are ignored.
- Duplicating suffix in UI — client appends suffix after model output; double closing braces and semicolons.
- Ignoring stop-on-suffix — the model keeps generating past the hole and reproduces text that is already in the suffix; users see overlapping completions.
- Multilingual mismatch — FIM spans trained mostly on Python generalize poorly to Terraform or SQL without balanced span sampling.
- Eval only on line-end — benchmarks that measure only end-of-line continuation miss the mid-buffer failures FIM fixes.
Production checklist
- Confirm base checkpoint was trained with FIM (PSM or SPM) and note token strings.
- Implement prefix + suffix extraction with syntax-aware suffix bounds.
- Cap prefix/suffix shares of context window; reserve tokens for middle generation.
- Return middle only; never concatenate suffix on the client.
- Add stop-on-suffix overlap and max-middle token limits.
- Debounce, cancel stale requests, and hide low-logprob suggestions.
- Log accept/reject telemetry separately for mid-buffer vs end-of-line triggers.
- Match serving layout to training; version tokenizer special tokens with model hash.
- Evaluate on held-out mid-function holes, not just HumanEval-style prompts.
- Feature-flag fallback to prefix-only for models or languages without FIM support.
Key takeaways
- FIM trains models to predict a missing middle span conditioned on both prefix and suffix — essential for IDE tab completion inside existing code.
- PSM layout (prefix, suffix, then middle) with fim_* special tokens is the standard serving format for StarCoder-class checkpoints.
- Harbor Analytics cut tab-completion reject rate from 38% to 11% by switching to a native FIM model and syntax-bounded suffixes.
- Prefix-only chat models can approximate infill with prompt tricks but dedicated FIM pretrain is more reliable and token-efficient.
- Truncate suffixes, stop on overlap, and measure mid-buffer accept rate — end-of-line metrics hide the problem FIM solves.
Related reading
- LLM fine-tuning explained — when to adapt code models with LoRA and SFT
- LLM tokenization explained — BPE, special tokens, and context costs
- LLM inference serving explained — batching, latency, and production stacks
- LLM structured outputs explained — JSON Schema holes and constrained decoding