LLM fill-in-the-middle explained

Harbor Analytics' internal VS Code extension logged a 38% “reject” rate on tab completions: developers placed the cursor inside an existing function to add a null-check, but the model — trained only for left-to-right continuation — kept appending code after the closing brace instead of filling the gap. Switching the backend to a fill-in-the-middle (FIM) checkpoint (StarCoder-style PSM layout) and sending prefix + suffix around the cursor cut rejects to 11% and lifted accepted-line volume 2.4×. FIM is the training and inference pattern that teaches a transformer to predict missing spans between known prefix and suffix context — the default mode for modern IDE copilots, diff-aware edits, and structured hole-filling.

Standard causal language modeling sees only tokens to the left of the prediction point. That works for chat and greenfield generation but breaks when the user's cursor sits mid-file with closing brackets, imports, and tests already written below. FIM rearranges the input sequence during training and serving so the model conditions on both sides of the hole. This guide covers span-masking pretraining, PSM vs SPM token layouts, special FIM tokens, serving infill requests in vLLM and vendor APIs, the Harbor Analytics IDE refactor, a technique decision table against prefix-only completion and LoRA fine-tunes, pitfalls, and a production checklist.

What FIM solves

Prefix-only models assume everything unknown lies to the right of the cursor. Real editing workflows violate that constantly:

  • Mid-function inserts — add error handling between an opening try and an existing except block.
  • Bracket-aware completion — the suffix already contains ); or }}; the model must not duplicate closers.
  • Multi-cursor and diff hunks — only a middle slice is missing; prefix and suffix are both authoritative.
  • Template holes — SQL, HTML, or config files with fixed scaffolding and one slot to fill (overlaps with structured output patterns for JSON fields).

FIM does not replace chat or agent loops; it optimizes a narrow, high-frequency path: single-location infill with tight latency budgets (<200 ms perceived in IDEs). Chat models can approximate infill by stuffing suffix into the prompt (“complete only the middle; here is what comes after”), but dedicated FIM training is more reliable and token-efficient at scale.

How FIM training works

During pretraining or continued pretraining on code corpora, a random contiguous span is selected and removed from each sample. The model learns to predict that span given surrounding context. Two layout conventions dominate:

PSM (prefix – suffix – middle)

Input order: [prefix tokens] [suffix tokens] [middle tokens to predict]. Code models trained this way wrap the segments in special tokens. StarCoder-style tokenizers use <fim_prefix>, <fim_suffix>, and <fim_middle>, while other families such as CodeLlama use different strings, so read them from the tokenizer rather than hard-coding them. No special attention mask is required: because the suffix is placed before the middle, an ordinary causal mask already lets every middle token attend to both flanks while the middle itself is still predicted left to right. PSM is the de facto standard for IDE serving because the suffix is presented before generation begins, biasing the model toward compatible closers.
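As a minimal sketch of the PSM layout, the helper below concatenates prefix and suffix around StarCoder-style marker strings. The function name build_psm_prompt is hypothetical, and the token strings are the ones quoted in this guide; a real tokenizer may use different strings, and a production server would normally insert them as special-token IDs rather than plain text.

```python
# Marker strings follow the StarCoder-style names quoted in this guide.
# Assumption: your tokenizer may use different strings; look them up.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Lay out an infill request as prefix, then suffix, then the middle marker.

    The model is expected to generate the missing middle after FIM_MIDDLE.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Cursor sits right after "return " inside an existing function.
prompt = build_psm_prompt("def area(r):\n    return ", "\n\nprint(area(2))\n")
```

The prompt ends with the middle marker on purpose: everything the model emits after it is the infill, and the client never re-sends or re-appends the suffix.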

SPM (suffix – prefix – middle)

Reorders to [suffix] [prefix] [middle]. Some research checkpoints report slightly better perplexity on certain languages; production stacks rarely expose SPM unless the base weights were trained for it. Always match inference layout to training.

Span selection policy

  • Random span length — uniform or log-uniform between a few tokens and ~50% of file length; prevents the model from only learning tiny holes.
  • AST-aware spans — mask whole statements or function bodies so infill respects syntax boundaries (common in code-specific curricula).
  • FIM rate — typically 50% FIM / 50% standard left-to-right in the same run so the model retains chat-style continuation ability.
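The span-selection and FIM-rate ideas above can be sketched as a data transform applied to each training document. This is an illustrative character-level version with uniformly random cut points, not a reproduction of any particular training pipeline; apply_fim and its parameters are hypothetical names, and real pipelines usually operate on tokens and may use AST-aware spans instead.

```python
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def apply_fim(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Return doc unchanged, or rearranged into PSM order with probability fim_rate."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # keep as a standard left-to-right sample
    # Two distinct cut points define a non-empty masked span.
    start, end = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:start], doc[start:end], doc[end:]
    # The middle moves to the end, so ordinary next-token training
    # teaches the model to produce it after seeing prefix and suffix.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


rng = random.Random(0)
sample = apply_fim("def f(x):\n    return x + 1\n", rng, fim_rate=1.0)
```

Setting fim_rate to 0.5 mirrors the 50/50 mix mentioned above; the transform loses no text, so prefix + middle + suffix always reconstructs the original document.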

FIM is usually applied during base pretraining or a code-heavy continued-pretrain phase, not as a small SFT patch on a chat-only model. Adding FIM to an instruction-tuned model without code re-pretrain yields weak infill.

Inference: formatting an infill request

At serving time the client sends three strings:

  1. Prefix — all text from file start through the cursor (or hole start).
  2. Suffix — all text from cursor (or hole end) through a sensible horizon (rest of function, rest of file, or capped token budget).
  3. Max middle tokens — stop condition for the generated span.

The server concatenates the strings with the FIM special tokens, runs a single decode pass (or speculative decode with a draft model), and returns only the middle segment; clients must not append the suffix again. Prefix and suffix lengths trade off against each other under the context window. A common IDE policy caps the suffix at 30–40% of the window and the prefix at 50–60%, with the two caps chosen together so that room remains for the special tokens and the generated middle (the upper ends of both ranges cannot be used at once).
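The budget split can be made concrete with a small helper. This is a sketch under stated assumptions: fit_to_budget is a hypothetical name, inputs are already-tokenized ID lists, and the percentage shares are applied to what is left after reserving the middle and three special tokens (one possible reading of the policy above).

```python
def fit_to_budget(prefix_ids, suffix_ids, context_len, max_middle,
                  prefix_pct=60, suffix_pct=30, n_special=3):
    """Trim token-id lists so prefix + suffix + specials + middle fit the window."""
    available = context_len - max_middle - n_special
    if available <= 0:
        raise ValueError("context window too small for the requested middle length")
    # Integer arithmetic keeps the caps exact and reproducible.
    prefix_cap = available * prefix_pct // 100
    suffix_cap = available * suffix_pct // 100
    # Keep the END of the prefix and the START of the suffix:
    # both are the tokens nearest the cursor.
    kept_prefix = prefix_ids[-prefix_cap:] if prefix_cap > 0 else []
    kept_suffix = suffix_ids[:suffix_cap]
    return kept_prefix, kept_suffix


prefix, suffix = fit_to_budget(list(range(5000)), list(range(5000)),
                               context_len=4096, max_middle=64)
```

With a 4096-token window and max_middle=64, 4029 tokens remain to share, which gives caps of 2417 prefix tokens and 1208 suffix tokens and leaves some slack unused.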

Mode | Input shape | Typical use
PSM infill | fim_prefix + prefix + fim_suffix + suffix + fim_middle | IDE tab, inline ghost text
Prefix-only | prefix (causal) | New file bootstrap, terminal shells
Prompt-wrapped infill | Natural-language instructions + prefix + suffix in chat template | General chat models without native FIM tokens
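For the prompt-wrapped row, a rough sketch of how a client might phrase the request for a general chat model follows. Everything here is an assumption for illustration: the function name, the <HOLE> marker, the instruction wording, and the role/content message shape (a common convention, not a specific vendor's contract).

```python
def build_chat_infill_messages(prefix: str, suffix: str) -> list[dict[str, str]]:
    """Wrap prefix and suffix in instructions for a chat model without FIM tokens."""
    system = (
        "You fill a gap in source code. Reply with only the code that belongs "
        "at <HOLE>. Do not repeat any text that appears before or after it."
    )
    # <HOLE> is a hypothetical plain-text marker, not a trained special token.
    user = f"{prefix}<HOLE>{suffix}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_chat_infill_messages("total = sum(", ")\nprint(total)\n")
```

The instruction text costs tokens on every request and the model was never trained on the marker, which is why this path tends to be less reliable than native FIM.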

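vLLM, TGI, and several cloud APIs can serve FIM-trained checkpoints, but how the suffix reaches the model differs by server and version: some completion endpoints accept a dedicated suffix field (OpenAI-compatible clients typically pass non-standard fields through extra_body), while others expect the client to assemble the FIM token layout itself and send it as a raw prompt. Verify the behavior rather than trusting the schema: if the endpoint ignores the suffix, you are on a prefix-only path regardless of marketing copy.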

Latency, batching, and quality controls

IDE completion is latency-sensitive. Production patterns:

  • Debounce and cancel — fire infill 80–120 ms after keystroke pause; abort in-flight requests on the next key.
  • Suffix truncation — send only from cursor to next blank line or closing brace at same indent; reduces noise from distant imports.
  • Stop sequences — stop on suffix overlap detection (if generated text matches suffix prefix, trim) to prevent duplicate closers.
  • Logprob threshold — hide low-confidence ghosts; empty suggestions beat wrong inserts.
  • Repo context optional — RAG over related files is usually a separate, slower path; keep hot-path FIM to single-buffer prefix/suffix.
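Two of the controls above, suffix truncation and stop-on-overlap trimming, can be sketched as pure string functions. These are simplified, hypothetical helpers: truncation here stops only at the first blank line (the brace-at-same-indent rule is omitted), and the overlap check is an exact match on the longest shared run.

```python
def truncate_suffix(suffix: str, max_lines: int = 40) -> str:
    """Keep suffix text up to the first blank line, capped at max_lines lines."""
    kept = []
    for line in suffix.splitlines(keepends=True)[:max_lines]:
        # A blank first line is kept: the cursor may sit at the end of a line.
        if kept and line.strip() == "":
            break
        kept.append(line)
    return "".join(kept)


def trim_suffix_overlap(middle: str, suffix: str, min_overlap: int = 2) -> str:
    """Drop the tail of middle when it repeats the start of suffix."""
    target = suffix.lstrip()
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(middle), len(target)), min_overlap - 1, -1):
        if middle.endswith(target[:size]):
            return middle[:-size]
    return middle


near_suffix = truncate_suffix("    return x\n}\n\nimport os\n")
cleaned = trim_suffix_overlap("foo(bar));", ");\nnext()")
```

In this example the model produced a closing ");" that the suffix already contains, so the trimmed middle is "foo(bar)" and the buffer ends up with a single closer.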

Throughput-oriented inference serving batches infill requests like any decode job; PSM layouts are compatible with continuous batching when prefix/suffix lengths vary per seat.

Harbor Analytics IDE refactor (worked example)

  1. Baseline — 7B instruction model, prefix-only API; 38% user rejects on mid-function triggers (telemetry on Tab key).
  2. Model swap — 7B code checkpoint with 50% FIM pretrain (PSM tokens in tokenizer); same GPU pool on vLLM.
  3. Client protocol — extension sends prefix, suffix (to end of scope block), max_tokens=64; server returns middle only.
  4. Suffix policy — tree-sitter finds enclosing block end; fallback cap 512 tokens.
  5. Stop logic — trim generation when Levenshtein match to suffix start > 80%.
  6. Results — reject rate 38% → 11%; median latency 142 ms → 158 ms (+16 ms acceptable); accepted lines per session +140%.
  7. Rollback — feature flag per workspace routes to legacy prefix endpoint.
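Step 5 can be illustrated with a small fuzzy trimmer. The source only states that generation is trimmed when a Levenshtein match to the suffix start exceeds 80%; the details below (checking at line boundaries, a 24-character comparison window, a minimum candidate length, and the function names) are assumptions made for this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_trim(middle: str, suffix: str, window: int = 24,
               threshold: float = 0.8, min_chars: int = 3) -> str:
    """Cut middle at the first line whose start closely matches the suffix start."""
    target = suffix.lstrip()[:window]
    offset = 0
    for line in middle.splitlines(keepends=True):
        candidate = middle[offset:].lstrip()[:window]
        reference = target[:len(candidate)]
        if len(candidate) >= min_chars and similarity(candidate, reference) > threshold:
            return middle[:offset]
        offset += len(line)
    return middle


generated = "    if x is None:\n        return 0\n    return x + 1\n"
trimmed = fuzzy_trim(generated, "\n    return x + 1\n}\n")
```

Here the model re-emitted the existing "return x + 1" line, so the trimmer keeps only the new null-check. A fuzzy match tolerates small whitespace or punctuation differences that an exact overlap check would miss, at the cost of occasional false trims on short, repetitive lines.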

They did not fine-tune on private repos in v1; FIM capability came entirely from the base checkpoint. A later LoRA on internal APIs improved domain-specific infill another 9% on accept rate.

Technique decision table

Approach | Best when | Trade-off
Prefix-only completion | New files, shells, chat, streaming drafts | Poor mid-buffer inserts; duplicates suffix tokens
Native FIM (PSM) | IDE tab, inline edit, hole in known scaffold | Needs FIM-trained weights and correct special tokens
Chat prompt infill | Quick prototype without code-specific model | Higher token cost, weaker bracket discipline
LoRA on FIM base | Private API style, internal DSL, org conventions | Data curation; risk of breaking FIM if SFT overwrites layout
Diff/patch models | Multi-hunk edits, whole-file refactors | Higher latency; different UX than single-cursor infill
Retrieval + prefix | Cross-file copy patterns, boilerplate from monorepo | RAG latency; does not fix suffix awareness alone

Common pitfalls

  • Wrong token order — PSM weights with SPM serving (or missing fim_middle) produces garbage or empty middles.
  • Oversized suffix — entire file below cursor blows context and dilutes attention; truncate intelligently.
  • Chat template on code FIM — wrapping FIM inputs in <|user|> chat markers not seen during FIM pretrain degrades quality.
  • No FIM in base model — instruction-tuned chat models lack infill unless explicitly re-trained; suffix fields are ignored.
  • Duplicating suffix in UI — client appends suffix after model output; double closing braces and semicolons.
  • Ignoring stop-on-suffix — without an overlap check the model keeps generating past the hole and re-emits text that already exists in the suffix; users see overlapping completions.
  • Multilingual mismatch — FIM spans trained mostly on Python generalize poorly to Terraform or SQL without balanced span sampling.
  • Eval only on line-end — benchmarks that measure only end-of-line continuation miss the mid-buffer failures FIM fixes.

Production checklist

  • Confirm base checkpoint was trained with FIM (PSM or SPM) and note token strings.
  • Implement prefix + suffix extraction with syntax-aware suffix bounds.
  • Cap prefix/suffix shares of context window; reserve tokens for middle generation.
  • Return middle only; never concatenate suffix on the client.
  • Add stop-on-suffix overlap and max-middle token limits.
  • Debounce, cancel stale requests, and hide low-logprob suggestions.
  • Log accept/reject telemetry separately for mid-buffer vs end-of-line triggers.
  • Match serving layout to training; version tokenizer special tokens with model hash.
  • Evaluate on held-out mid-function holes, not just HumanEval-style prompts.
  • Feature-flag fallback to prefix-only for models or languages without FIM support.
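The telemetry item in the checklist can be sketched as a small aggregation that keeps mid-buffer and end-of-line triggers apart. The event shape (a trigger label plus an accepted flag) and the function name are hypothetical; real telemetry schemas will differ.

```python
from collections import Counter


def reject_rates(events):
    """Compute the reject rate per trigger type.

    events: iterable of (trigger, accepted) pairs, for example
    ("mid_buffer", False) for a rejected suggestion inside existing code.
    """
    shown, rejected = Counter(), Counter()
    for trigger, accepted in events:
        shown[trigger] += 1
        if not accepted:
            rejected[trigger] += 1
    return {trigger: rejected[trigger] / shown[trigger] for trigger in shown}


# Made-up events purely to exercise the function.
rates = reject_rates([
    ("mid_buffer", False),
    ("mid_buffer", True),
    ("end_of_line", True),
    ("end_of_line", True),
])
```

Reporting the two rates separately keeps a healthy end-of-line number from masking the mid-buffer failures that FIM is meant to fix.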

Key takeaways

  • FIM trains models to predict a missing middle span conditioned on both prefix and suffix — essential for IDE tab completion inside existing code.
  • PSM layout (prefix, suffix, then middle) with fim_* special tokens is the standard serving format for StarCoder-class checkpoints.
  • Harbor Analytics cut tab-completion reject rate from 38% to 11% by switching to a native FIM model and syntax-bounded suffixes.
  • Prefix-only chat models can approximate infill with prompt tricks but dedicated FIM pretrain is more reliable and token-efficient.
  • Truncate suffixes, stop on overlap, and measure mid-buffer accept rate — end-of-line metrics hide the problem FIM solves.
