
LLM attention sinks explained

Harbor Support's overnight chat agent handled tier-one tickets well for the first hour of each session. After roughly 8,000 tokens of back-and-forth, answers grew vague, hallucinated policy clauses, and ignored retrieved knowledge chunks. Engineers first blamed RAG chunking and raised the context window budget. Memory spiked; latency doubled; quality still collapsed on hour-long threads. The real issue was KV cache eviction: their sliding-window trim dropped the first four prompt tokens — system header and BOS — assuming recent turns mattered more. Those tokens were attention sinks: positions that collect disproportionate attention mass even when semantically empty. Remove them and the model's attention distribution destabilizes.

Attention sinks are a discovered property of pretrained transformers: a small number of initial (and sometimes delimiter) tokens absorb “overflow” attention that cannot be allocated elsewhere under softmax. StreamingLLM and related serving techniques exploit this to run effectively infinite generation with a fixed KV cache footprint. This guide covers the sink phenomenon, eviction policies that preserve sinks, comparison to sliding-window attention and summarization fallbacks, the Harbor Support gateway refactor, a technique decision table, pitfalls, and a production checklist.

What attention sinks are

In causal self-attention, each token distributes its attention weights across all prior positions. Softmax forces those weights to sum to 1, so a query with no strong match among earlier tokens still has to place its attention mass somewhere. In practice a few tokens, often the first token (BOS) and nearby positions, receive surprisingly high attention from tokens far downstream, even though those early tokens carry little lexical information.

Researchers call these attention sinks. They act as a structural “dumping ground” for attention mass the model cannot place on content tokens without breaking learned patterns. The effect appears across Llama, Mistral, and GPT-style architectures and persists after fine-tuning unless training explicitly changes early-token statistics.
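As a toy illustration of the softmax constraint, the sketch below uses made-up pre-softmax scores for one query: position 0 stands in for a sink and the remaining positions are weak content matches. The numbers are assumptions chosen for illustration, not measurements from any model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one query over six prior positions.
# Position 0 plays the sink; positions 1-5 are content the query matches weakly.
logits = [4.0, 0.5, 0.2, 0.4, 0.1, 0.3]

with_sink = softmax(logits)
without_sink = softmax(logits[1:])  # same scores after evicting position 0

print(f"sum of weights: {sum(with_sink):.3f}")
print(f"mass on position 0: {with_sink[0]:.3f}")
print(f"largest content weight, sink kept: {max(with_sink[1:]):.3f}")
print(f"largest content weight, sink evicted: {max(without_sink):.3f}")
```

With the sink gone, the same mass is renormalized over content tokens, so weights the model learned to keep small become several times larger.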

Sinks are not the same as a well-written system prompt. A system prompt carries instructions the model should attend to. Sink tokens may be semantically vacuous but structurally necessary. Deleting or re-embedding them mid-session is qualitatively different from trimming old user messages.

Why naive KV eviction breaks long chats

Production systems cap KV cache size to fit GPU memory. Common strategies:

  • Truncate oldest tokens — drop prefix until under budget.
  • Sliding window — keep only the last W tokens per layer.
  • Summarize and replace — compress history into a shorter block.

Truncating from the left often removes sink tokens first. Experiments show perplexity spikes and coherence collapses when the first four positions are evicted — even if the last 4,000 tokens of conversation remain intact. The model still runs; it just attends incorrectly. That matches Harbor Support's symptom: fluent but wrong answers, not CUDA OOM errors.
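A minimal sketch of why left truncation hits sinks first, working on token positions only. The function names and the 10,000-token session are hypothetical; the budget of 8,192 is used purely as an example.

```python
def left_truncate(positions, budget):
    """Drop the oldest entries until the cache fits the budget (budget > 0)."""
    return positions[-budget:] if len(positions) > budget else list(positions)

def surviving_sinks(kept, sink_count=4):
    """Return which of the first `sink_count` positions are still cached."""
    return [p for p in kept if p < sink_count]

positions = list(range(10_000))          # token positions in a long session
kept = left_truncate(positions, 8_192)

print(kept[0])                 # 1808: first surviving position
print(surviving_sinks(kept))   # []: every sink position is gone
```

The cache is still full of recent tokens, which is why the failure shows up as degraded answers rather than as a memory error.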

Summarization avoids the sink issue if the summary is prepended as a fresh prefix (creating new sinks), but introduces hallucination risk when summaries omit constraints. Sink-aware eviction is cheaper and more faithful for agent loops that must preserve tool outputs verbatim.

StreamingLLM and sink-aware eviction

StreamingLLM (Xiao et al., 2023) demonstrated that retaining only:

  1. Attention sink tokens — typically the first 4 positions;
  2. A recent sliding window — e.g. the last 2,048 tokens;

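…keeps perplexity stable over streams of millions of tokens with constant memory, close to a sliding-window baseline that recomputes the cache at every step. Middle tokens are discarded, so the model cannot recall anything that has left the window; the method stabilizes streaming generation rather than extending the usable context. Intuition: sinks stabilize global attention; the window preserves local semantics.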

Implementation sketch

  • During prefill, mark sink indices (0..S-1) as non-evictable.
  • When the cache exceeds budget B, evict from the middle span, positions S through len-W-1, and leave the sinks and the window in place.
  • On each decode step, append to the window; evict oldest non-sink, non-window entries.
  • Log effective context: sinks + window + new token.
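The steps above can be sketched at the level of token positions. This is a toy model under stated assumptions, not the StreamingLLM code or any serving stack's API: the class name `SinkWindowCache` and its methods are hypothetical, and a real cache would hold per-layer key/value tensors at these slots.

```python
from collections import deque

class SinkWindowCache:
    """Toy position-level model of sink + recent-window retention."""

    def __init__(self, sink_count=4, window=2048):
        if sink_count < 0 or window < 1:
            raise ValueError("sink_count must be >= 0 and window >= 1")
        self.sink_count = sink_count
        self.window = window
        self.sinks = []          # pinned, never evicted
        self.recent = deque()    # rolling window of newest positions
        self.evicted = []        # positions dropped so far (for logging)
        self._next = 0           # position of the next appended token

    def append(self):
        """Register one new token; return the position evicted, if any."""
        pos = self._next
        self._next += 1
        if len(self.sinks) < self.sink_count:
            self.sinks.append(pos)            # the first S tokens become sinks
            return None
        self.recent.append(pos)
        if len(self.recent) > self.window:
            dropped = self.recent.popleft()   # oldest non-sink entry
            self.evicted.append(dropped)
            return dropped
        return None

    def kept(self):
        """Positions currently cached: sinks followed by the window."""
        return self.sinks + list(self.recent)


cache = SinkWindowCache(sink_count=4, window=8)
for _ in range(20):
    cache.append()

print(cache.kept())     # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
print(cache.evicted)    # [4, 5, 6, 7, 8, 9, 10, 11]
```

One detail this sketch leaves out: StreamingLLM assigns positional information by slot within the cache rather than by position in the original text, so a real implementation also has to handle position ids after eviction.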

Serving stacks such as vLLM, SGLang, and llama.cpp each have their own cache-management options, and feature names and sink support vary by version, so confirm in your release's documentation that a sink-plus-window mode is available. Also check whether your stack preserves BOS and the first few special tokens after template rendering, because chat templates can shift sink indices.
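One way to check a rendered prompt is to inspect the token ids directly. The sketch below is framework-agnostic and uses invented token ids; `find_block` and `sink_report` are hypothetical helpers, and producing `rendered` from your tokenizer and chat template is left out.

```python
def find_block(rendered, block):
    """Return the start index of `block` inside `rendered`, or -1."""
    n, m = len(rendered), len(block)
    for i in range(n - m + 1):
        if rendered[i:i + m] == block:
            return i
    return -1

def sink_report(rendered, bos_id, system_ids, sink_count=4):
    """Summarize what the first `sink_count` rendered positions contain."""
    start = find_block(rendered, system_ids)
    return {
        "starts_with_bos": bool(rendered) and rendered[0] == bos_id,
        "system_block_start": start,
        # True when pinning the first S positions would also pin system text.
        "sinks_overlap_system": 0 <= start < sink_count,
    }

# Hypothetical token ids: 1 = BOS, 90-93 = template header, 500+ = system text.
BOS = 1
system_ids = [500, 501, 502]
old_template = [BOS, 90, 500, 501, 502, 700, 701]
new_template = [BOS, 90, 91, 92, 93, 500, 501, 502, 700, 701]

print(sink_report(old_template, BOS, system_ids))
print(sink_report(new_template, BOS, system_ids))
```

Running a check like this whenever the template or tokenizer changes shows whether the pinned range and the protected system range still line up with the rendered sequence.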

Sinks vs sliding-window attention vs long-context training

Sliding-window attention (SWA)

Models such as Mistral 7B bake a local attention mask into the architecture: each layer only sees the last W tokens by design. See our sliding-window guide. SWA reduces training and inference cost; sinks still matter once the initial tokens fall outside the window. StreamingLLM is complementary: it applies sink retention on top of models trained with full or windowed attention.

Long-context extensions (RoPE scaling, YaRN)

RoPE scaling and YaRN adjust rotary positional encodings so models can attend farther with little or no additional training. That raises the ceiling but does not remove KV memory cost, which still grows linearly with the number of cached tokens. Sink-aware eviction still helps when sessions exceed hardware budgets or API per-request limits.
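The memory cost is straightforward arithmetic: each cached token stores one key and one value vector per KV head in every layer. The sketch below uses an illustrative configuration (32 layers, 32 KV heads, head dimension 128, 2-byte values); these numbers are assumptions for the example, so substitute your model's actual config.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: keys and values in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_cached_tokens(budget_bytes, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Largest token count whose KV cache fits in `budget_bytes`."""
    return budget_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)

# Illustrative config: 32 layers, 32 KV heads, head dim 128, fp16 values.
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
print(per_token)                                  # 524288 bytes (0.5 MiB)
print(per_token * 16_384 / 2**30)                 # 8.0 GiB for a 16K full cache
print(per_token * (4 + 3_072) / 2**30)            # about 1.5 GiB for 4 sinks + 3,072 window
print(max_cached_tokens(2 * 2**30, 32, 32, 128))  # 4096 tokens in a 2 GiB budget
```

With grouped-query attention the KV head count is smaller than the query head count, which shrinks the per-token cost proportionally; the figures are per sequence, so multiply by the number of concurrent sessions.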

Learned sink tokens

Some research adds explicit “sink” placeholder tokens during fine-tuning so eviction policies can target them. Production systems usually rely on native BOS sinks rather than retraining.

Where sinks show up in production workloads

  • Multi-turn support chat — hours-long threads with tool calls; naive left truncation causes policy drift.
  • Agent loops — ReAct traces grow quickly; sinks plus a window beat full-history re-prefill on cost. Pair with history management.
  • Code assistants — large file context plus chat; evict middle imports before sink tokens or recent edits.
  • Streaming completions — infinite JSON or log generation for monitoring parsers.
  • Batch inference on long documents — when prefill-decode disaggregation ships KV blocks between nodes, sink indices must stay aligned across handoff.

Harbor Support chat gateway refactor

Harbor's gateway previously applied FIFO eviction on the combined system + history buffer when token count exceeded 8,192. The refactor:

  1. Sink reservation: the first 4 token positions after chat template rendering are pinned and never evicted.
  2. Protected system block: policy and tool schema tokens (typically 800–1,200 tokens) sit immediately after the sinks and are evicted only once middle history is exhausted.
  3. Window size 3,072: recent user, assistant, and tool messages are kept in full; middle turns are dropped while the sinks stay pinned.
  4. RAG re-injection: when middle eviction removes a cited chunk, it is re-fetched via incremental index lookup instead of trusting stale KV.
  5. Quality guard: if the faithfulness score drops below a threshold, a one-shot compression of the evicted span is written into a structured note placed after the sinks.
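The eviction priority in steps 1 to 4 can be sketched as a small planner over named segments. This is an illustration under assumptions, not Harbor's gateway code: the `Segment` type, the segment names, and the token counts are all hypothetical, and the quality guard in step 5 is not modeled.

```python
from dataclasses import dataclass

# Lower number = evicted earlier. Kinds missing from this map are never evicted.
EVICTION_ORDER = {"middle": 0, "system": 1}

@dataclass
class Segment:
    name: str
    kind: str             # "sink", "system", "middle", or "window"
    tokens: int
    cited: bool = False   # True if answers cite this segment (e.g. a RAG chunk)

def plan_eviction(segments, budget):
    """Return names of segments to evict so the total fits `budget`.

    Segments are assumed to be listed oldest first. Middle history goes
    first, the system block only if that is not enough; sinks and the
    recent window are never candidates.
    """
    total = sum(s.tokens for s in segments)
    candidates = [s for s in segments if s.kind in EVICTION_ORDER]
    candidates.sort(key=lambda s: EVICTION_ORDER[s.kind])  # stable: keeps age order
    evicted = []
    for seg in candidates:
        if total <= budget:
            break
        evicted.append(seg.name)
        total -= seg.tokens
    return evicted

def chunks_to_refetch(segments, evicted):
    """Evicted segments that are still cited and need re-injection."""
    gone = set(evicted)
    return [s.name for s in segments if s.name in gone and s.cited]

session = [
    Segment("sinks", "sink", 4),
    Segment("policy+tools", "system", 1_000),
    Segment("kb chunk 17", "middle", 600, cited=True),
    Segment("turns 1-20", "middle", 4_000),
    Segment("turns 21-30", "middle", 2_500),
    Segment("recent turns", "window", 3_072),
]

evicted = plan_eviction(session, budget=8_192)
print(evicted)                              # ['kb chunk 17', 'turns 1-20']
print(chunks_to_refetch(session, evicted))  # ['kb chunk 17']
```

Keeping the priority in one table makes the policy easy to audit: the system block can only appear in the result after every middle segment has already been dropped.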

Hour-long session hallucination rate fell 41% versus naive truncation; P99 latency improved 18% versus doubling context to 16K because KV footprint stayed flat.

Technique decision table

Your situation | Prefer | Avoid
Infinite or hour-long streaming generation | StreamingLLM: sink tokens + recent window | Left-truncate from token 0
Bounded GPU memory, fixed model | Sink-aware eviction with measured S and W | Raising context without eviction policy
Must preserve verbatim early tool outputs | Pin critical spans + sinks; evict middle only | Whole-history summarization
Architecturally long-context model (128K+) | RoPE scaling + sink eviction for cost control | Assuming full cache fits on one GPU
Short Q&A (<2K tokens) | Full KV cache; sinks matter less | Complex eviction logic
Multi-tenant API with prompt caching | Provider prefix cache + sink-stable templates | Reordering system prompt fields per request

Common pitfalls

  • Evicting BOS and the first special tokens. Default FIFO almost always hits sinks first; quality degrades before memory errors appear.
  • Assuming sink count is always four. Measure on your model and chat template; some templates add sink-like behavior on delimiter tokens.
  • Ignoring chat template shifts. Jinja renders can insert tokens before your intended system block; recompute sink indices after template expansion.
  • Confusing sinks with important instructions. Pinning policy text is separate from sink preservation; both may be needed.
  • Evicting without RAG re-fetch. Middle eviction removes retrieved evidence from KV; re-inject citations when answers reference docs.
  • Benchmarking only on perplexity. Task-specific eval (tool accuracy, citation faithfulness) catches sink regressions perplexity misses.
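To measure the sink count rather than assume four, one rough approach is to average the attention each key position receives from queries far downstream and count the leading positions that stand out. The sketch below assumes you have already exported a causal attention matrix (for example averaged over heads and layers); the 6x6 matrix and the 0.15 threshold are invented for illustration.

```python
def mean_mass_per_key(attn, skip_queries=0):
    """Average attention each key position receives.

    `attn` is a causal attention matrix as nested lists: attn[q][k] is the
    weight query q puts on key k, with zeros for k > q. Only queries at or
    after `skip_queries` are averaged, so early rows do not dominate.
    """
    rows = attn[skip_queries:]
    n_keys = len(attn[0])
    return [sum(r[k] for r in rows) / len(rows) for k in range(n_keys)]

def leading_sinks(mass, threshold):
    """Count consecutive leading positions whose mean mass exceeds threshold."""
    count = 0
    for m in mass:
        if m <= threshold:
            break
        count += 1
    return count

# Hypothetical 6x6 causal attention rows (each sums to 1).
attn = [
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.70, 0.30, 0.00, 0.00, 0.00, 0.00],
    [0.55, 0.25, 0.20, 0.00, 0.00, 0.00],
    [0.50, 0.20, 0.10, 0.20, 0.00, 0.00],
    [0.45, 0.20, 0.05, 0.10, 0.20, 0.00],
    [0.45, 0.15, 0.05, 0.05, 0.10, 0.20],
]

mass = mean_mass_per_key(attn, skip_queries=3)
print([round(m, 3) for m in mass])
print(leading_sinks(mass, threshold=0.15))
```

On real models the threshold needs tuning, and delimiter tokens later in the sequence may also show elevated mass, so it is worth looking at the full profile and not only the leading run.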

Production checklist

  • Identify sink indices on your model with attention visualization or StreamingLLM defaults.
  • Mark sink positions non-evictable in KV cache manager.
  • Define window size W from the GPU memory budget and the per-token KV cost (layer count, KV head count under GQA, head dimension, and dtype).
  • Evict middle tokens only; never drop sinks or the active window.
  • Recompute sink indices after chat template or tokenizer changes.
  • Protect system and tool-schema blocks after sinks in eviction priority.
  • Re-inject RAG chunks when eviction removes cited context.
  • Log evicted token ranges for debugging quality regressions.
  • A/B long sessions: sink-aware vs full cache on faithfulness metrics.
  • Pair with prompt caching for static prefixes.
  • Document S and W in runbooks; tune per model revision.
  • Fall back to summarization only when sink+window still exceeds budget.
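For the logging item in the checklist, a compact record of what was evicted is usually enough to correlate quality regressions with cache state. The sketch below is one possible shape; the field names and the session id are hypothetical.

```python
def to_ranges(positions):
    """Collapse token positions into sorted inclusive (start, end) ranges."""
    ranges = []
    for p in sorted(set(positions)):
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], p)   # extend the current run
        else:
            ranges.append((p, p))             # start a new run
    return ranges

def eviction_record(session_id, sink_count, window, evicted):
    """Build one structured log entry for an eviction pass."""
    return {
        "session": session_id,
        "S": sink_count,
        "W": window,
        "evicted_ranges": to_ranges(evicted),
        "evicted_tokens": len(set(evicted)),
    }

record = eviction_record("sess-42", 4, 3_072, [4, 5, 6, 7, 900, 901, 1500])
print(record["evicted_ranges"])   # [(4, 7), (900, 901), (1500, 1500)]
```

Recording S and W alongside the ranges keeps each entry interpretable after the values are retuned for a new model revision.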

Key takeaways

  • Attention sinks are initial (and sometimes delimiter) tokens that absorb disproportionate attention mass — they are structurally necessary, not semantically optional.
  • Naive left truncation of KV cache destroys sinks and collapses coherence on long streams even when recent context is intact.
  • StreamingLLM keeps sink tokens plus a recent sliding window for stable generation quality at constant memory; Harbor Support cut hallucinations 41% with this pattern.
  • Sink-aware eviction complements sliding-window models, RoPE scaling, and RAG — it does not replace them.
  • Measure sink indices per chat template; eviction policies must survive template and model upgrades.
