Sliding window attention explained
Harbor Support's RAG pipeline ingests 20-page policy PDFs into a single
32k-token prompt. Full self-attention over every prior token made prefill
attention cost grow quadratically with prompt length while the KV cache grew
with every token, and pushed time to first token (TTFT) past their 800 ms
SLO. Switching the serving stack to a sliding window attention mask, in which
each token attends only to the previous W positions (4096 in their
deployment), cut peak cache memory by 71% and restored sub-600 ms TTFT on the
same hardware, with no measurable regression on their internal
ticket-resolution benchmark.
Sliding window attention (also called local attention) is how Mistral,
Gemma 2, and several long-context open models offer long theoretical context
without full-length attention matrices. This guide explains the mask math,
how stacked layers widen the effective receptive field, ring-buffer KV
implementations, pairing with RoPE and GQA, the Harbor Support refactor, a
technique decision table comparing windowed attention with full and sparse
attention, pitfalls, and a production checklist, alongside our
Flash Attention guide.
From quadratic attention to a local neighborhood
Standard decoder self-attention lets position i attend to all
positions j ≤ i. Attention scores form an
n × n matrix (causal triangle) and the
KV cache stores
keys and values for every past token during decode. Memory and compute scale
as O(n²) in sequence length n for the attention
block itself, and cache bytes scale linearly with n but with a
large constant (2 × num layers × KV heads × head dimension × bytes per element).
Sliding window attention restricts the causal mask: token
i may attend only to tokens in
[max(0, i − W + 1), i], where W is the window
size (e.g. 4096). Tokens outside the window are masked to
−∞ before softmax. Per-layer attention complexity
drops to O(n × W) — linear in sequence length for
fixed W. During autoregressive decode, once the sequence exceeds
W, you can evict KV entries older than the window
instead of appending forever.
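The mask rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any framework's implementation: the function name sliding_window_mask is hypothetical, and production kernels build the band block by block instead of materializing an n × n matrix.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """Build an n x n mask; mask[i][j] is True when query i may attend to key j.

    Hypothetical helper for illustration: query i sees keys in
    [max(0, i - window + 1), i], matching the interval in the text.
    """
    if n < 0 or window < 1:
        raise ValueError("n must be >= 0 and window must be >= 1")
    return [
        [max(0, i - window + 1) <= j <= i for j in range(n)]
        for i in range(n)
    ]


# Print a small band: 'x' marks allowed pairs, '.' marks pairs set to -inf.
mask = sliding_window_mask(n=6, window=3)
for row in mask:
    print("".join("x" if keep else "." for keep in row))
```

With n = 6 and W = 3, every row keeps at most three entries ending on the diagonal; the first rows keep fewer because there are not yet W tokens to look back on.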
Effective context grows with depth
A single layer's receptive field is only W tokens, but
stacked layers compound: after L layers, a token's representation can
indirectly draw on tokens roughly L × W positions back (Beltagy et al., Longformer;
Mistral technical report). A 32-layer model with W = 4096 has a
theoretical effective span near 128k tokens even though no one layer sees more
than 4k neighbors. That is why “32k context” models can use 4k
windows without collapsing on medium-length documents — depth provides
long-range routing, not a single global attention map.
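The depth argument can be checked with a short calculation. The sketch below assumes every layer uses the same window and that information hops back at most W − 1 positions per layer (the window includes the token itself), so the exact count is L × (W − 1) + 1, which the text rounds to L × W. The function name reachable_span is hypothetical, and the result is a theoretical upper bound on reach, not a measure of how well a model actually uses distant tokens.

```python
def reachable_span(num_layers: int, window: int, seq_len: int) -> int:
    """Count input positions that can influence the last token after num_layers.

    Each banded layer lets information hop back at most window - 1 positions,
    so the earliest reachable position moves left once per layer.
    """
    last = seq_len - 1
    earliest = last
    for _ in range(num_layers):
        earliest = max(0, earliest - (window - 1))
    return last - earliest + 1


# 32 layers with W = 4096 on a sequence long enough not to clip the span.
span = reachable_span(num_layers=32, window=4096, seq_len=200_000)
print(span, 32 * 4096)
```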
Implementation: masks, kernels, and ring-buffer KV cache
Training applies a banded causal mask. Frameworks expose this as a custom
attention mask, a sliding_window config field (Hugging Face
Mistral and Gemma 2 classes), or fused kernels in
Flash Attention
variants that skip fully masked blocks. The mask must be consistent across
prefill (many tokens processed at once) and decode (one new token per step).
Ring-buffer KV cache during decode
Once the sequence length exceeds W, the cache stops growing. Store
keys and values in a circular buffer of length W: each new token
overwrites the slot of the token falling out of the window. Position indices
for RoPE still use absolute positions (token 50,000 gets position 50,000)
even though only the last W KV pairs are materialized. Because the model was
trained with the same windowed mask, the evicted entries would have been
masked out anyway, so logits are unaffected.
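A ring buffer of this kind can be sketched as follows. The class RingKVCache and its methods are hypothetical names for a single layer and head, with opaque Python objects standing in for key and value tensors; serving frameworks use paged or block-based layouts, so treat this only as an illustration of the slot arithmetic.

```python
class RingKVCache:
    """Fixed-size KV store: the slot for a token is absolute_position % window."""

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be >= 1")
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.length = 0  # total tokens seen so far (next absolute position)

    def append(self, key, value) -> int:
        """Store one token's K/V, overwriting the entry that left the window."""
        position = self.length
        slot = position % self.window
        self.keys[slot] = key
        self.values[slot] = value
        self.length += 1
        return position  # absolute position, the index RoPE should use

    def window_entries(self):
        """Return (absolute_position, key, value) for live entries, oldest first."""
        start = max(0, self.length - self.window)
        return [
            (p, self.keys[p % self.window], self.values[p % self.window])
            for p in range(start, self.length)
        ]


cache = RingKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print([p for p, _, _ in cache.window_entries()])
```

After ten tokens with W = 4, storage still holds four slots, and the live entries carry absolute positions 6 through 9 rather than slot indices.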
Memory per layer for decode becomes roughly
2 × W × hidden × bytes_per_elem (K and V),
scaled by the ratio of KV heads to query heads when GQA shares KV heads.
Compare to full attention: 2 × n × hidden × bytes_per_elem. At
n = 32k and W = 4k, that is an 8× reduction
in cache footprint before accounting for GQA.
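The arithmetic above can be made concrete with a small calculator. The model shape used here (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) is an assumed Mistral-7B-like configuration chosen for illustration, not Harbor Support's deployment, and kv_cache_bytes is a hypothetical helper that ignores allocator overhead and paging.

```python
def kv_cache_bytes(
    tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
) -> int:
    """Bytes for keys plus values across all layers for one sequence."""
    return 2 * tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


n, window = 32 * 1024, 4 * 1024
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)  # assumed shape

full = kv_cache_bytes(n, **shape)
windowed = kv_cache_bytes(min(n, window), **shape)

print(f"full:     {full / 2**30:.2f} GiB")
print(f"windowed: {windowed / 2**30:.2f} GiB")
print(f"ratio:    {full / windowed:.0f}x")
```

Because num_kv_heads is already the GQA-reduced head count, the GQA saving and the window saving multiply rather than overlap.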
Prefill vs decode asymmetry
Prefill on a 32k prompt still computes many banded attention blocks in one
kernel launch; cost is O(n × W) not O(n²),
which is why long-document ingestion speeds up dramatically. Decode stays
cheap because each step touches at most W prior keys. Pair with
continuous batching in
inference servers
so variable-length sessions do not reserve full-length caches up front.
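To see how O(n × W) compares with O(n²) at a concrete size, one can count the query-key pairs each mask keeps during prefill. This sketch counts pairs per head and per layer only; it is a rough proxy for attention work, and the function names are hypothetical. Real speedups depend on the kernel, block sizes, and memory bandwidth.

```python
def full_causal_pairs(n: int) -> int:
    """Pairs (i, j) with j <= i: the causal triangle."""
    return n * (n + 1) // 2


def windowed_pairs(n: int, window: int) -> int:
    """Pairs kept by a sliding window: query i sees min(i + 1, window) keys."""
    if n <= window:
        return full_causal_pairs(n)
    # The first `window` queries form a triangle; the rest see `window` keys each.
    return full_causal_pairs(window) + (n - window) * window


n, window = 32 * 1024, 4096
full = full_causal_pairs(n)
banded = windowed_pairs(n, window)
print(full, banded, round(full / banded, 2))
```

The gap widens as n grows for a fixed W, since the full count grows quadratically while the banded count grows linearly.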
Where sliding window appears in production models
- Mistral 7B / Mixtral: sliding_window=4096 on alternating or all layers (check config); long context via stacked windows plus YaRN-scaled RoPE on some variants.
- Gemma 2: local-global hybrid that alternates 4096-token sliding-window layers with full-attention layers, injecting global signal without paying full cost everywhere.
- Longformer / BigBird: research predecessors combining local windows with global tokens (CLS, task tokens) for document classification and QA.
- Custom enterprise RAG: teams wrap base models with explicit window eviction in vLLM/TGI when serving 100k+ token archives where retrieval already localizes relevant chunks.
Sliding window is not the same as speculative decoding or context extension via NTK/YaRN — it changes who attends to whom, not how fast tokens are generated or how RoPE extrapolates.
Harbor Support long-doc RAG refactor (worked example)
Problem. Harbor Support routes enterprise tickets through a 70B-class model with 32k-token prompts (retrieved policy chunks + thread history). With full attention on an A100 80GB with FP8 weights, prefill memory peaked at 68 GB (KV cache plus activations), p95 TTFT was 1.4 s, and batch size was capped at 4 concurrent sessions before OOM.
Change. Enabled native sliding window (W = 4096)
in the vLLM model config, matching the base Mistral training window.
Retained GQA (8 KV heads). Chunk retrieval prompt layout unchanged —
most relevant passages were already placed in the final 8k tokens per their
reranker. Added monitoring for cache eviction events.
Results. Peak KV memory fell from 41 GB to 12 GB during 32k
prefill. p95 TTFT dropped to 580 ms. Concurrent session capacity rose to
14 on the same GPU. Resolution accuracy on a 2,400-ticket holdout: 84.1% vs
84.3% full-attention baseline (within noise). Failures clustered on tickets
requiring cross-references >12k tokens apart — fixed by moving those
chunks adjacent in the prompt template, not by widening W.
Lesson. Sliding window rewards intentional prompt layout. Retrieval should place evidence near the question; do not rely on the model to find a needle 30k tokens back if no layer sees that far in one hop.
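A prompt-layout check along these lines can run before a request is sent. The sketch below is an assumption about how such a check might look, not Harbor Support's tooling: Chunk, chunks_beyond_window, and the chunk names and offsets are all hypothetical, and token offsets are assumed to come from the same tokenizer the model uses.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    name: str
    start: int  # offset of the chunk's first token in the prompt
    end: int    # one past the chunk's last token


def chunks_beyond_window(chunks, question_pos: int, window: int) -> list[str]:
    """Names of chunks not fully inside the single-hop window of the question."""
    window_start = max(0, question_pos - window + 1)
    return [c.name for c in chunks if c.start < window_start]


# Hypothetical 32k prompt layout with the question near the end.
layout = [
    Chunk("refund_policy", 1_000, 1_800),
    Chunk("sla_terms", 29_000, 29_900),
    Chunk("thread_history", 30_000, 32_000),
]
flagged = chunks_beyond_window(layout, question_pos=32_500, window=4096)
print(flagged)
```

Flagged chunks are still reachable through stacked layers, but only indirectly; moving them closer to the question removes that dependence.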
Technique decision table
| Technique | Attention scope | Memory scaling | When to choose |
|---|---|---|---|
| Full causal attention | All prior tokens | O(n) KV cache | Short context, maximum recall, small models |
| Sliding window (local) | Last W tokens per layer | O(W) KV per layer | Long prompts, Mistral/Gemma-class models, RAG with local evidence |
| Local + global layers (Gemma 2) | Window + periodic full | Between O(W) and O(n) | Need occasional global token without full n² everywhere |
| Flash Attention (kernel) | Same math, faster IO | Same asymptotics | Always enable when supported; orthogonal to windowing |
| GQA / MQA | Unchanged | Fewer KV heads | Pair with sliding window for multiplicative savings |
| ALiBi / long RoPE scaling | Positional encoding | Does not cap KV size | Extrapolate positions; combine with window for memory cap |
| RAG chunking + rerank | External retrieval | Prompt-sized | When facts live outside any feasible W; layout chunks deliberately |
Common pitfalls
- Assuming window size equals usable context: effective reach is roughly layers × W; plan prompts and retrieval accordingly.
- Widening W without retraining or YaRN: attending beyond the trained window without position scaling hurts quality; match serving config to checkpoint metadata.
- Burying evidence outside the local band: a citation 20k tokens above the question may be invisible to early layers; use reranking to co-locate query and evidence.
- Forgetting global layers in hybrid models: Gemma 2 needs its full-attention layers enabled; disabling them collapses long-range reasoning.
- Ring-buffer position bugs: incorrect RoPE indices after eviction cause silent garbage; use battle-tested servers (vLLM, TGI) or unit-test positions against a reference forward.
- Confusing training and inference window: some models train with window W but advertise 128k via RoPE; serving still benefits from KV eviction even when positions extrapolate.
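The position-bug pitfall lends itself to a small unit test: decode with a ring buffer and compare every step against a reference that applies the banded mask to the full history. The sketch below is a toy under stated assumptions: a single head, 2-D queries and keys, scalar values, and a single-pair rotation standing in for RoPE. All function names are hypothetical, and the use_slot_index flag exists only to reproduce the bug.

```python
import math
import random


def rope(vec, pos, theta=0.1):
    """Rotate a 2-D vector by pos * theta (a one-pair stand-in for RoPE)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return [vec[0] * c - vec[1] * s, vec[0] * s + vec[1] * c]


def attend(query, keys, values):
    """Softmax attention of one query over parallel key/value lists."""
    scores = [query[0] * k[0] + query[1] * k[1] for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def reference_decode(qs, ks, vs, window):
    """Full history kept; the window is applied by slicing (the banded mask)."""
    outputs = []
    for t in range(len(qs)):
        lo = max(0, t - window + 1)
        keys = [rope(ks[j], j) for j in range(lo, t + 1)]
        outputs.append(attend(rope(qs[t], t), keys, vs[lo:t + 1]))
    return outputs


def ring_decode(qs, ks, vs, window, use_slot_index=False):
    """Ring-buffer decode; use_slot_index=True reproduces the position bug."""
    key_slots, value_slots, outputs = [], [], []
    for t in range(len(qs)):
        slot = t % window
        rotated = rope(ks[t], slot if use_slot_index else t)
        if t < window:
            key_slots.append(rotated)
            value_slots.append(vs[t])
        else:
            key_slots[slot], value_slots[slot] = rotated, vs[t]
        outputs.append(attend(rope(qs[t], t), key_slots, value_slots))
    return outputs


def all_close(a, b):
    return all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-9) for x, y in zip(a, b))


random.seed(0)
n, window = 20, 4
qs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
ks = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
vs = [random.gauss(0, 1) for _ in range(n)]

reference = reference_decode(qs, ks, vs, window)
assert all_close(reference, ring_decode(qs, ks, vs, window))
assert not all_close(reference, ring_decode(qs, ks, vs, window, use_slot_index=True))
print("ring-buffer decode matches the masked reference")
```

Keys are rotated once, at write time, with their absolute position, which is why slot order inside the buffer does not matter to the softmax. Rotating by slot index instead agrees with the reference only until the first eviction.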
Production checklist
- Read sliding_window/attention_window_size from model config before serving.
- Enable fused or Flash banded attention kernels when the framework supports them.
- Cap KV cache allocation at W per layer (ring buffer) for decode.
- Pair sliding window with GQA/MQA for maximum KV savings.
- Layout RAG prompts so retrieved chunks sit near the user question.
- Benchmark TTFT and tokens/sec at target context length with and without windowing.
- Monitor tasks where evidence spans exceed layers × W effective reach.
- For hybrid local-global architectures, never drop global attention layers in production.
- Document window size and eviction policy in serving runbooks for reproducibility.
- Regression-test long-range QA cases after any prompt-template or window change.
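The first few checklist items can be folded into a startup check. This sketch assumes a JSON model config with sliding_window and num_hidden_layers keys, as found in Hugging Face Mistral-style config.json files; other frameworks may use different field names. The helper serving_plan and the example config string are hypothetical.

```python
import json


def serving_plan(config_text: str) -> dict:
    """Derive the per-layer KV cap and approximate reach from a model config."""
    config = json.loads(config_text)
    window = config.get("sliding_window")  # may be absent or null
    layers = config["num_hidden_layers"]
    if window is None:
        # No window configured: the cache grows with sequence length.
        return {"window": None, "kv_tokens_per_layer": None, "approx_reach": None}
    return {
        "window": window,
        "kv_tokens_per_layer": window,     # ring-buffer cap for decode
        "approx_reach": layers * window,   # the layers x W rule of thumb
    }


# Hypothetical config contents for illustration.
example_config = (
    '{"model_type": "mistral", "num_hidden_layers": 32, "sliding_window": 4096}'
)
plan = serving_plan(example_config)
print(plan)
```

Logging this plan at startup gives the runbook a concrete record of the window size and the reach beyond which long-range regressions should be monitored.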
Key takeaways
- Sliding window attention limits each token to the previous W positions, reducing attention from O(n²) to O(n×W).
- Stacked layers compound receptive field to roughly layers×W even though each layer is local.
- Ring-buffer KV eviction caps decode memory at O(W) per layer — the main production win for long sessions.
- Mistral, Gemma 2, and similar models rely on windows plus depth (and sometimes global layers) for long context.
- RAG pipelines must place evidence near the query; sliding window punishes scattered prompt layouts.
Related reading
- Attention mechanism explained — full causal self-attention baseline
- LLM KV cache explained — prefill, decode, and what windowing evicts
- Flash Attention explained — IO-efficient kernels that pair with banded masks
- Rotary position embeddings (RoPE) explained — absolute positions with truncated KV storage