Sliding window attention explained

Harbor Support's RAG pipeline ingests 20-page policy PDFs into a single 32k-token prompt. With full self-attention over every prior token, attention work grew quadratically with prompt length during prefill, the KV cache grew with every token, and time to first token (TTFT) blew past their 800 ms SLO. Switching the serving stack to a sliding window attention mask, in which each token attends only to the previous W positions (4096 in their deployment), cut peak cache memory by 71% and restored sub-600 ms TTFT on the same hardware, with no measurable regression on their internal ticket-resolution benchmark. Sliding window (also called local attention) is how Mistral, Gemma 2, and several long-context open models support long contexts without paying for a full-length attention matrix in every layer. This guide explains the mask math, how stacked layers widen the effective receptive field, ring-buffer KV implementations, pairing with RoPE and GQA, the Harbor Support refactor, a technique decision table vs full and sparse attention, pitfalls, and a production checklist alongside our Flash Attention guide.

From quadratic attention to a local neighborhood

Standard decoder self-attention lets position i attend to all positions j ≤ i. Attention scores form an n × n matrix (causal triangle), and the KV cache stores keys and values for every past token during decode. Compute for the attention block scales as O(n²) in sequence length n, as does memory when the score matrix is materialized. Cache bytes scale linearly with n but with a large constant (2 × num layers × KV heads × head dim × bytes per element).

Sliding window attention restricts the causal mask: token i may attend only to tokens in [max(0, i − W + 1), i], where W is the window size (e.g. 4096). Tokens outside the window are masked to −∞ before softmax. Per-layer attention complexity drops to O(n × W) — linear in sequence length for fixed W. During autoregressive decode, once the sequence exceeds W, you can evict KV entries older than the window instead of appending forever.
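
The mask rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a kernel: `sliding_window_mask` and `render` are hypothetical helper names, and a real implementation would build the mask as a tensor or skip it entirely inside a fused kernel.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [max(0, i - window + 1) <= j <= i for j in range(n)]
        for i in range(n)
    ]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with '#' for visible keys and '.' for masked ones."""
    return "\n".join("".join("#" if ok else "." for ok in row) for row in mask)


mask = sliding_window_mask(n=6, window=3)
print(render(mask))
```

With n = 6 and W = 3 the printout is a diagonal band three cells wide: the first rows form the usual causal triangle, and from row 3 onward each row sees exactly three keys.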

Effective context grows with depth

A single layer's receptive field is only W tokens, but stacked layers compound: after k layers, information can travel roughly k × W positions (Beltagy et al., Longformer; Mistral technical report). A 32-layer model with W = 4096 has a theoretical effective span of about 131k tokens (32 × 4096 = 131,072) even though no single layer sees more than 4k neighbors. That is why “32k context” models can use 4k windows without collapsing on medium-length documents: depth provides long-range routing, not a single global attention map.
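
To make the compounding concrete, the sketch below propagates reachability through a stack of local layers on a toy sequence. It assumes the window definition used earlier (W keys including the token itself), so each layer reaches back W − 1 positions and the exact figure is layers × (W − 1), which the "layers × W" rule of thumb rounds up. The function name is illustrative.

```python
def earliest_visible(n: int, window: int, layers: int) -> list[int]:
    """For each position, the earliest input position that can influence it
    after `layers` stacked sliding-window layers (window includes the token itself)."""
    earliest = list(range(n))  # before any attention, a token only sees itself
    for _ in range(layers):
        # Position i reads positions [i - window + 1, i]; the oldest of those
        # already carries information from its own earliest source.
        earliest = [earliest[max(0, i - window + 1)] for i in range(n)]
    return earliest


n, window, layers = 64, 8, 4
reach = earliest_visible(n, window, layers)
last = n - 1
print(last - reach[last])      # distance covered at the final position
print(layers * (window - 1))   # closed form for comparison
```

Both lines print the same number. Plugging in 32 layers and W = 4096 gives 32 × 4095 = 131,040 positions, which is the theoretical ceiling; how much signal survives that many hops is a separate, empirical question.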

Implementation: masks, kernels, and ring-buffer KV cache

Training applies a banded causal mask. Frameworks expose this as a custom attention mask, a sliding_window config field (Hugging Face Mistral and Gemma 2 configs), or fused kernels in Flash Attention variants that skip fully masked blocks. The mask must be consistent across prefill (many tokens processed at once) and decode (one new token per step).

Ring-buffer KV cache during decode

Once the sequence length exceeds W, the cache stops growing. Store keys and values in a circular buffer of length W: each new token overwrites the slot of the token that just fell out of the window. Position indices for RoPE still use absolute positions (token 50,000 gets position 50,000) even though only the last W KV pairs are materialized. Eviction drops exactly the entries the banded training mask already hid, so each layer sees the same keys it was trained on and logits remain calibrated.
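
A minimal sketch of that buffer for a single layer and head follows. `RingKVCache` is a hypothetical class written for illustration, and the strings stand in for key and value tensors; production servers implement the same idea with paged or block-based storage.

```python
class RingKVCache:
    """Fixed-size KV store for one layer/head; slot = absolute position % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window   # each slot holds (position, key, value)
        self.next_position = 0         # absolute position of the next token

    def append(self, key, value) -> int:
        """Store K/V for the next token, overwriting the entry that just left the window."""
        position = self.next_position
        self.slots[position % self.window] = (position, key, value)
        self.next_position += 1
        return position

    def visible(self):
        """Return cached (position, key, value) entries, oldest first."""
        entries = [slot for slot in self.slots if slot is not None]
        return sorted(entries, key=lambda entry: entry[0])


cache = RingKVCache(window=4)
for token in range(10):
    cache.append(key=f"k{token}", value=f"v{token}")

positions = [position for position, _, _ in cache.visible()]
print(positions)
```

After ten tokens with W = 4, only absolute positions 6 through 9 remain. Storing the absolute position next to each entry is the design choice that matters here: the slot index wraps around, but the position handed to RoPE never does.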

Memory per layer for decode becomes roughly 2 × W × hidden × bytes_per_elem (K and V), scaled down by the ratio of KV heads to query heads when GQA shares KV heads. Compare to full attention: 2 × n × hidden × bytes_per_elem. At n = 32k and W = 4k, that is an 8× reduction in cache footprint before accounting for GQA.
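
The arithmetic can be checked with a small calculator. The model shape below is an assumption chosen for illustration (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries); it is not Harbor Support's model, and real servers add block and fragmentation overhead on top.

```python
def kv_cache_bytes(cached_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes for K and V across all layers for one sequence."""
    return 2 * cached_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Illustrative shape (assumed, not Harbor Support's model): 32 layers,
# 8 KV heads under GQA, head_dim 128, 16-bit cache entries.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

n, window = 32 * 1024, 4 * 1024
full = kv_cache_bytes(cached_tokens=n, **shape)
windowed = kv_cache_bytes(cached_tokens=min(n, window), **shape)

print(full / 2**30, windowed / 2**30, full // windowed)
```

Under these assumptions the full cache is 4 GiB per sequence and the windowed cache is 0.5 GiB, the same 8× ratio as n / W.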

Prefill vs decode asymmetry

Prefill on a 32k prompt still computes many banded attention blocks in one kernel launch; cost is O(n × W) not O(n²), which is why long-document ingestion speeds up dramatically. Decode stays cheap because each step touches at most W prior keys. Pair with continuous batching in inference servers so variable-length sessions do not reserve full-length caches up front.
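
A quick way to size the prefill saving is to count the query-key pairs each scheme scores. The sketch below does that with closed-form sums; the function names are illustrative, and pair counts ignore kernel efficiency, so treat the ratio as a ceiling on the attention speedup rather than a latency prediction.

```python
def full_causal_pairs(n: int) -> int:
    """Query-key pairs scored by full causal attention over n tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Query-key pairs scored when each token sees at most `window` keys (itself included)."""
    w = min(n, window)
    # The first w tokens see 1, 2, ..., w keys; every later token sees exactly w.
    return w * (w + 1) // 2 + (n - w) * w


n, window = 32 * 1024, 4 * 1024
full = full_causal_pairs(n)
banded = sliding_window_pairs(n, window)
print(full, banded, round(full / banded, 2))
```

At n = 32k and W = 4k the ratio comes out a little above 4×, not 8×: causal attention is already a triangle rather than a full square, and the first W tokens gain nothing from the window. The gap widens as n grows relative to W.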

Where sliding window appears in production models

  • Mistral 7B / Mixtral — sliding_window=4096 on alternating or all layers (check config); long context via stacked windows plus YaRN-scaled RoPE on some variants.
  • Gemma 2 — local-global hybrid: layers alternate between a 4096-token sliding window and full global attention, so global signal is injected regularly without paying full cost in every layer.
  • Longformer / BigBird — research predecessors combining local windows with global tokens (CLS, task tokens) for document classification and QA.
  • Custom enterprise RAG — teams wrap base models with explicit window eviction in vLLM/TGI when serving 100k+ token archives where retrieval already localizes relevant chunks.
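
For the local-global hybrids in the list above, cache size depends on which layers are global. The sketch below assumes a hypothetical 8-layer stack that alternates local and global layers; the layer pattern is a parameter, so check the real schedule in the checkpoint's config before relying on any ratio.

```python
def cached_tokens_per_layer(n: int, window: int, layer_is_global: list[bool]) -> list[int]:
    """Tokens each layer must keep in its KV cache at sequence length n."""
    return [n if is_global else min(n, window) for is_global in layer_is_global]


# Hypothetical 8-layer stack alternating local and global layers.
pattern = [index % 2 == 1 for index in range(8)]
per_layer = cached_tokens_per_layer(n=32_768, window=4_096, layer_is_global=pattern)

total = sum(per_layer)
all_global = 32_768 * len(pattern)
print(per_layer[:2], total / all_global)
```

With half the layers global, the cache at 32k tokens is a bit over half the size of an all-global stack. The global layers dominate, which is why hybrids sit between the O(W) and O(n) extremes.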

Sliding window is not the same as speculative decoding or context extension via NTK/YaRN — it changes who attends to whom, not how fast tokens are generated or how RoPE extrapolates.

Harbor Support long-doc RAG refactor (worked example)

Problem. Harbor Support routes enterprise tickets through a 70B-class model with 32k-token prompts (retrieved policy chunks + thread history). Full attention on A100 80GB with FP8 weights: prefill peaked at 68 GB for KV cache plus activations; p95 TTFT was 1.4 s; batch size capped at 4 concurrent sessions before OOM.

Change. Enabled native sliding window (W = 4096) in the vLLM model config, matching the base Mistral training window. Retained GQA (8 KV heads). Chunk retrieval prompt layout unchanged — most relevant passages were already placed in the final 8k tokens per their reranker. Added monitoring for cache eviction events.

Results. Peak KV memory fell from 41 GB to 12 GB during 32k prefill. p95 TTFT dropped to 580 ms. Concurrent session capacity rose to 14 on the same GPU. Resolution accuracy on a 2,400-ticket holdout: 84.1% vs 84.3% full-attention baseline (within noise). Failures clustered on tickets requiring cross-references >12k tokens apart — fixed by moving those chunks adjacent in the prompt template, not by widening W.

Lesson. Sliding window rewards intentional prompt layout. Retrieval should place evidence near the question; do not rely on the model to find a needle 30k tokens back if no layer sees that far in one hop.
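
One way to act on this lesson is to order retrieved chunks by reranker score so the strongest evidence lands directly before the question. The sketch below is an assumption about how such a template could look, not Harbor Support's actual code; `layout_prompt`, the scores, and the chunk texts are all hypothetical.

```python
def layout_prompt(chunks: list[tuple[float, str]], question: str) -> str:
    """Order retrieved chunks so the highest-scoring ones sit closest to the question."""
    # Ascending score: the weakest evidence goes first, the strongest lands
    # immediately before the question at the end of the prompt.
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    body = "\n\n".join(text for _, text in ordered)
    return f"{body}\n\nQuestion: {question}"


# Hypothetical reranker scores and chunk texts.
chunks = [
    (0.91, "Refund policy: refunds within 30 days."),
    (0.35, "Office hours: 9am to 5pm."),
    (0.78, "Refund exceptions: digital goods."),
]
prompt = layout_prompt(chunks, "Can I refund a digital purchase?")
print(prompt)
```

Chunks that must be read together (the cross-reference failures described above) would need to be grouped before sorting so they stay adjacent; this sketch does not handle that case.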

Technique decision table

| Technique | Attention scope | Memory scaling | When to choose |
| --- | --- | --- | --- |
| Full causal attention | All prior tokens | O(n) KV cache | Short context, maximum recall, small models |
| Sliding window (local) | Last W tokens per layer | O(W) KV per layer | Long prompts, Mistral/Gemma-class models, RAG with local evidence |
| Local + global layers (Gemma 2) | Window + periodic full | Between O(W) and O(n) | Need occasional global token without full n² everywhere |
| Flash Attention (kernel) | Same math, faster IO | Same asymptotics | Always enable when supported; orthogonal to windowing |
| GQA / MQA | Unchanged | Fewer KV heads | Pair with sliding window for multiplicative savings |
| ALiBi / long RoPE scaling | Positional encoding | Does not cap KV size | Extrapolate positions; combine with window for memory cap |
| RAG chunking + rerank | External retrieval | Prompt-sized | When facts live outside any feasible W; layout chunks deliberately |

Common pitfalls

  • Assuming window size equals usable context — effective reach is roughly layers × W; plan prompts and retrieval accordingly.
  • Widening W without retraining or YaRN — attending beyond the trained window without position scaling hurts quality; match serving config to checkpoint metadata.
  • Burying evidence outside the local band — a citation 20k tokens above the question may be invisible to early layers; use reranking to co-locate query and evidence.
  • Forgetting global layers in hybrid models — Gemma 2 needs its full-attention layers enabled; disabling them collapses long-range reasoning.
  • Ring-buffer position bugs — incorrect RoPE indices after eviction cause silent garbage; use battle-tested servers (vLLM, TGI) or unit-test positions against a reference forward.
  • Confusing training and inference window — some models train with window W but advertise 128k via RoPE; serving still benefits from KV eviction even when positions extrapolate.
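
The first and third pitfalls can be turned into a rough planning check: how many stacked local layers must relay information before a piece of evidence can reach the question? The sketch below assumes each layer reaches back W − 1 positions (window includes the token itself); `layers_needed` is an illustrative name, and the hop count says nothing about how much signal survives each relay.

```python
import math


def layers_needed(distance: int, window: int) -> int:
    """Minimum number of stacked local layers for information to travel
    `distance` positions, when each layer reaches back window - 1 positions."""
    if window < 2:
        raise ValueError("window must be at least 2 for information to move")
    if distance <= 0:
        return 0
    return math.ceil(distance / (window - 1))


for gap in (3_000, 12_000, 20_000, 30_000):
    print(gap, layers_needed(gap, window=4_096))
```

With W = 4096, evidence 3,000 tokens back is visible in one hop, 12,000 tokens needs three, and 20,000 tokens needs five. Anything that needs several hops is a candidate for moving closer to the question.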

Production checklist

  • Read sliding_window (or your framework's equivalent window field) from model config before serving.
  • Enable fused or Flash banded attention kernels when the framework supports them.
  • Cap KV cache allocation at W per layer (ring buffer) for decode.
  • Pair sliding window with GQA/MQA for maximum KV savings.
  • Lay out RAG prompts so retrieved chunks sit near the user question.
  • Benchmark TTFT and tokens/sec at target context length with and without windowing.
  • Monitor tasks where evidence spans exceed layers × W effective reach.
  • For hybrid local-global architectures, never drop global attention layers in production.
  • Document window size and eviction policy in serving runbooks for reproducibility.
  • Regression-test long-range QA cases after any prompt-template or window change.
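
The first checklist item can be automated as a pre-deploy check. The sketch below assumes a Hugging Face style config.json with a `sliding_window` key, as Mistral-family checkpoints use; `check_window` and the inline JSON are hypothetical, and other frameworks may name the field differently.

```python
import json


def check_window(config_json: str, serving_window: int) -> str:
    """Compare the checkpoint's trained window with the one the server will use."""
    config = json.loads(config_json)
    trained = config.get("sliding_window")  # None when absent or null
    if trained is None:
        return "checkpoint declares no sliding window; serve with full attention"
    if serving_window > trained:
        return f"serving window {serving_window} exceeds trained window {trained}"
    return "ok"


# Stand-in for the contents of a checkpoint's config.json.
example = '{"num_hidden_layers": 32, "sliding_window": 4096}'
print(check_window(example, serving_window=4096))
print(check_window(example, serving_window=8192))
```

Running a check like this in CI, next to the long-range QA regression cases, catches a mismatched window before it shows up as a quiet quality drop.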

Key takeaways

  • Sliding window attention limits each token to the previous W positions, reducing attention from O(n²) to O(n×W).
  • Stacked layers compound receptive field to roughly layers×W even though each layer is local.
  • Ring-buffer KV eviction caps decode memory at O(W) per layer — the main production win for long sessions.
  • Mistral, Gemma 2, and similar models rely on windows plus depth (and sometimes global layers) for long context.
  • RAG pipelines must place evidence near the query; sliding window punishes scattered prompt layouts.
