Guide
Linear attention explained
Harbor Analytics runs compliance triage on 150–250 page PDFs uploaded to
on-prem CPU nodes with no GPU budget. A fine-tuned 7B
transformer
with standard softmax
attention
could not finish a forward pass under 90 seconds: quadratic
O(n²) matmuls dominated wall time and blew the 2 GB RAM
cap per worker. Swapping the attention block for a
linear attention layer — the family of methods that
approximate softmax scores with a positive feature map and
exploit matrix associativity — cut per-document latency to 11 seconds
with only a 1.8-point F1 drop on their section-labeling task.
Linear attention is not one trick; it is a design space:
Linear Transformer (elementwise feature maps), Performer (random Fourier
features), RetNet (retention with decay), and hybrids that interleave linear
blocks with a few full-attention layers. This guide explains the kernel
rewrite, causal recurrence for streaming inference, how linear attention
compares to
Flash Attention
(still quadratic, but IO-efficient) and
Mamba-style SSMs
(different state-update mechanics), the Harbor CPU triage refactor, an
architecture decision table, pitfalls, and a production checklist.
From softmax attention to a linear-time rewrite
Standard scaled dot-product attention computes, for each query position
i, a weighted sum over keys:
Attn(Q, K, V)i = ∑j softmax(qi·kj) vj
Materializing the full n × n score matrix costs
O(n²) time and memory. Linear attention observes that
softmax can be approximated by a kernel
k(q, k) = φ(q)Tφ(k) where
φ maps vectors to a (possibly infinite-dimensional) feature
space with non-negative inner products. Then:
Attn(Q, K, V)i = φ(qi)T (∑j φ(kj) vjT) / φ(qi)T (∑j φ(kj))
The sums over j depend only on accumulated statistics, not on
pairwise comparisons. For a sequence of length n, you compute
running aggregates in one left-to-right pass: linear time in
n when feature dimension dφ is fixed.
Training can use the same trick globally (non-causal) or with a causal mask
implemented as recurrence.
Feature maps in practice
Linear Transformer (Katharopoulos et al.) uses
φ(x) = elu(x) + 1 — simple, no randomness, easy on
CPU. Performer (Choromanski et al.) approximates the softmax
kernel with orthogonal random features (FAVOR+), often closer to true attention
at the cost of extra random projections per head. cosFormer and
related work add re-weighting to reduce approximation error on long tails.
Picking φ is not cosmetic: it changes gradient flow, numerical
stability, and how sharply the model can attend to a single distant token.
Causal recurrence and constant memory inference
For autoregressive decoding, linear attention's killer property is constant state per layer: maintain
St = St-1 + φ(kt) vtT
and zt = zt-1 + φ(kt),
then output ot = φ(qt)T St / (φ(qt)T zt).
No
KV cache
grows with sequence length in the naive form — memory is
O(dφ × dv) per head regardless of
how many tokens you have seen. That is why Harbor's CPU workers could
stream 200k tokens without paging: they never stored per-token K/V tensors.
RetNet (retention networks) generalizes this with
exponential decay per head — a learnable γ mixes
recent and distant contributions, interpolating between pure recurrence and
attention-like locality. RetNet layers train with a parallel form (like linear
attention) and decode with a compact recurrent state, which influenced several
2025–2026 hybrid LLM stacks that use retention in early layers and
full attention only in the top two blocks.
Where linear attention still loses to softmax
Sharp, sparse lookups — copying a rare token verbatim, tool-call argument extraction, needle-in-haystack at exact positions — need peaked distributions that cheap feature maps smooth out. Harbor saw this on “cite regulation paragraph” prompts: linear layers missed exact offsets until they added one standard attention layer every four blocks (a hybrid pattern also used in Longformer and Griffin-class models). For chat assistants and code models, full softmax (plus sliding windows or MLA for cache) usually wins on quality; for classification, tagging, and embedding long documents on CPU, linear attention often wins on cost.
Harbor Analytics compliance triage refactor
Harbor's pipeline labels each page of inbound PDFs as material, boilerplate, or needs review before routing to human analysts. Baseline: 7B decoder-only model, full attention, 8k sliding window with stride (lost cross-page references). Goal: single pass over full document text on 4-core Xeon workers, <15 s p95.
- Replaced 6 of 8 attention layers with Performer-style
linear attention (
m = 256random features per head, orthogonal draws frozen after init). - Kept 2 top softmax layers for global pooling signals (document-level disposition depends on scattered clauses).
- Swapped positional scheme from RoPE to ALiBi on linear blocks only — RoPE interacts poorly with some kernel approximations; ALiBi bias added directly to pre-softmax scores in the remaining full layers.
- Distilled labels from a teacher 32B model run offline on GPU; student trained with standard CE plus layer-wise hidden-state match on softmax layers only.
- Exported to ONNX with fused linear-attention recurrence ops; batch size 1 streaming inference on CPU AVX-512.
Outcome: 11.2 s p95 latency (was 94 s), 1.8 F1 drop vs teacher on held-out filings, zero OOMs in a 30-day pilot. Failure mode: long tables with repeated headers confused linear layers until they added a lightweight structure-aware chunking pass that injected page-break tokens.
Architecture decision table
| Approach | Complexity | Best when | Watch out for |
|---|---|---|---|
| Softmax + Flash Attention | O(n²) compute, optimized HBM |
Chat, code, reasoning; GPU serving; quality-first | KV cache bytes at 32k+; needs GPU kernels |
| Linear attention / Performer | O(n) time, fixed recurrent state |
Long doc classification, tagging, CPU/edge, streaming logs | Peaked sparse retrieval; kernel choice matters |
| RetNet / retention hybrids | O(n) train, compact decode state |
Hybrid LLMs wanting linear prefill + some locality control | Decay hyperparams; fewer off-the-shelf checkpoints |
| Mamba / SSM blocks | O(n) with selective gating |
Very long sequences, audio/genomics, stateful compression | Different training stack; not a drop-in attention swap |
| Sliding-window softmax | O(n × w) |
GPU LLMs with local + a few global layers (Mistral-style) | Window size trades recall vs cost |
| KV compression (GQA, MLA) | Still O(n²) scores, smaller cache |
GPU decode at 32k–128k without changing attention math | Does not help CPU quadratic matmul during prefill |
Common pitfalls
- Expecting linear attention to match softmax on generative benchmarks — MMLU and HumanEval gaps of 5–15 points are common without hybrid layers or distillation.
- Unstable denominators — when
φ(q)T ztapproaches zero, outputs explode; add epsilon and clamp feature maps to positive ranges. - Too few Performer features —
m < 64often underfits the softmax kernel; sweepmagainst your task, not only ImageNet defaults from papers. - Applying RoPE blindly — some linear formulations pair better with ALiBi or learned biases; re-run length extrapolation tests after any positional change.
- Ignoring numerical dtype — recurrent accumulation in FP16 overflows on 100k+ steps; use FP32 state with FP16 matmuls or block-wise normalization.
- Confusing linear attention with Flash Attention — Flash is an exact softmax implementation trick; linear attention changes the function approximated.
- No hybrid escape hatch — keep at least one full attention layer for global aggregation when tasks need exact long-range copy.
Production checklist
- Profile whether prefill or decode dominates; linear attention helps most when full-sequence prefill on CPU/edge is the bottleneck.
- Benchmark against a softmax teacher on your label distribution, not public LM perplexity alone.
- Sweep feature map type (elu+1 vs Performer
m) and hybrid depth (every k layers full attention). - Verify recurrent inference matches parallel training outputs within tolerance on 1k+ token sequences.
- Test numerical stability at max sequence length with realistic batching (accumulation drift).
- Export path: confirm ONNX/TorchScript supports custom recurrence ops or use reference implementations.
- Measure memory: recurrent state size vs KV cache bytes at target context for softmax baseline.
- Run needle-in-haystack or exact-span copy tests if the product needs citation or tool arguments.
- Document kernel hyperparameters in model cards so fine-tuners do not swap
φsilently. - Plan distillation or hybrid layers before promising chat-quality parity with 70B softmax models.
Key takeaways
- Linear attention approximates softmax with feature maps and associative products, reducing complexity from O(n²) to O(n) when feature dimension is fixed.
- Causal linear attention updates a compact recurrent state instead of growing a KV cache — ideal for long CPU-bound forward passes.
- Performer, Linear Transformer, and RetNet sit in one design family; pick kernels and decay based on task fidelity vs cost.
- Hybrid stacks (mostly linear + few softmax layers) often capture 90% of savings with acceptable quality on structured doc tasks.
- For GPU chat serving at scale, prefer Flash Attention plus cache compression; use linear attention where quadratic prefill is the hard limit.
Related reading
- Attention mechanism explained — Q/K/V softmax baseline linear methods approximate
- State space models and Mamba explained — another linear-time sequence family with selective gating
- Flash Attention explained — exact softmax made IO-efficient, not subquadratic
- Transformer architecture explained — where attention blocks sit in the full stack