Guide

Linear attention explained

Harbor Analytics runs compliance triage on 150–250 page PDFs uploaded to on-prem CPU nodes with no GPU budget. A fine-tuned 7B transformer with standard softmax attention could not finish a forward pass under 90 seconds: quadratic O(n²) matmuls dominated wall time and blew the 2 GB RAM cap per worker. Swapping the attention block for a linear attention layer — the family of methods that approximate softmax scores with a positive feature map and exploit matrix associativity — cut per-document latency to 11 seconds with only a 1.8-point F1 drop on their section-labeling task. Linear attention is not one trick; it is a design space: Linear Transformer (elementwise feature maps), Performer (random Fourier features), RetNet (retention with decay), and hybrids that interleave linear blocks with a few full-attention layers. This guide explains the kernel rewrite, causal recurrence for streaming inference, how linear attention compares to Flash Attention (still quadratic, but IO-efficient) and Mamba-style SSMs (different state-update mechanics), the Harbor CPU triage refactor, an architecture decision table, pitfalls, and a production checklist.

From softmax attention to a linear-time rewrite

Standard scaled dot-product attention computes, for each query position i, a weighted sum over keys:

Attn(Q, K, V)_i = ∑_j softmax(q_i·k_j) v_j

Materializing the full n × n score matrix costs O(n²) time and memory. Linear attention observes that softmax can be approximated by a kernel k(q, k) = φ(q)^Tφ(k) where φ maps vectors to a (possibly infinite-dimensional) feature space with non-negative inner products. Then:

Attn(Q, K, V)_i = φ(q_i)^T (∑_j φ(k_j) v_j^T) / φ(q_i)^T (∑_j φ(k_j))

The sums over j depend only on accumulated statistics, not on pairwise comparisons. For a sequence of length n, you compute running aggregates in one left-to-right pass: linear time in n when feature dimension d_φ is fixed. Training can use the same trick globally (non-causal) or with a causal mask implemented as recurrence.

Feature maps in practice

Linear Transformer (Katharopoulos et al.) uses φ(x) = elu(x) + 1 — simple, no randomness, easy on CPU. Performer (Choromanski et al.) approximates the softmax kernel with orthogonal random features (FAVOR+), often closer to true attention at the cost of extra random projections per head. cosFormer and related work add re-weighting to reduce approximation error on long tails. Picking φ is not cosmetic: it changes gradient flow, numerical stability, and how sharply the model can attend to a single distant token.

Causal recurrence and constant memory inference

For autoregressive decoding, linear attention's killer property is constant state per layer: maintain

S_t = S_t-1 + φ(k_t) v_t^T and z_t = z_t-1 + φ(k_t),

then output o_t = φ(q_t)^T S_t / (φ(q_t)^T z_t). No KV cache grows with sequence length in the naive form — memory is O(d_φ × d_v) per head regardless of how many tokens you have seen. That is why Harbor's CPU workers could stream 200k tokens without paging: they never stored per-token K/V tensors.

RetNet (retention networks) generalizes this with exponential decay per head — a learnable γ mixes recent and distant contributions, interpolating between pure recurrence and attention-like locality. RetNet layers train with a parallel form (like linear attention) and decode with a compact recurrent state, which influenced several 2025–2026 hybrid LLM stacks that use retention in early layers and full attention only in the top two blocks.

Where linear attention still loses to softmax

Sharp, sparse lookups — copying a rare token verbatim, tool-call argument extraction, needle-in-haystack at exact positions — need peaked distributions that cheap feature maps smooth out. Harbor saw this on “cite regulation paragraph” prompts: linear layers missed exact offsets until they added one standard attention layer every four blocks (a hybrid pattern also used in Longformer and Griffin-class models). For chat assistants and code models, full softmax (plus sliding windows or MLA for cache) usually wins on quality; for classification, tagging, and embedding long documents on CPU, linear attention often wins on cost.

Harbor Analytics compliance triage refactor

Harbor's pipeline labels each page of inbound PDFs as material, boilerplate, or needs review before routing to human analysts. Baseline: 7B decoder-only model, full attention, 8k sliding window with stride (lost cross-page references). Goal: single pass over full document text on 4-core Xeon workers, <15 s p95.

Replaced 6 of 8 attention layers with Performer-style linear attention (m = 256 random features per head, orthogonal draws frozen after init).
Kept 2 top softmax layers for global pooling signals (document-level disposition depends on scattered clauses).
Swapped positional scheme from RoPE to ALiBi on linear blocks only — RoPE interacts poorly with some kernel approximations; ALiBi bias added directly to pre-softmax scores in the remaining full layers.
Distilled labels from a teacher 32B model run offline on GPU; student trained with standard CE plus layer-wise hidden-state match on softmax layers only.
Exported to ONNX with fused linear-attention recurrence ops; batch size 1 streaming inference on CPU AVX-512.

Outcome: 11.2 s p95 latency (was 94 s), 1.8 F1 drop vs teacher on held-out filings, zero OOMs in a 30-day pilot. Failure mode: long tables with repeated headers confused linear layers until they added a lightweight structure-aware chunking pass that injected page-break tokens.

Architecture decision table

Approach	Complexity	Best when	Watch out for
Softmax + Flash Attention	`O(n²)` compute, optimized HBM	Chat, code, reasoning; GPU serving; quality-first	KV cache bytes at 32k+; needs GPU kernels
Linear attention / Performer	`O(n)` time, fixed recurrent state	Long doc classification, tagging, CPU/edge, streaming logs	Peaked sparse retrieval; kernel choice matters
RetNet / retention hybrids	`O(n)` train, compact decode state	Hybrid LLMs wanting linear prefill + some locality control	Decay hyperparams; fewer off-the-shelf checkpoints
Mamba / SSM blocks	`O(n)` with selective gating	Very long sequences, audio/genomics, stateful compression	Different training stack; not a drop-in attention swap
Sliding-window softmax	`O(n × w)`	GPU LLMs with local + a few global layers (Mistral-style)	Window size trades recall vs cost
KV compression (GQA, MLA)	Still `O(n²)` scores, smaller cache	GPU decode at 32k–128k without changing attention math	Does not help CPU quadratic matmul during prefill

Common pitfalls

Expecting linear attention to match softmax on generative benchmarks — MMLU and HumanEval gaps of 5–15 points are common without hybrid layers or distillation.
Unstable denominators — when φ(q)^T z_t approaches zero, outputs explode; add epsilon and clamp feature maps to positive ranges.
Too few Performer features — m < 64 often underfits the softmax kernel; sweep m against your task, not only ImageNet defaults from papers.
Applying RoPE blindly — some linear formulations pair better with ALiBi or learned biases; re-run length extrapolation tests after any positional change.
Ignoring numerical dtype — recurrent accumulation in FP16 overflows on 100k+ steps; use FP32 state with FP16 matmuls or block-wise normalization.
Confusing linear attention with Flash Attention — Flash is an exact softmax implementation trick; linear attention changes the function approximated.
No hybrid escape hatch — keep at least one full attention layer for global aggregation when tasks need exact long-range copy.

Production checklist

Profile whether prefill or decode dominates; linear attention helps most when full-sequence prefill on CPU/edge is the bottleneck.
Benchmark against a softmax teacher on your label distribution, not public LM perplexity alone.
Sweep feature map type (elu+1 vs Performer m) and hybrid depth (every k layers full attention).
Verify recurrent inference matches parallel training outputs within tolerance on 1k+ token sequences.
Test numerical stability at max sequence length with realistic batching (accumulation drift).
Export path: confirm ONNX/TorchScript supports custom recurrence ops or use reference implementations.
Measure memory: recurrent state size vs KV cache bytes at target context for softmax baseline.
Run needle-in-haystack or exact-span copy tests if the product needs citation or tool arguments.
Document kernel hyperparameters in model cards so fine-tuners do not swap φ silently.
Plan distillation or hybrid layers before promising chat-quality parity with 70B softmax models.

Key takeaways

Linear attention approximates softmax with feature maps and associative products, reducing complexity from O(n²) to O(n) when feature dimension is fixed.
Causal linear attention updates a compact recurrent state instead of growing a KV cache — ideal for long CPU-bound forward passes.
Performer, Linear Transformer, and RetNet sit in one design family; pick kernels and decay based on task fidelity vs cost.
Hybrid stacks (mostly linear + few softmax layers) often capture 90% of savings with acceptable quality on structured doc tasks.
For GPU chat serving at scale, prefer Flash Attention plus cache compression; use linear attention where quadratic prefill is the hard limit.