
State space models and Mamba explained

Every transformer layer compares every token to every other token. That quadratic attention cost is manageable at 4k–128k context when you have fast GPUs and tricks like Flash Attention, but it still scales as O(n²) in sequence length. For genomics reads, high-frequency sensor logs, or hour-long audio, researchers revived a different idea: state space models (SSMs), which compress history into a fixed-size hidden state updated at each timestep. Classical RNNs did this but forgot long-range dependencies; S4 and Mamba fix that with structured linear dynamics and selective gating. Mamba runs in linear time in sequence length, uses a hardware-friendly parallel scan during training, and powers hybrid architectures that mix SSM blocks with attention layers.

This guide covers continuous-time SSMs, the HiPPO initialization, selective state spaces, how Mamba inference differs from transformer prefill, a worked example (the Harbor Fleet telemetry forecaster), an architecture decision table, common pitfalls, and a practitioner checklist. It sits alongside our transformer architecture guide, RNN and LSTM guide, and KV cache guide.

What a state space model is

A state space model describes how a latent state x(t) evolves over time and how observations y(t) are generated from that state. In continuous time, the linear SSM is:

x′(t) = A x(t) + B u(t)
y(t) = C x(t) + D u(t)

Matrices A, B, C, and D define the dynamics. For deep learning, we discretize to a timestep Δ and unroll a recurrence over tokens — much like an RNN, but with structured A matrices chosen so gradients propagate across thousands of steps without vanishing.
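The discretization step can be sketched in a few lines. This is a minimal illustration, assuming a diagonal A (so the matrix exponential is element-wise), the zero-order-hold rule, and D = 0; the function names `discretize_zoh` and `ssm_recurrence` and the numeric values are made up for the example and do not come from any library.

```python
import math

def discretize_zoh(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal A with nonzero entries.

    a_bar[i] = exp(delta * a[i])
    b_bar[i] = (exp(delta * a[i]) - 1) / a[i] * b[i]
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    b_bar = [(ab - 1.0) / a * bi for ab, a, bi in zip(a_bar, a_diag, b)]
    return a_bar, b_bar

def ssm_recurrence(a_bar, b_bar, c, u):
    """Unroll x_k = a_bar * x_{k-1} + b_bar * u_k and y_k = c . x_k (D = 0)."""
    x = [0.0] * len(a_bar)
    ys = []
    for u_k in u:
        x = [ab * xi + bb * u_k for ab, xi, bb in zip(a_bar, x, b_bar)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys

a_diag = [-1.0, -0.1]   # negative entries keep the continuous dynamics stable
b = [1.0, 1.0]
c = [0.5, 0.5]
a_bar, b_bar = discretize_zoh(a_diag, b, delta=0.1)
ys = ssm_recurrence(a_bar, b_bar, c, u=[1.0, 0.0, 0.0, 0.0])  # impulse response
```

With negative entries in A, every discrete decay factor lands strictly between 0 and 1, so the response to a single impulse fades rather than grows. The slow mode (a = -0.1) fades far more slowly than the fast one, which is the mechanism that lets a well-chosen A retain information over long horizons.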

Why not vanilla RNNs?

Vanilla RNNs learn arbitrary nonlinear transitions; in practice they struggle with long-range credit assignment — the same problem LSTMs partially solved with gates. SSMs instead parameterize A as a diagonal-plus-low-rank or normal-plus-low-rank structure, enabling stable dynamics and closed-form solutions over long horizons. You get RNN-like constant memory at inference (O(1) state per layer) without storing a growing KV cache.
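To make the memory contrast concrete, here is a rough back-of-the-envelope sketch. The layer counts and dimensions are hypothetical, the helper names are invented for this example, and the SSM estimate ignores the small convolution buffer some SSM blocks also keep.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values are both cached for every token, hence the factor of 2."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """One (d_inner x d_state) state per layer, independent of sequence length."""
    return n_layers * d_inner * d_state * bytes_per_elem

# Hypothetical model shapes, 2 bytes per element (FP16/BF16).
kv_at_1k = kv_cache_bytes(1_000, n_layers=24, n_kv_heads=16, head_dim=64)
kv_at_100k = kv_cache_bytes(100_000, n_layers=24, n_kv_heads=16, head_dim=64)
ssm_any_length = ssm_state_bytes(n_layers=24, d_inner=2048, d_state=16)

print(f"KV cache at 1k tokens:   {kv_at_1k / 1e6:.1f} MB")
print(f"KV cache at 100k tokens: {kv_at_100k / 1e6:.1f} MB")
print(f"SSM state (any length):  {ssm_any_length / 1e6:.1f} MB")
```

The KV cache grows a hundredfold between the two context lengths, while the SSM state does not depend on the number of tokens at all.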

From controls to deep sequence modeling

Control theorists used Kalman filters and SSMs for decades. The modern ML bridge came when researchers asked: what initial state and matrix structure lets a model remember an entire input history with bounded state? That question led to HiPPO (High-order Polynomial Projection Operators) and the S4 (Structured State Space Sequence) layer.

HiPPO, S4, and parallel scan

HiPPO theory defines matrices A that optimally compress an input signal into polynomial coefficients — a principled way to preserve long-range information in a fixed-size state. S4 (Gu et al., 2022) made this practical by:

  • Parameterizing A in a structured form after a change of basis (diagonal plus low-rank in the original S4, purely diagonal in follow-ups such as S4D), so the expensive matrix powers reduce to cheap, largely element-wise operations.
  • Computing the convolution kernel of the SSM in Fourier space, turning training into a global convolution that parallelizes across sequence length.
  • Avoiding sequential RNN unrolling during training: the FFT convolution costs O(n log n), and later SSMs evaluate the recurrence directly with a parallel scan in O(n) work on GPUs.
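The equivalence between the recurrence and a global convolution can be checked numerically. The sketch below assumes a diagonal, already-discretized SSM with D = 0 and uses a direct O(n²) convolution for readability where S4 would use an FFT; all names and values are illustrative.

```python
def ssm_kernel(a_bar, b_bar, c, length):
    """K[j] = sum_i c[i] * a_bar[i]**j * b_bar[i] for a diagonal discrete SSM."""
    return [sum(ci * (ab ** j) * bb for ci, ab, bb in zip(c, a_bar, b_bar))
            for j in range(length)]

def causal_conv(kernel, u):
    """Direct causal convolution: y[k] = sum_{j<=k} kernel[j] * u[k-j]."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1))
            for k in range(len(u))]

def recurrence(a_bar, b_bar, c, u):
    """Step-by-step evaluation of the same model, starting from a zero state."""
    x = [0.0] * len(a_bar)
    ys = []
    for u_k in u:
        x = [ab * xi + bb * u_k for ab, xi, bb in zip(a_bar, x, b_bar)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys

a_bar, b_bar, c = [0.9, 0.5], [0.1, 0.5], [1.0, -1.0]
u = [1.0, 2.0, -1.0, 0.5, 3.0]

y_conv = causal_conv(ssm_kernel(a_bar, b_bar, c, len(u)), u)
y_rec = recurrence(a_bar, b_bar, c, u)
max_diff = max(abs(p - q) for p, q in zip(y_conv, y_rec))
```

Because the kernel depends only on A, B, and C and not on the input, it can be computed once and applied to the whole sequence in parallel. That shortcut holds only while the dynamics are the same at every position.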

S4 matched or beat transformers on the Long Range Arena benchmark, including its long-sequence Pathfinder and Path-X tasks, while using less memory at inference. The catch: early S4 layers used input-independent dynamics, meaning the same A, B, C regardless of token content. That limits expressivity for language, where you need to selectively remember or forget based on what you just read.

S4 vs attention complexity

| Mechanism | Training (per layer) | Inference memory | Content-dependent routing |
|---|---|---|---|
| Full self-attention | O(n²) | O(n) KV cache | Yes: every pair interacts |
| S4 (fixed SSM) | O(n log n) via FFT/scan | O(1) state | No: linear time-invariant |
| Mamba (selective SSM) | O(n) scan | O(1) state | Yes: input-dependent Δ, B, C |

Mamba: selective state spaces

Mamba (Gu & Dao, 2023) makes SSM parameters functions of the input. For each token, the model predicts:

  • Step size Δ — how much to update the state at this position (analogous to LSTM forget/input gates).
  • Input and output matrices B and C: what information enters the state (B) and what is read out of it (C) at this step.

Because Δ, B, and C vary per token, the recurrence is time-varying and cannot use a single global FFT convolution. Mamba instead uses a hardware-aware selective scan: blocks of the sequence are processed on SRAM with fused kernels, similar in spirit to how Flash Attention tiles attention — but for recurrence instead of pairwise dot products.
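A toy version of the time-varying recurrence shows why a scan still parallelizes. The sketch assumes a scalar state and hypothetical scalar projections (`w_delta`, `w_b`) in place of Mamba's learned linear layers, and it uses a simple Hillis-Steele scan rather than the fused, work-efficient kernel described above. It only illustrates that the per-token updates compose associatively.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_params(u, a=-1.0, w_delta=1.0, w_b=1.0):
    """Per-token (a_k, b_k) pairs; every coefficient depends on the input u_k."""
    pairs = []
    for u_k in u:
        delta = softplus(w_delta * u_k)   # positive, input-dependent step size
        a_k = math.exp(delta * a)         # decay in (0, 1) because a < 0
        b_k = delta * (w_b * u_k) * u_k   # delta * B_k * u_k with B_k = w_b * u_k
        pairs.append((a_k, b_k))
    return pairs

def combine(left, right):
    """Compose x -> a1*x + b1 followed by x -> a2*x + b2 into one affine map."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(pairs):
    """Reference: x_k = a_k * x_{k-1} + b_k evaluated one step at a time."""
    x, out = 0.0, []
    for a_k, b_k in pairs:
        x = a_k * x + b_k
        out.append(x)
    return out

def parallel_style_scan(pairs):
    """Hillis-Steele inclusive scan: about log2(n) rounds, each parallelizable."""
    acc = list(pairs)
    step = 1
    while step < len(acc):
        acc = [combine(acc[i - step], acc[i]) if i >= step else acc[i]
               for i in range(len(acc))]
        step *= 2
    return [b for _, b in acc]

u = [0.5, -1.0, 2.0, 0.0, 1.5]
pairs = selective_params(u)
seq = sequential_scan(pairs)
par = parallel_style_scan(pairs)
max_diff = max(abs(p - q) for p, q in zip(seq, par))
```

Both functions produce the same states. The second only ever combines pairs of partial results, which is what allows a GPU to process many positions at once even though no single convolution kernel exists.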

Mamba block architecture

A typical Mamba block contains:

  1. A linear projection that expands the hidden dimension.
  2. A short depthwise causal convolution for local context (capturing n-gram-like patterns over a window of a few tokens before the recurrence sees them).
  3. The selective SSM core that propagates state along the sequence.
  4. A gated SiLU/Swish nonlinearity and projection back to model width.
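Two of the pieces listed above, the causal depthwise convolution and the SiLU gate, are simple enough to sketch directly. This is an illustrative pure-Python version with made-up weights, not the layout used by any reference implementation; it omits the projections and the SSM core.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def causal_depthwise_conv(x, weights):
    """x: list of timesteps, each a list of channel values.

    weights[ch][j] multiplies the value j steps in the past for channel ch.
    Each channel is filtered independently (depthwise), and position t only
    reads positions t, t-1, ..., so nothing leaks from the future.
    """
    out = []
    for t in range(len(x)):
        row = []
        for ch, kernel in enumerate(weights):
            acc = 0.0
            for j, w in enumerate(kernel):
                if t - j >= 0:
                    acc += w * x[t - j][ch]
            row.append(acc)
        out.append(row)
    return out

def gate(ssm_out, gate_branch):
    """Element-wise gating: SSM output times SiLU of the parallel branch."""
    return [[y * silu(g) for y, g in zip(y_row, g_row)]
            for y_row, g_row in zip(ssm_out, gate_branch)]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [[0.5, 0.25], [1.0, -1.0]]   # kernel width 2, one kernel per channel
conv_out = causal_depthwise_conv(x, weights)

# Changing only the last timestep must leave earlier outputs untouched.
x_future_changed = [[1.0, 2.0], [3.0, 4.0], [99.0, -99.0]]
conv_out_changed = causal_depthwise_conv(x_future_changed, weights)
```

The causality check at the end is worth keeping as a unit test in any custom block, since an off-by-one in the padding is an easy way to leak future information into training.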

Stacked Mamba blocks form Mamba-1 language models that are competitive with same-size transformers on perplexity, with faster generation throughput at long contexts because there is no KV cache growth. Mamba-2 (2024) reframes selective SSMs through structured state space duality (SSD), which relates them to a structured form of attention. That framing enables larger state dimensions and better GPU utilization by aligning SSM computation with matrix-multiply units.

Hybrid models

Production systems rarely go all-in on one primitive. Jamba, Zamba, and other hybrids interleave transformer attention layers with Mamba blocks — attention for precise in-context retrieval, SSM for efficient long-context mixing. Think of it as: Mamba handles "read the whole stream once," attention handles "compare these two specific tokens."
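An interleaving pattern like this is usually expressed as a per-layer schedule. The helper below is a hypothetical sketch: the function name and the one-in-four ratio are assumptions for illustration and are not the published configuration of Jamba, Zamba, or any other model.

```python
def hybrid_schedule(n_layers, attention_every):
    """Return a layer-type list with attention at every `attention_every`-th slot."""
    if attention_every < 1:
        raise ValueError("attention_every must be >= 1")
    return ["attention" if (i + 1) % attention_every == 0 else "mamba"
            for i in range(n_layers)]

schedule = hybrid_schedule(n_layers=8, attention_every=4)
n_attention = schedule.count("attention")
```

Only the attention layers in such a schedule need a KV cache, so the cache shrinks roughly in proportion to the share of attention layers.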

When SSMs beat transformers — and when they do not

SSMs shine when:

  • Sequence length dominates cost — audio waveforms, DNA, telemetry, logs where n > 100k and quadratic attention is prohibitive even with Flash Attention.
  • Streaming inference matters — constant memory state suits edge devices and real-time pipelines without KV-cache paging.
  • Data is inherently sequential — time-series forecasting, event streams, where causal recurrence matches the generative process.

Transformers still lead when:

  • In-context learning and copy tasks need arbitrary token-to-token lookups — attention's all-pairs routing is hard for fixed-state recurrence to emulate.
  • Ecosystem maturity — Hugging Face, LoRA tooling, quantization stacks, and serving engines are transformer-first; Mamba support is growing but narrower.
  • Short-context chat at scale — below ~32k tokens, optimized transformer inference with KV cache often wins on latency per token for autoregressive decoding.

For a deeper comparison of recurrence families, see RNN, LSTM, and GRU explained. For attention-specific optimizations, see Flash Attention explained.

Worked example: Harbor Fleet telemetry forecaster

Harbor Fleet operates 240 electric delivery vans. Each van streams GPS, battery state-of-charge, motor temperature, and brake events every five seconds — roughly 17,280 samples per van per day. The ops team wants a model that flags likely battery-thermal anomalies before a van derates on the highway, using the past 48 hours of multivariate telemetry per vehicle.

Why a transformer was awkward

A 48-hour window at 5-second resolution is ~34,560 timesteps per channel. Even with patching (aggregating into 60-second bins), the window is still 2,880 steps per channel, or 17,280 tokens if each of the six channels is tokenized separately. Full attention at batch size 64 across the fleet saturates GPU memory during training; at inference, maintaining a KV cache per active van is expensive for a streaming alert service.
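The window sizes follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative, and the per-channel tokenization in the last line is one possible layout, not necessarily the one Harbor Fleet used.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SAMPLE_PERIOD_S = 5
WINDOW_HOURS = 48
N_CHANNELS = 6

samples_per_day = SECONDS_PER_DAY // SAMPLE_PERIOD_S        # per van, per channel
window_steps = WINDOW_HOURS * 3600 // SAMPLE_PERIOD_S       # raw 48-hour window
binned_steps = WINDOW_HOURS * 3600 // 60                    # after 60-second bins
tokens_if_per_channel = binned_steps * N_CHANNELS           # one token per channel per bin

# Entries in a single full-attention score matrix over the raw window.
attention_pairs = window_steps ** 2

print(samples_per_day, window_steps, binned_steps, tokens_if_per_channel)
print(f"{attention_pairs:,} score entries per head per layer")
```

Over a billion score entries per head per layer for a single unbinned window is what makes full attention impractical here, before batching is even considered.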

SSM pipeline

  1. Normalize each sensor channel (z-score per van, clipped outliers from GPS glitches).
  2. Embed each timestep into a 128-d vector via a linear layer over the six channels plus learned hour-of-day features.
  3. Stack four Mamba blocks (state dimension 64, expand factor 2) — total receptive field covers the full 48-hour window through recurrent state, not explicit pairwise attention.
  4. Classification head on the final state: binary "thermal risk in next 2 hours" with focal loss because positives are rare (~0.3% of windows).
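Steps 1 and 4 of the pipeline can be sketched without any ML framework. The clip threshold and the focal-loss settings (gamma = 2, alpha = 0.25) are commonly used defaults assumed here for illustration; the source does not state which values Harbor Fleet chose.

```python
import math
import statistics

def zscore_clip(values, clip=5.0):
    """Z-score one channel for one van, then clip extreme outliers."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0   # avoid dividing by zero on flat signals
    return [max(-clip, min(clip, (v - mean) / std)) for v in values]

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one example; p is the predicted positive probability."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A GPS-style glitch: one wild reading among steady ones.
normalized = zscore_clip([0.0, 0.0, 0.0, 0.0, 100.0], clip=1.5)

easy_negative = focal_loss(p=0.01, y=0)   # confident and correct: tiny loss
missed_positive = focal_loss(p=0.01, y=1) # confident and wrong: large loss
```

The (1 - p_t)^gamma factor is what suits this loss to a ~0.3% positive rate: the many easy negatives contribute almost nothing, so the rare positives are not drowned out.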

Training uses the parallel scan on 4×A100; inference runs on a CPU+GPU edge box per region with a fixed-size recurrent state (state dimension 64 per channel in each block) that is updated every five seconds as new telemetry arrives, with no growing cache. Compared to a 6-layer transformer baseline with local window attention, the Mamba model achieved similar AUROC (0.91 vs 0.90) with 3.2× lower inference memory and 2.1× higher throughput on the streaming path.

What they kept from classical ML

SSMs are not magic — Harbor still uses a Kalman filter for GPS smoothing upstream and a gradient-boosted model on daily aggregates for fleet-level capacity planning. The Mamba layer sits where long causal sequences meet real-time constraints.

Architecture decision table

| Your problem | Favor | Why | Watch out for |
|---|---|---|---|
| Million-token genomics or audio | S4 / Mamba | Linear scaling; fixed inference state | Need domain-specific tokenization |
| Chat assistant, <32k context | Transformer + KV cache | Mature tooling; strong ICL | Quadratic training cost at scale |
| Streaming sensor anomaly detection | Mamba / SSM | Constant memory; causal by construction | Label imbalance; rare events |
| RAG over document chunks | Transformer encoder | Cross-attention to retrieved passages | Chunking and reranking still required |
| Long doc + precise copy | Hybrid (Mamba + attention) | SSM mixes; attention retrieves | More hyperparameters; newer stacks |
| Small tabular time series (<500 steps) | LightGBM / TCN | Simpler; less data hunger | SSM overkill without long context need |

Common pitfalls

  • Assuming Mamba replaces transformers everywhere — on standard NLP benchmarks at 2k–8k context, parity required careful tuning; ecosystem gaps remain.
  • Ignoring discretization stability — learned Δ must stay in a stable range; poor initialization causes exploding states early in training.
  • Skipping the local conv — the depthwise conv in Mamba blocks matters; ablations show perplexity regressions without it.
  • Training without mixed precision care — scan kernels can underflow in FP16; BF16 or loss scaling may be required.
  • Expecting zero-shot copying — selective SSMs are weaker than attention on synthetic copy and induction-head tasks unless hybridized.
  • Wrong baseline — compare against windowed attention or linear attention, not a naive unoptimized transformer.
  • Undersized state dimension — Mamba quality scales with state size; tiny states lose the long-memory advantage.
  • Deploying without streaming tests — parallel-scan training code paths differ from step-by-step inference; validate state carry-over across chunk boundaries.
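The last pitfall lends itself to an automated check. The sketch below uses a toy diagonal SSM with made-up coefficients, not a real Mamba kernel, and compares a single full pass against chunked processing with and without state handoff. The same pattern can be applied to a real model by swapping in its forward and step functions.

```python
def run_chunk(a_bar, b_bar, c, u_chunk, state=None):
    """Run a diagonal SSM over one chunk; return (outputs, final_state)."""
    x = list(state) if state is not None else [0.0] * len(a_bar)
    ys = []
    for u_k in u_chunk:
        x = [ab * xi + bb * u_k for ab, xi, bb in zip(a_bar, x, b_bar)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys, x

def run_chunked(a_bar, b_bar, c, u, chunk_size, carry_state=True):
    """Process u in chunks, optionally handing the state to the next chunk."""
    state, ys = None, []
    for start in range(0, len(u), chunk_size):
        out, new_state = run_chunk(a_bar, b_bar, c, u[start:start + chunk_size], state)
        ys.extend(out)
        state = new_state if carry_state else None   # None simulates the bug
    return ys

a_bar, b_bar, c = [0.9, 0.5], [0.1, 0.5], [1.0, 1.0]
u = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, 1.0]

full, _ = run_chunk(a_bar, b_bar, c, u)
chunked = run_chunked(a_bar, b_bar, c, u, chunk_size=3)
broken = run_chunked(a_bar, b_bar, c, u, chunk_size=3, carry_state=False)
max_err = max(abs(p - q) for p, q in zip(full, chunked))
```

The outputs of the broken variant match the full pass inside the first chunk and diverge right after the first boundary, which is the signature to look for when a streaming deployment silently drops state.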

Practitioner checklist

  • Quantify sequence length n and whether quadratic attention is the actual bottleneck (profile HBM vs FLOPs).
  • Confirm causality requirements — SSMs are naturally causal; bidirectional tasks need encoder-style variants or two-pass models.
  • Start from a reference implementation (mamba-ssm, state-spaces/mamba) before writing custom scans.
  • Match state dimension and expand factor to GPU SRAM constraints on target hardware.
  • Use learning-rate warmup — selective SSMs can be sensitive in the first thousand steps.
  • Benchmark inference as tokens per second at fixed context, not just training loss.
  • Test state reset behavior at session boundaries (new document, new vehicle, new user).
  • For language tasks, compare against a hybrid baseline (e.g., attention every 4th layer).
  • Monitor gradient norms on Δ parameters — spikes often precede NaNs.
  • Document whether production uses chunked prefill with state handoff; bugs at chunk and document boundaries are common.
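For the state-reset item in the checklist, a per-session state store makes the intended behavior explicit and testable. The class below is a hypothetical sketch built on a toy diagonal SSM; the class name, session ids, and coefficients are invented for the example.

```python
class StreamingStateStore:
    """Keeps one fixed-size SSM state per session (for example, per vehicle)."""

    def __init__(self, a_bar, b_bar):
        self.a_bar = a_bar
        self.b_bar = b_bar
        self._states = {}

    def step(self, session_id, u_k):
        """Advance one session by one sample and return its new state."""
        x = self._states.get(session_id, [0.0] * len(self.a_bar))
        x = [ab * xi + bb * u_k for ab, xi, bb in zip(self.a_bar, x, self.b_bar)]
        self._states[session_id] = x
        return x

    def reset(self, session_id):
        """Drop the state at a session boundary so history cannot leak across it."""
        self._states.pop(session_id, None)

store = StreamingStateStore(a_bar=[0.9, 0.5], b_bar=[0.1, 0.5])
first = store.step("van-017", 1.0)
store.step("van-042", 3.0)                   # a different session, separate state
before_reset = store.step("van-017", 1.0)    # builds on van-017's earlier sample
store.reset("van-017")
after_reset = store.step("van-017", 1.0)     # starts again from a zero state
```

After the reset, the same input reproduces the very first output for that session, which is a convenient assertion for an integration test.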

Key takeaways

  • State space models maintain a fixed-size hidden state updated each timestep — linear cost in sequence length at inference.
  • S4 introduced structured matrices and parallelizable training via a global convolution; Mamba adds input-dependent selective dynamics, trained with a parallel scan, for language-scale expressivity.
  • Mamba avoids KV-cache growth, making long streaming sequences and edge deployment more tractable than full attention.
  • Hybrid architectures (Mamba + attention) are emerging as the pragmatic default for long-context LLMs.
  • Choose SSMs when length and streaming dominate; choose transformers when arbitrary in-context retrieval and tooling maturity dominate.
