
LLM interpretability explained

A fraud classifier flags a legitimate wire transfer because the memo contains the word “urgent.” The compliance team asks why. Your LLM routing layer says “high risk tone” with no further detail. Without LLM interpretability tooling, you are stuck re-prompting and guessing. Interpretability is the practice of measuring which representations and circuits inside a transformer correlate with, or causally drive, specific behaviors.

This guide separates post-hoc explanations from mechanistic analysis; covers probing, attribution maps, sparse autoencoders, and activation patching; explains faithfulness limits; walks through a Harbor Analytics refusal debugger; compares methods in a decision table; lists common pitfalls; and ends with a production checklist. For the underlying architecture, see our transformer architecture primer; for attention math, see attention mechanisms; for measuring output quality, see LLM evaluation and benchmarking.

Interpretability vs explainability

Explainability (XAI) usually means producing a human-readable rationale after the model decides — saliency heatmaps, natural-language justifications, or counterfactual edits. Interpretability goes deeper: it asks what information is encoded in hidden states and which components implement which algorithms.

Why production teams care

  • Debugging regressions — a fine-tune broke refusal behavior; you need to know whether safety vectors moved or retrieval poisoned context.
  • Compliance and audit — regulators increasingly expect evidence that automated decisions are not driven solely by protected attributes, even when the model is a black box.
  • Safety research — detecting deception, sycophancy, or backdoor triggers requires causal tests, not vibes.
  • Product trust — showing merchants which document spans influenced an answer reduces support tickets more than a generic “the AI said so.”

No method gives a complete map of a 70B-parameter model. The goal is actionable partial understanding with known failure modes.

Probing: what do hidden states know?

Probing trains a small classifier on frozen layer activations to predict a property (sentiment, language, toxicity, POS tags). If a linear probe on layer 12 predicts toxicity with 95% accuracy, the representation likely encodes toxicity — but probes can also read information that the main model never uses downstream.

Linear vs non-linear probes

Linear probes test whether information is linearly decodable. MLP probes add capacity and risk overfitting to probe artifacts. Report both probe accuracy and a control task (random labels) to detect trivial solutions.
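
The probe-plus-control comparison can be sketched in a few lines. This is a minimal illustration on synthetic data, not a real model: the "activations" are random vectors in which one dimension is made to carry the property, and the names `train_linear_probe`, `accuracy`, and `selectivity` are assumptions introduced here. In practice you would substitute activations captured from a frozen layer.

```python
import math
import random

def train_linear_probe(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp so exp() stays numerically safe
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, X, y):
    hits = sum(int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (label == 1))
               for x, label in zip(X, y))
    return hits / len(X)

rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(400)]
# Synthetic stand-in for layer activations: dimension 0 carries the property,
# the other seven dimensions are pure noise.
acts = [[(2 * lab - 1) + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(7)]
        for lab in labels]
# Control task: labels drawn independently of the activations.
control_labels = [rng.randint(0, 1) for _ in labels]

train, test = slice(0, 300), slice(300, 400)
w, b = train_linear_probe(acts[train], labels[train])
probe_acc = accuracy(w, b, acts[test], labels[test])

w_c, b_c = train_linear_probe(acts[train], control_labels[train])
control_acc = accuracy(w_c, b_c, acts[test], control_labels[test])

# A large gap suggests the probe reads real structure rather than fitting noise.
selectivity = probe_acc - control_acc
print(f"probe={probe_acc:.2f} control={control_acc:.2f} selectivity={selectivity:.2f}")
```

Reporting the gap alongside raw accuracy makes it harder for a high-capacity probe to pass off memorization as evidence of encoding.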

Layer-wise analysis

Plot probe accuracy by layer and token position. Early layers often encode syntax; middle layers semantics; late layers task-specific logit precursors. A spike at layer 18 for “contains PII” tells you where to attach guardrail heads without retraining the full stack.

Limitations

Probing shows presence of information, not use. The model may encode gender from a name but ignore it for salary recommendations. Follow promising probes with causal interventions (below).

Attribution: which tokens mattered?

Attribution methods score input tokens (or neurons) by their influence on an output logit or loss. They are cheap and UI-friendly but often unfaithful.

Attention weights (use with caution)

Raw attention maps are not explanations — attention is one routing mechanism among many residual paths. Attention rollout and attention flow aggregate heads into softer maps that correlate better with importance, but still mislead on counterfactual tasks.
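
For reference, attention rollout can be sketched as follows, assuming you already have one head-averaged attention matrix per layer (rows sum to 1). The 0.5/0.5 mix with the identity is the usual way to account for the residual connection; the two small matrices are made-up values for illustration. The result is still a heuristic and is subject to the same caveats as above.

```python
def matmul(a, b):
    """Plain-Python matrix product for small square matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def attention_rollout(layers):
    """Combine per-layer attention into one token-to-token map.

    layers: list of head-averaged attention matrices, first layer first.
    """
    n = len(layers[0])
    rollout = [[float(i == j) for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix in the identity to model the residual path, keeping rows normalized.
        mixed = [[0.5 * attn[i][j] + 0.5 * float(i == j) for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout

# Hypothetical causal attention for a 3-token sequence, two layers.
layer1 = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
layer2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.1, 0.1]]
rollout = attention_rollout([layer1, layer2])

# Influence of each input token on the final position.
print([round(v, 4) for v in rollout[2]])  # [0.4675, 0.12, 0.4125]
```

Note how the last token's self-weight is 0.1 in the raw final-layer map but about 0.41 after rollout: the aggregated view can differ substantially from any single layer.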

Gradient-based methods

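Integrated Gradients, Grad-CAM adaptations, and SmoothGrad use gradients of the target logit with respect to the input to estimate how much each token contributes. Because token IDs are discrete, the gradients are taken with respect to token embeddings and then summed per token; normalize scores per sequence to compare spans fairly.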
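
A minimal sketch of Integrated Gradients on a toy model shows the mechanics. Everything here is an assumption made for illustration: `toy_prob` stands in for a real model, gradients are estimated with finite differences instead of autograd, and the baseline is an all-zero embedding. With a real model you would use your framework's autograd or an attribution library.

```python
import math

WEIGHTS = [1.5, -0.5, 0.2, 0.1, -2.0, 1.0]  # hypothetical toy model weights

def toy_prob(flat_embeddings):
    """Stand-in model: sigmoid of a linear readout over flattened token embeddings."""
    z = sum(w * e for w, e in zip(WEIGHTS, flat_embeddings))
    return 1.0 / (1.0 + math.exp(-z))

def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """Midpoint Riemann approximation of the path integral from baseline to x."""
    grad_sum = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(len(x)):
            up, down = point[:], point[:]
            up[i] += eps
            down[i] -= eps
            grad_sum[i] += (f(up) - f(down)) / (2 * eps)  # central difference
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]

EMBED_DIM = 2
embeddings = [1.0, 0.5, 0.2, -0.3, 0.4, 0.9]  # three tokens, two dims each
baseline = [0.0] * len(embeddings)

attr = integrated_gradients(toy_prob, embeddings, baseline)
# Sum over embedding dimensions to get one score per token.
token_scores = [sum(attr[i:i + EMBED_DIM]) for i in range(0, len(attr), EMBED_DIM)]
# Normalize per sequence so scores are comparable across prompts.
total = sum(abs(s) for s in token_scores)
normalized = [s / total for s in token_scores]

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attr) - (toy_prob(embeddings) - toy_prob(baseline)))
print(normalized, completeness_gap)
```

The completeness check is a cheap sanity test: if the gap is large, the step count is too low or the baseline is poorly chosen.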

Perturbation baselines

Leave-one-out masking, LIME-style random masking of text spans, and SHAP sampling approximate feature contributions. They are slow but intuitive for legal reviewers. Cache model forward passes when testing many masks.
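
Leave-one-out masking with a forward-pass cache might look like the sketch below. The scoring function `toy_risk_score` and its keyword weights are hypothetical stand-ins for a real model call; only the masking loop and the caching pattern are the point.

```python
from functools import lru_cache

KEYWORD_WEIGHTS = {"urgent": 0.6, "wire": 0.2, "transfer": 0.1}  # hypothetical

@lru_cache(maxsize=None)
def toy_risk_score(tokens):
    """Stand-in for a model forward pass; takes a tuple so results are cacheable."""
    return min(1.0, 0.05 + sum(KEYWORD_WEIGHTS.get(t, 0.0) for t in tokens))

def leave_one_out(tokens, mask="[MASK]"):
    """Importance of token i = score(full input) - score(input with token i masked)."""
    full = toy_risk_score(tuple(tokens))
    scores = []
    for i in range(len(tokens)):
        masked = tuple(mask if j == i else t for j, t in enumerate(tokens))
        scores.append(full - toy_risk_score(masked))
    return scores

memo = ["urgent", "wire", "transfer", "for", "rent"]
importance = leave_one_out(memo)
misses_after_first = toy_risk_score.cache_info().misses

# A second review of the same memo is served entirely from the cache.
leave_one_out(memo)
misses_after_second = toy_risk_score.cache_info().misses

for token, score in zip(memo, importance):
    print(f"{token:10s} {score:+.2f}")
```

The cost is one forward pass per token plus one for the full input, which is why this approach suits short prompts better than long agent traces.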

Faithfulness tests

If an attribution is faithful, removing the tokens it ranks highest should make the target probability drop. Run comprehensiveness (remove the top-k attributed tokens) and sufficiency (keep only the top-k) benchmarks before showing heatmaps to customers.
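
Both metrics can be sketched against a toy scorer. Here comprehensiveness is the score drop after masking the top-k tokens (higher is better) and sufficiency is the score drop when only the top-k tokens are kept (lower is better). The scorer and the two attribution vectors are made up for illustration.

```python
KEYWORD_WEIGHTS = {"urgent": 0.6, "wire": 0.2, "transfer": 0.1}  # hypothetical

def toy_risk_score(tokens):
    """Stand-in for the model's probability of the target class."""
    return min(1.0, 0.05 + sum(KEYWORD_WEIGHTS.get(t, 0.0) for t in tokens))

def top_k_indices(attributions, k):
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return set(order[:k])

def comprehensiveness(tokens, attributions, k, mask="[MASK]"):
    """Score drop when the top-k attributed tokens are masked out."""
    top = top_k_indices(attributions, k)
    removed = [mask if i in top else t for i, t in enumerate(tokens)]
    return toy_risk_score(tokens) - toy_risk_score(removed)

def sufficiency(tokens, attributions, k, mask="[MASK]"):
    """Score drop when only the top-k attributed tokens are kept."""
    top = top_k_indices(attributions, k)
    kept = [t if i in top else mask for i, t in enumerate(tokens)]
    return toy_risk_score(tokens) - toy_risk_score(kept)

memo = ["urgent", "wire", "transfer", "for", "rent"]
good_attr = [0.6, 0.2, 0.1, 0.0, 0.0]   # ranks the tokens that actually drive the score
bad_attr = [0.0, 0.1, 0.0, 0.9, 0.5]    # a plausible-looking but unfaithful heatmap

for name, attr in [("good", good_attr), ("bad", bad_attr)]:
    comp = comprehensiveness(memo, attr, k=2)
    suff = sufficiency(memo, attr, k=2)
    print(f"{name}: comprehensiveness={comp:.2f} sufficiency={suff:.2f}")
```

An attribution method that scores no better than a random ranking on these two numbers should not reach a customer-facing UI.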

Mechanistic interpretability: circuits and features

The mechanistic school treats transformers as programs built from reusable subcircuits — induction heads, previous-token heads, entity trackers. Research tools are entering production monitoring stacks.

Sparse autoencoders (SAEs)

SAEs are trained on residual-stream activations to decompose dense vectors into sparse, more interpretable features (e.g., “mentions SQL injection,” “German formal register”). Resources such as Neuronpedia and open SAE weights for Llama and GPT-2 layers let engineers search for firing features on failure prompts.
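
The encode, decode, and feature-search steps can be sketched with a hand-set toy dictionary. Nothing here is trained: the weights, the tied encoder, the 3-dimensional "residual stream", and the feature labels are all assumptions for illustration. A real SAE has thousands of learned features and is loaded from published weights.

```python
def relu(v):
    return [max(0.0, x) for x in v]

# Hand-set toy dictionary: each row is one feature direction in a 3-dim residual stream.
DECODER = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, -1.0],
]
ENCODER = DECODER            # tied weights, an assumption made for brevity
ENC_BIAS = [0.0, 0.0, 0.0, 0.0]
FEATURE_LABELS = ["mentions SQL injection", "German formal register",
                  "medical symptom", "graphic injury"]  # hypothetical labels

def sae_encode(resid):
    """Sparse feature activations: ReLU(W_enc @ resid + b_enc)."""
    pre = [sum(w * x for w, x in zip(row, resid)) + b for row, b in zip(ENCODER, ENC_BIAS)]
    return relu(pre)

def sae_decode(features):
    """Reconstruct the residual vector as a weighted sum of feature directions."""
    return [sum(f * row[d] for f, row in zip(features, DECODER))
            for d in range(len(DECODER[0]))]

def rank_features_by_gap(failure_resids, control_resids):
    """Rank features by mean activation on failure prompts minus control prompts."""
    def mean_acts(resids):
        acts = [sae_encode(r) for r in resids]
        return [sum(col) / len(acts) for col in zip(*acts)]
    fail, ctrl = mean_acts(failure_resids), mean_acts(control_resids)
    gaps = [(i, f - c) for i, (f, c) in enumerate(zip(fail, ctrl))]
    return sorted(gaps, key=lambda item: item[1], reverse=True)

failure_resids = [[0.1, 0.0, -1.5], [0.0, 0.2, -1.0]]  # made-up activations
control_resids = [[0.1, 0.0, 1.2], [0.2, 0.1, 0.8]]

ranking = rank_features_by_gap(failure_resids, control_resids)
reconstruction = sae_decode(sae_encode(failure_resids[0]))
for idx, gap in ranking:
    print(f"{FEATURE_LABELS[idx]:26s} gap={gap:+.2f}")
```

The ranking is descriptive only. A feature that fires more on failures is a lead to test with a causal intervention, not a confirmed cause.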

Activation patching and causal tracing

Run a clean prompt and a corrupted prompt (swap one token, flip language, inject a backdoor trigger). Copy activations from clean into corrupted at a specific layer/head and measure recovery of the clean output. Patches that restore behavior identify causal sites — far stronger evidence than probes.
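
The clean/corrupted/patched loop can be sketched on a toy residual model. The three "MLP" components, the residual layout, and the recovery metric are assumptions chosen so the arithmetic is easy to follow; with a real transformer you would cache and override component outputs through forward hooks.

```python
# Toy residual stream layout (hypothetical): [content, trigger, risk]
def mlp0(resid):
    return [0.0, 0.0, 2.0 * resid[1]]            # reads the trigger, writes risk

def mlp1(resid):
    return [0.0, 0.0, 0.1 * resid[0]]            # reads content, writes a little risk

def mlp2(resid):
    return [0.0, 0.0, 0.5 * max(0.0, resid[2])]  # amplifies risk already present

COMPONENTS = [("mlp0", mlp0), ("mlp1", mlp1), ("mlp2", mlp2)]

def run(x, patch=None):
    """Forward pass; `patch` maps a component name to an output that replaces its own."""
    resid, cache = list(x), {}
    for name, fn in COMPONENTS:
        out = patch[name] if patch and name in patch else fn(resid)
        cache[name] = out
        resid = [r + o for r, o in zip(resid, out)]
    return resid[2], cache  # the "logit" is the risk coordinate

clean_input = [1.0, 0.0, 0.0]       # no trigger
corrupted_input = [1.0, 1.0, 0.0]   # trigger injected

clean_logit, clean_cache = run(clean_input)
corrupted_logit, _unused = run(corrupted_input)

recovery = {}
for name, _fn in COMPONENTS:
    # Copy one component's clean output into the corrupted run.
    patched_logit, _cache = run(corrupted_input, patch={name: clean_cache[name]})
    # 1.0 means the clean output is fully restored, 0.0 means no effect.
    recovery[name] = (corrupted_logit - patched_logit) / (corrupted_logit - clean_logit)

for name, value in recovery.items():
    print(f"{name}: recovery={value:.2f}")
```

In this toy, patching `mlp0` restores the clean output completely, `mlp1` has no effect, and `mlp2` recovers only part of it, which is the kind of graded picture patching is meant to produce.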

Path patching and knockouts

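Zero-ablate attention heads or MLP neurons and measure task metric degradation; path patching narrows the test further by intervening only on the connection between a chosen sender component and a chosen receiver. Head knockouts helped map induction circuitry in small models; at 70B scale, use sampled heads and approximate influence functions unless you own a cluster.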
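
A knockout sweep, including the sampled variant for large models, can be sketched as follows. The toy score function hard-codes a two-head circuit (a previous-token head feeding an induction head) purely for illustration; with a real model, `toy_task_score` would run the eval set with the listed heads zeroed.

```python
import random

HEADS = [(layer, head) for layer in range(2) for head in range(3)]
CIRCUIT = {(0, 1), (1, 0)}  # hypothetical: previous-token head feeding an induction head

def toy_task_score(ablated=frozenset()):
    """Stand-in task metric with the given (layer, head) pairs zero-ablated."""
    def active(head):
        return 0.0 if head in ablated else 1.0
    # The circuit only works when both of its heads are intact.
    induction = active((0, 1)) * active((1, 0))
    background = 0.02 * sum(active(h) for h in HEADS if h not in CIRCUIT)
    return 0.8 * induction + background

def knockout_drops(heads):
    """Metric drop from ablating each head on its own."""
    baseline = toy_task_score()
    return {h: baseline - toy_task_score(frozenset([h])) for h in heads}

# Exhaustive sweep: feasible for small models.
drops = knockout_drops(HEADS)
top_two = set(sorted(drops, key=drops.get, reverse=True)[:2])

# Sampled sweep: test a random subset when the full sweep is too expensive.
rng = random.Random(0)
sampled_drops = knockout_drops(rng.sample(HEADS, 3))

for head, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"layer {head[0]} head {head[1]}: drop={drop:.2f}")
```

Zero-ablation can push activations off-distribution; replacing a head's output with its mean over a reference set is a common alternative worth comparing against.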

Interpretability for RAG and tool-using agents

Production systems add retrieval chunks and tool JSON to the context window. Interpretability must cover the full pipeline, not just the base model.

  • Chunk attribution — score retrieved passages with leave-one-out perplexity or counterfactual removal from context; surface the three spans that moved the answer most.
  • Tool-call tracing — log which tool outputs changed the next-token distribution (store pre/post hidden-state cosine shift when feasible).
  • Multi-hop failures — when the model cites the wrong doc, check whether retrieval or generation failed by patching embeddings from an alternate chunk.

Pair chunk attribution with hallucination debugging workflows: if removing the cited chunk does not change the answer, the model invented the citation.
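
The counterfactual-removal check can be sketched like this. `toy_answer_confidence` is a hypothetical stand-in for the model's probability of its original answer given a context; the chunk texts, the 0.05 threshold, and the function names are illustrative assumptions.

```python
def toy_answer_confidence(chunks):
    """Stand-in for P(original answer | context): high only if a chunk supports it."""
    supported = any("30 days" in text for text in chunks.values())
    return 0.2 + (0.7 if supported else 0.0)

def chunk_influence(chunks):
    """Leave-one-out: confidence drop when each chunk is removed from the context."""
    full = toy_answer_confidence(chunks)
    return {
        cid: full - toy_answer_confidence({k: v for k, v in chunks.items() if k != cid})
        for cid in chunks
    }

def citation_is_supported(cited_id, influence, threshold=0.05):
    """A cited chunk whose removal barely moves the answer is a suspect citation."""
    return influence.get(cited_id, 0.0) >= threshold

chunks = {
    "doc_a": "Refunds are accepted within 30 days of purchase.",
    "doc_b": "Shipping takes five business days.",
    "doc_c": "Gift cards cannot be exchanged for cash.",
}
influence = chunk_influence(chunks)

# Surface the chunks that moved the answer most.
top_chunks = sorted(influence, key=influence.get, reverse=True)[:3]
print(top_chunks, influence)
print("doc_b citation supported:", citation_is_supported("doc_b", influence))
```

In this example the answer depends on `doc_a`, so a response that cites `doc_b` would be flagged for review.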

Worked example: Harbor Analytics refusal debugger

Harbor Analytics ships a content-policy layer on a fine-tuned Llama classifier. After a June deploy, the false refusal rate on benign medical FAQs jumped to 12%. Debug workflow:

  1. Slice evals — cluster false refusals; top pattern: questions containing “chest” without violence context.
  2. Linear probes — train toxicity probes per layer on frozen base vs fine-tuned weights; largest shift at layer 22 MLP output.
  3. SAE feature search — feature #1847 (“graphic injury”) fires on “chest pain” prompts only post-fine-tune; pre-fine-tune feature #902 (“medical symptom”) suppressed.
  4. Activation patching — patch layer 22 residuals from pre-fine-tune checkpoint into refusal runs; false refusal rate drops from 12% to 2%.
  5. Data fix — fine-tuning set over-weighted trauma forum examples; rebalance and add 2k clinical FAQ pairs.
  6. Production guard — ship a lightweight probe head on layer 22 that gates obvious medical-intent prompts to a specialist model path.

Total engineer time: three days. Without interpretability, the team would have rolled back the entire fine-tune and lost legitimate safety gains on self-harm content.

Method decision table

| Method | Best for | Cost / skill | Faithfulness |
| --- | --- | --- | --- |
| Linear probing | Discover if property is encoded; layer selection for heads | Low; GPU hours on sample set | Correlational |
| Integrated Gradients | Token saliency for single-turn QA | Low per query; scales with sequence length | Medium; test with knockouts |
| Leave-one-out masking | Legal/compliance review of short prompts | High (many forward passes) | High for local importance |
| Sparse autoencoders | Feature vocabulary for monitoring and safety | Medium; needs SAE weights per layer | Descriptive; validate causally |
| Activation patching | Confirm causal layers for a behavior | Medium-high; pairs of forward passes | High for localized circuits |
| Natural-language explanations | User-facing UI summaries | Low (ask the model) | Low; often post-hoc rationalization |

Common pitfalls

  • Showing attention maps as proof — pretty heatmaps that fail knockouts erode trust faster than no explanation.
  • Asking the model to explain itself — chain-of-thought rationales can be unfaithful; never use them as sole compliance evidence.
  • Probe overfitting — a memorizing probe on 500 examples does not mean the property is used at inference.
  • Extrapolating from 7B to 70B — circuit locations shift with scale; re-run patching on the deployed checkpoint.
  • Ignoring context length — attribution on 200k-token agent traces is expensive; sample critical turns or use chunk-level scores.
  • Confusing correlation with fairness — removing a salient token may not remove bias encoded across many features.
  • No baseline comparisons — always compare attributions against random tokens and shuffled labels.

Production checklist

  • Define the behavior to explain (refusal, toxicity, wrong citation) before picking tools.
  • Log hidden-state hooks or use a serving framework that exports layer outputs on sampled traffic.
  • Build a golden set of failure prompts with human labels for probe training and patching.
  • Run faithfulness knockouts on any customer-facing attribution UI.
  • Version SAE and probe artifacts alongside model checkpoints in your registry.
  • Automate layer-wise probes in CI when fine-tuning safety-critical heads.
  • Document known limitations in internal runbooks and external model cards.
  • Separate RAG chunk attribution from base-model attribution in incident reports.
  • Cap attribution compute per request to avoid latency spikes in production.
  • Re-validate interpretability claims after quantization or distillation deploys.
  • Train support staff: explanations are hypotheses, not legal findings.
  • Pair interpretability with red-team evals, not as a replacement.

Key takeaways

  • Interpretability measures internal structure; explainability narrates outputs — know which you need.
  • Probes find encoded information; activation patching tests whether that information is causally used.
  • Attribution heatmaps require faithfulness knockouts before customer-facing use.
  • Sparse autoencoders turn opaque residuals into searchable feature dictionaries for monitoring.
  • Production debugging combines slice evals, probes, patching, and data fixes — not one silver bullet.
