LLM activation steering explained

Harbor Support's 7B billing assistant passed safety evals but scored 2.1/5 on empathy in human reviews: replies were accurate yet read like a terms-of-service PDF. A full SFT retrain on 40k empathetic transcripts was queued for next quarter. Meanwhile, interpretability engineers extracted a steering vector from contrastive prompt pairs (“cold refund denial” vs “warm refund acknowledgment”) and used causal tracing to identify layer 14 of the residual stream as the primary tone site. At inference they added 0.42 × v_empathy to that layer's activations on every billing-intent token. Empathy scores rose to 4.0/5 on the same 200-ticket panel; factual accuracy on refund-policy questions fell by only 0.4 percentage points. No weight checkpoint changed. That is activation steering (a form of representation engineering): nudging internal model states at runtime instead of retraining parameters.

Steering sits between prompt engineering (external instructions the model may ignore) and fine-tuning (permanent weight changes with regression risk). It exploits the fact that many high-level behaviors (refusal, sycophancy, formality, chain-of-thought depth) correspond to approximately linear directions in hidden-state space. This guide covers the relevant transformer internals, contrastive vector extraction, activation addition and subtraction, layer and coefficient selection, multi-vector composition, the Harbor Support tone refactor, a technique decision table against guardrails and weight editing, common pitfalls, and a production checklist.

What activation steering solves

Teams often need fast, reversible behavior tweaks:

  • Tone and style — more empathetic, concise, or formal without a new fine-tune.
  • Safety posture — strengthen or relax refusal on edge-case prompts by adding or subtracting a “refusal direction.”
  • Reasoning mode — bias toward step-by-step chain-of-thought on math-heavy routes.
  • A/B experiments — test behavioral variants with a scalar coefficient dial instead of maintaining separate model checkpoints.

Steering is not a substitute for factual updates (use RAG or knowledge editing for parametric facts) or broad skill acquisition (use LoRA/SFT). It is best for continuous, interpretable axes in representation space that generalize across paraphrases better than a single system prompt line.

How steering vectors are built

The standard recipe (Contrastive Activation Addition, representation engineering):

  1. Contrastive dataset: collect prompt pairs that differ only in the target behavior. Example: same billing question, one answer template is terse/legalistic, the other warm and acknowledging.
  2. Forward passes: run both variants through the frozen model; record hidden states at candidate layers (often mid-to-late blocks in the residual stream).
  3. Difference vector: compute the mean activation difference v = mean(h_positive) - mean(h_negative) per layer. Normalize v to unit length for stable coefficient tuning.
  4. Layer sweep: evaluate each layer's vector on a held-out eval set; pick the layer with the best behavior shift and minimal collateral damage.
  5. Coefficient search: at inference, apply h' = h + αv and grid-search α (typical range 0.1–1.5 depending on model size).
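Steps 3 and 5 of the recipe can be sketched in a few lines. This is a minimal illustration using plain Python lists and toy 4-dimensional activations; the function names (`mean_vector`, `steering_vector`, `steer`) are made up for this example, and a real pipeline would most likely use a tensor library on recorded hidden states.

```python
import math

def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero difference vector: the contrast pairs do not separate")
    return [x / norm for x in v]

def steering_vector(h_positive, h_negative):
    """v = mean(h_positive) - mean(h_negative), normalized to unit length."""
    mu_pos = mean_vector(h_positive)
    mu_neg = mean_vector(h_negative)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])

def steer(h, v, alpha):
    """Activation addition: h' = h + alpha * v."""
    return [x + alpha * y for x, y in zip(h, v)]

# Toy activations (assumed values, not real model states).
h_warm = [[1.0, 2.0, 0.0, 1.0], [1.0, 4.0, 0.0, 1.0]]
h_cold = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]]

v = steering_vector(h_warm, h_cold)
steered = steer([0.5, 0.5, 0.5, 0.5], v, alpha=0.42)
```

In this toy case the two sets differ only along the second dimension, so `v` is the unit vector along that axis and steering moves only that coordinate.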

Vectors can be extracted from activations at the last token of the prompt or averaged over response tokens, and they can be applied at every generated token or only on tokens matching a classifier trigger. Production stacks often steer only on “assistant role” positions to avoid corrupting the representations of user tokens.
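Restricting steering to assistant positions amounts to a per-token mask. The sketch below assumes each token's hidden state comes with a role label; the `roles` list and the function name are hypothetical, since how a serving stack exposes token roles varies.

```python
def steer_assistant_positions(hidden, roles, v, alpha):
    """Add alpha * v only where the token's role is 'assistant'.

    hidden: list of per-token activation vectors
    roles:  parallel list of role labels (assumed to be available)
    """
    if len(hidden) != len(roles):
        raise ValueError("hidden and roles must have the same length")
    return [
        [x + alpha * y for x, y in zip(h, v)] if role == "assistant" else list(h)
        for h, role in zip(hidden, roles)
    ]

# Toy sequence: one user token followed by two assistant tokens.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
roles = ["user", "assistant", "assistant"]
out = steer_assistant_positions(hidden, roles, v=[0.0, 1.0], alpha=0.5)
```

The user token passes through unchanged, which is the property the production stacks described above are after.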

Addition vs subtraction

  • Addition (+αv) amplifies the positive behavior (empathy, helpfulness, CoT).
  • Subtraction (-αv) suppresses the behavior v points toward, so it fits vectors whose positive pole is the unwanted behavior (sycophancy, over-refusal, or a hallucination-prone “confident” mode when v was extracted as overconfident minus humble).
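One way to see the sign convention: for a unit-length v, adding αv raises the projection of the hidden state onto v by exactly α, and subtracting lowers it by α. The sketch below checks this on toy numbers; all values and names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(h, v, alpha):
    """h' = h + alpha * v (use a negative alpha for subtraction)."""
    return [x + alpha * y for x, y in zip(h, v)]

h = [0.2, -0.1, 0.4]
v = [0.0, 0.0, 1.0]  # unit-length direction (toy)

base = dot(h, v)                       # projection before steering
added = dot(steer(h, v, +0.5), v)      # projection after addition
subtracted = dot(steer(h, v, -0.5), v) # projection after subtraction
```

Whether the model's behavior tracks that projection is an empirical question that the layer and coefficient sweeps are meant to answer.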

Where to hook in the transformer

Decoder-only LLMs expose several intervention points:

Site | Effect | Typical use
Residual stream (post-attention) | Broad behavioral shift; strongest for tone and refusal | Empathy, formality, safety
MLP output | More localized; can affect factual recall | Research; overlaps with knowledge editing
Attention output | Alters what tokens attend to | Context focus experiments; less stable in prod
Embedding layer | Global bias on all tokens | Rare; high collateral risk

Mid layers (roughly 40–70% depth) often encode abstract concepts; early layers encode syntax; late layers commit to token logits. Harbor's empathy vector peaked at layer 14 of 32 — consistent with published work on refusal and sentiment directions in Llama-class models.
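A layer sweep can be reduced to a constrained selection: keep the layers whose collateral damage stays within a budget, then take the one with the largest behavior shift. The sketch below uses invented toy numbers (not Harbor's measurements), and the `LayerResult` fields are assumptions about what an eval harness would report.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerResult:
    layer: int
    behavior_gain: float   # rubric-score improvement on held-out prompts
    accuracy_drop: float   # loss on a separate factual suite

def pick_layer(results, max_accuracy_drop):
    """Best behavior gain among layers that stay within the accuracy budget."""
    eligible = [r for r in results if r.accuracy_drop <= max_accuracy_drop]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.behavior_gain)

# Invented sweep results for illustration only.
sweep = [
    LayerResult(layer=8, behavior_gain=0.6, accuracy_drop=0.002),
    LayerResult(layer=14, behavior_gain=1.8, accuracy_drop=0.004),
    LayerResult(layer=20, behavior_gain=2.0, accuracy_drop=0.030),
    LayerResult(layer=26, behavior_gain=0.9, accuracy_drop=0.015),
]
best = pick_layer(sweep, max_accuracy_drop=0.01)
```

Here the layer with the highest raw gain is rejected because it exceeds the accuracy budget, which is the "minimal collateral damage" criterion in practice.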

Multi-vector composition and dynamic steering

Real products combine behaviors:

  • Linear superposition: h' = h + α v_empathy + β v_concise. Vectors are not always orthogonal; test for interference on a joint eval set.
  • Conditional steering: apply vectors only when an intent classifier fires (billing vs technical vs safety-sensitive). Reduces global style drift on unrelated queries.
  • Decay schedules: taper α over long generations so late tokens do not over-steer into repetitive phrasing.
  • Per-request overrides: expose α as an API parameter for enterprise tenants (“formal mode” = +0.6 on v_formal).
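The composition and decay ideas above can be sketched as follows. Cosine similarity is used here as a cheap first check for overlap between two vectors; it does not replace a joint eval. The exponential half-life schedule is one possible taper chosen for illustration, not something the guide prescribes, and all vectors are toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity; values near 0 suggest little geometric overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def compose(h, weighted_vectors):
    """h' = h + sum(coef_i * v_i) over (coef, vector) pairs."""
    out = list(h)
    for coef, vec in weighted_vectors:
        out = [x + coef * y for x, y in zip(out, vec)]
    return out

def decayed_alpha(alpha0, step, half_life):
    """Halve the coefficient every `half_life` generated tokens (assumed schedule)."""
    return alpha0 * 0.5 ** (step / half_life)

v_empathy = [1.0, 0.0, 0.0]
v_concise = [0.6, 0.8, 0.0]   # deliberately overlaps with v_empathy

overlap = cosine(v_empathy, v_concise)
h_steered = compose([0.0, 0.0, 1.0], [(0.4, v_empathy), (0.5, v_concise)])
late_alpha = decayed_alpha(0.6, step=100, half_life=100)
```

Because the two toy vectors overlap, the first coordinate receives a push from both, which is the kind of interference a joint eval set is meant to catch.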

Dynamic steering pairs well with model routing: a small classifier picks route and steering profile before the main model decodes.
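As a rough sketch of that pairing, a route table can map the classifier's intent label to a steering profile, with tenant overrides merged on top. The table contents, the intent labels, and the function name are hypothetical; only the 0.42 empathy coefficient and the 0.6 “formal mode” value come from the text.

```python
# Hypothetical route table: intent label -> {vector name: coefficient}.
ROUTE_PROFILES = {
    "billing": {"empathy": 0.42},
    "technical": {},          # no steering on unrelated routes
    "safety_override": {},    # steering stays off on the safety path
}

def steering_profile(intent, overrides=None):
    """Return the coefficients to apply for this request.

    Unknown intents get no steering. Tenant overrides are ignored on the
    safety path so a per-request parameter cannot re-enable steering there.
    """
    profile = dict(ROUTE_PROFILES.get(intent, {}))
    if intent != "safety_override" and overrides:
        profile.update(overrides)
    return profile

billing_default = steering_profile("billing")
billing_formal = steering_profile("billing", {"formal": 0.6})
safety = steering_profile("safety_override", {"formal": 0.6})
```

Returning a fresh dict per request keeps tenant overrides from leaking into the shared table.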

Harbor Support empathy refactor (worked example)

  1. Contrast set: 800 billing prompts × 2 tone variants; human-rated warm vs cold templates.
  2. Tracing: activation patching confirmed the layer 14 residual stream as the highest causal lever for tone ratings.
  3. Vector extract: mean difference on assistant-position hidden states; L2-normalized v_empathy stored as an FP16 blob (8 KB for 4096 dimensions at 2 bytes each).
  4. Coefficient tune: grid α ∈ [0.2, 0.8] on 200 held-out tickets; α = 0.42 gave the highest empathy score with only a small increase in policy errors.
  5. Serving hook: vLLM custom forward hook on layer 14; steering disabled on the safety-refusal override path.
  6. Validation: empathy 2.1 → 4.0/5; policy accuracy 96.2% → 95.8%; latency +0.3 ms per token (negligible).
  7. Rollback: feature flag sets α = 0; no checkpoint redeploy required.
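The coefficient-tuning step can be sketched as a gated grid search: among coefficients whose accuracy drop stays within a budget, keep the one with the best behavior score, and fall back to α = 0 (steering off) if none qualifies. The `toy_evaluate` function below is an invented stand-in for a real eval harness and does not reproduce Harbor's curve.

```python
def pick_alpha(grid, evaluate, baseline_accuracy, max_drop):
    """Highest-scoring alpha whose accuracy drop is within max_drop, else 0.0."""
    best_alpha, best_score = 0.0, None
    for alpha in grid:
        score, accuracy = evaluate(alpha)
        if baseline_accuracy - accuracy > max_drop:
            continue  # fails the factuality gate
        if best_score is None or score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha

def toy_evaluate(alpha):
    """Invented response curve: score rises with alpha, accuracy falls quadratically."""
    score = 1.0 + alpha
    accuracy = 1.0 - 0.1 * alpha * alpha
    return score, accuracy

grid = [round(0.2 + 0.1 * i, 1) for i in range(7)]  # 0.2, 0.3, ..., 0.8
chosen = pick_alpha(grid, toy_evaluate, baseline_accuracy=1.0, max_drop=0.02)
```

Defaulting to 0.0 when every candidate fails the gate mirrors the rollback behavior in step 7: the safe state is no steering.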

They kept system prompts and RAG unchanged; steering handled paraphrase robustness where instructions alone failed. The SFT retrain is still planned for broader skill uplift.

Technique decision table

Approach | Best when | Trade-off
System prompt | Fast iteration, low stakes, single-tenant chat | Weak on long context and paraphrase; easy to jailbreak
Activation steering | Reversible tone/safety/reasoning axis; A/B without new weights | Needs contrast data and layer tuning; not for new facts
Classifier guardrails | Hard block/allow on policy violations | Binary; does not improve generation quality inside allowed zone
LoRA / SFT | Durable broad behavior or domain skill | GPU cost; regression risk; slow rollback
Knowledge editing (MEMIT) | Specific factual tuple updates in weights | Permanent; locality risk; not for style axes
RLHF / DPO | Population-level preference alignment | Expensive; opaque; hard to target one axis surgically

Common pitfalls

  • Steering factual content: vectors extracted for tone can still push the model toward incorrect policy answers; validate factual suites separately.
  • Single-layer dogma: the best layer shifts across model versions; re-run sweeps after every base-model upgrade.
  • Over-coefficient: high α causes repetitive filler, sycophancy, or gibberish; cap α and monitor perplexity.
  • Non-orthogonal stacks: empathy and conciseness vectors may fight; test combinations rather than assuming the vectors are orthogonal.
  • Steering user tokens: modifying user-side activations can break instruction following; restrict hooks to assistant positions.
  • Quantized inference: INT4 kernels may not expose hooks; verify the steering path in the exact serving stack (vLLM, TensorRT-LLM).
  • Narrow contrast data: contrast sets drawn from one demographic skew the vector; build pairs per locale and product line.
  • Confusing steering with unlearning: subtracting a “harmful” direction reduces but rarely eliminates capability; combine with guardrails for high-risk domains.
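For the over-coefficient pitfall, two small guards are easy to sketch: clamp any requested coefficient to a cap, and compute perplexity from per-token log-probabilities so a dashboard can flag degraded output. The cap value and function names are illustrative, and the sketch assumes the serving stack returns natural-log token probabilities.

```python
import math

def capped_alpha(requested, cap):
    """Clamp a requested coefficient to the range [-cap, +cap]."""
    return max(-cap, min(cap, requested))

def perplexity(token_logprobs):
    """exp of the negative mean log-probability (natural log assumed)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy check: four tokens, each assigned probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
alpha = capped_alpha(2.5, cap=0.8)
```

A sustained rise in perplexity after a coefficient change is a reasonable trigger for rolling α back, though the right threshold has to be found per model.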

Production checklist

  • Define the target behavior axis with measurable rubrics (human or LLM-as-judge).
  • Build contrastive prompt pairs (500+ per axis minimum for stable means).
  • Sweep layers and record behavior vs collateral metrics on held-out eval.
  • Normalize vectors; grid-search coefficient with safety and factuality gates.
  • Hook only assistant-position activations in the serving forward pass.
  • Feature-flag α per tenant or route; default to 0 for rollback.
  • Version vectors with base model hash; invalidate on checkpoint upgrades.
  • Monitor empathy/tone/refusal rates and perplexity in production dashboards.
  • Re-extract vectors after major base-model or tokenizer changes.
  • Document steering as inference config, not weight change, in compliance audits.
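The versioning and rollback items can be combined into one small check at serving time: a vector is applied only if it was extracted from the checkpoint currently being served, and the coefficient otherwise falls back to 0. The record layout, the hash strings, and the function name below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorRecord:
    name: str
    base_model_hash: str   # hash of the checkpoint the vector was extracted from
    layer: int
    values: tuple          # the unit-normalized vector

def resolve_alpha(record, serving_model_hash, flag_alpha):
    """Return the coefficient to apply, or 0.0 if the vector is stale.

    flag_alpha comes from the feature flag; setting it to 0.0 is the rollback.
    """
    if record.base_model_hash != serving_model_hash:
        return 0.0  # vector was built for a different checkpoint
    return flag_alpha

record = VectorRecord(
    name="empathy",
    base_model_hash="abc123",   # placeholder hash
    layer=14,
    values=(0.0, 1.0, 0.0, 0.0),
)
live = resolve_alpha(record, serving_model_hash="abc123", flag_alpha=0.42)
stale = resolve_alpha(record, serving_model_hash="def456", flag_alpha=0.42)
```

Failing closed on a hash mismatch means a base-model upgrade silently disables steering until the vector is re-extracted, rather than applying a direction that may no longer mean the same thing.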

Key takeaways

  • Activation steering shifts behavior by adding contrastive vectors to hidden states at inference — no weight update required.
  • Mid-layer residual-stream hooks usually work best for tone, refusal, and reasoning mode; layer and coefficient need per-model sweeps.
  • Harbor Support raised empathy scores from 2.1 to 4.0/5 with a layer-14 vector at α=0.42 while policy accuracy slipped by only 0.4 points (96.2% to 95.8%).
  • Steering complements prompts and guardrails but does not replace RAG, SFT, or knowledge editing for facts and durable skills.
  • Feature-flagged coefficients give instant rollback — a major ops advantage over shipping new fine-tuned checkpoints.
