LLM activation steering explained
Harbor Support's 7B billing assistant passed safety evals but scored 2.1/5 on empathy in human reviews: replies were accurate yet read like a terms-of-service PDF. A full SFT retrain on 40k empathetic transcripts was queued for next quarter. Meanwhile, interpretability engineers extracted a steering vector from contrastive prompt pairs (“cold refund denial” vs “warm refund acknowledgment”), using causal tracing to locate layer 14 of the residual stream as the primary tone site. At inference they added `0.42 × v_empathy` to that layer's activations on every billing-intent token. Empathy scores rose to 4.0/5 on the same 200-ticket panel; factual accuracy on refund-policy questions held within 0.4 percentage points. No weight checkpoint changed. That is activation steering (also called representation engineering): nudging internal model states at runtime instead of retraining parameters.
Steering sits between prompt engineering (external instructions the model may ignore) and fine-tuning (permanent weight changes with regression risk). It exploits the fact that many high-level behaviors (refusal, sycophancy, formality, chain-of-thought depth) correspond to approximately linear directions in hidden-state space. This guide covers contrastive vector extraction, activation addition and subtraction, hook sites in the transformer, layer and coefficient selection, multi-vector composition, the Harbor Support tone refactor, a decision table comparing steering with guardrails and weight editing, common pitfalls, and a production checklist.
What activation steering solves
Teams often need fast, reversible behavior tweaks:
- Tone and style — more empathetic, concise, or formal without a new fine-tune.
- Safety posture — strengthen or relax refusal on edge-case prompts by adding or subtracting a “refusal direction.”
- Reasoning mode — bias toward step-by-step chain-of-thought on math-heavy routes.
- A/B experiments — test behavioral variants with a scalar coefficient dial instead of maintaining separate model checkpoints.
Steering is not a substitute for factual updates (use RAG or knowledge editing for parametric facts) or broad skill acquisition (use LoRA/SFT). It is best for continuous, interpretable axes in representation space that generalize across paraphrases better than a single system prompt line.
How steering vectors are built
The standard recipe (Contrastive Activation Addition, representation engineering):
- Contrastive dataset: collect prompt pairs that differ only in the target behavior. Example: same billing question, one answer template is terse/legalistic, the other warm and acknowledging.
- Forward passes: run both variants through the frozen model; record hidden states at candidate layers (often mid-to-late blocks in the residual stream).
- Difference vector: compute the mean activation difference `v = mean(h_positive) - mean(h_negative)` per layer. Normalize `v` to unit length for stable coefficient tuning.
- Layer sweep: evaluate each layer's vector on a held-out eval set; pick the layer with the best behavior shift and minimal collateral damage.
- Coefficient search: at inference, apply `h' = h + αv` and grid-search `α` (typical range 0.1–1.5 depending on model size).
Vectors can be extracted, and later applied, at the last token of the prompt, at every generated token, or only on tokens matching a classifier trigger. Production stacks often steer only on “assistant role” positions to avoid corrupting the representations of user tokens.
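The recipe above can be sketched in a few lines. This is a minimal illustration, not a serving implementation: plain Python lists stand in for tensors so it runs without dependencies, the 3-dimensional activations are made-up numbers, and the names `mean_vector`, `steering_vector`, `steer`, and `is_assistant` are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """v = mean(h_positive) - mean(h_negative), scaled to unit L2 norm."""
    diff = [p - n for p, n in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means are identical")
    return [x / norm for x in diff]

def steer(hidden, v, alpha, is_assistant):
    """Apply h' = h + alpha * v only at positions flagged as assistant tokens."""
    return [
        [h + alpha * x for h, x in zip(row, v)] if flag else list(row)
        for row, flag in zip(hidden, is_assistant)
    ]

# Toy 3-dim activations recorded at one layer (made-up numbers).
warm = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
cold = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
v = steering_vector(warm, cold)

# One user-token row and one assistant-token row.
hidden = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
steered = steer(hidden, v, alpha=0.42, is_assistant=[False, True])
```

Only the assistant row moves, and only along `v`; the user row is copied through unchanged. In a real stack the same arithmetic would run on framework tensors inside a forward hook.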
Addition vs subtraction
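- Addition (`+αv`) amplifies the behavior that `v` points toward (empathy, helpfulness, CoT).
- Subtraction (`-αv`) suppresses the behavior that `v` points toward. Use it when the vector was extracted with the unwanted behavior as the positive side (sycophancy, over-refusal, or a hallucination-prone “confident” mode when `v` was built from overconfident vs humble pairs).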
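A small numeric check of the sign convention may help. Assuming `v` has unit length, adding `αv` raises the projection of `h` onto `v` by exactly `α`, and subtracting lowers it by `α`. The helper names `dot` and `shift` and the 2-dimensional values are illustrative only.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def shift(h, v, alpha):
    """Return h + alpha * v (use a negative alpha for subtraction)."""
    return [x + alpha * y for x, y in zip(h, v)]

v = [0.6, 0.8]   # unit-length direction: 0.36 + 0.64 = 1
h = [1.0, 2.0]   # toy hidden state
alpha = 0.5

base = dot(h, v)                           # projection before steering
added = dot(shift(h, v, alpha), v)         # base + alpha
subtracted = dot(shift(h, v, -alpha), v)   # base - alpha
```

Because the shift along `v` equals `α` only when `v` is normalized, coefficients tuned on one vector are comparable to another only if both are unit length.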
Where to hook in the transformer
Decoder-only LLMs expose several intervention points:
| Site | Effect | Typical use |
|---|---|---|
| Residual stream (post-attention) | Broad behavioral shift; strongest for tone and refusal | Empathy, formality, safety |
| MLP output | More localized; can affect factual recall | Research; overlaps with knowledge editing |
| Attention output | Alters what tokens attend to | Context focus experiments; less stable in prod |
| Embedding layer | Global bias on all tokens | Rare; high collateral risk |
Mid layers (roughly 40–70% depth) often encode abstract concepts; early layers encode syntax; late layers commit to token logits. Harbor's empathy vector peaked at layer 14 of 32 — consistent with published work on refusal and sentiment directions in Llama-class models.
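To make the residual-stream site concrete, here is a toy sketch: a stack of blocks that each write an update back into the stream, with an optional hook that edits the stream right after one block. It is an assumption-laden illustration, not a real transformer; `run_blocks`, `add_vector`, and `mid_depth_layers` are hypothetical helpers, and whether layers are counted from 0 or 1 depends on the framework.

```python
import math

def mid_depth_layers(n_layers, lo=0.4, hi=0.7):
    """Layer indices between lo and hi fractional depth (inclusive)."""
    return list(range(math.ceil(lo * n_layers), math.floor(hi * n_layers) + 1))

def run_blocks(h, blocks, hooks=None):
    """Toy residual stack: h <- h + block(h); an optional hook then edits
    the residual stream right after the block at that index."""
    hooks = hooks or {}
    for idx, block in enumerate(blocks):
        h = [a + b for a, b in zip(h, block(h))]
        if idx in hooks:
            h = hooks[idx](h)
    return h

def add_vector(v, alpha):
    """Build a hook that applies h' = h + alpha * v."""
    return lambda h: [a + alpha * b for a, b in zip(h, v)]

# Four identical toy blocks; each writes 0.1 * h back into the stream.
blocks = [lambda h: [0.1 * x for x in h]] * 4
baseline = run_blocks([1.0, 0.0], blocks)
steered = run_blocks([1.0, 0.0], blocks, hooks={1: add_vector([0.0, 1.0], 0.5)})

# Candidate sweep range for a 32-layer model at 40-70% depth.
candidates = mid_depth_layers(32)
```

The injected component is carried forward and transformed by every later block, which is why the choice of layer matters. For a 32-layer model the 40–70% band covers layers 13 through 22, so layer 14 sits near its lower edge.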
Multi-vector composition and dynamic steering
Real products combine behaviors:
- Linear superposition: `h' = h + α·v_empathy + β·v_concise`. Vectors are not always orthogonal; test for interference on a joint eval set.
- Conditional steering: apply vectors only when an intent classifier fires (billing vs technical vs safety-sensitive). This reduces global style drift on unrelated queries.
- Decay schedules: taper `α` over long generations so late tokens do not over-steer into repetitive phrasing.
- Per-request overrides: expose `α` as an API parameter for enterprise tenants (“formal mode” = +0.6 on `v_formal`).
Dynamic steering pairs well with model routing: a small classifier picks route and steering profile before the main model decodes.
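The composition patterns above could be combined roughly as follows. This is a sketch under stated assumptions: the route names and coefficients in `PROFILES` are hypothetical, the exponential taper in `decayed_alpha` is one possible decay schedule rather than a prescribed one, and the 2-dimensional vectors are toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity; values far from 0 suggest the vectors may interfere."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def decayed_alpha(alpha, step, half_life):
    """Exponential taper: alpha halves every `half_life` generated tokens."""
    return alpha * 0.5 ** (step / half_life)

def compose(h, weighted_vectors):
    """h' = h + sum of coeff * v over (coeff, vector) pairs."""
    out = list(h)
    for coeff, v in weighted_vectors:
        out = [x + coeff * y for x, y in zip(out, v)]
    return out

# Hypothetical per-route steering profiles; unknown routes get no steering.
PROFILES = {
    "billing": {"empathy": 0.42, "concise": 0.2},
    "technical": {"concise": 0.3},
}

def steer_token(h, route, step, vectors, half_life=64):
    """Steer one token's hidden state using the route's profile and the taper."""
    coeffs = PROFILES.get(route, {})
    weighted = [(decayed_alpha(a, step, half_life), vectors[name])
                for name, a in coeffs.items()]
    return compose(h, weighted)

vectors = {"empathy": [1.0, 0.0], "concise": [0.6, 0.8]}  # toy unit vectors
overlap = cosine(vectors["empathy"], vectors["concise"])
first = steer_token([0.0, 0.0], "billing", step=0, vectors=vectors)
later = steer_token([0.0, 0.0], "billing", step=64, vectors=vectors)
other = steer_token([0.0, 0.0], "smalltalk", step=0, vectors=vectors)
```

Here the two toy vectors have a cosine of 0.6, so pushing on conciseness also pushes on empathy. That kind of overlap is what a joint eval set is meant to catch.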
Harbor Support empathy refactor (worked example)
- Contrast set: 800 billing prompts × 2 tone variants; human-rated warm vs cold templates.
- Tracing: activation patching confirmed the layer 14 residual stream as the highest causal lever for tone ratings.
- Vector extract: mean difference on assistant-position hidden states; L2-normalized `v_empathy` stored as an FP16 blob (8 KB for 4096 dimensions at 2 bytes each).
- Coefficient tune: grid `α ∈ [0.2, 0.8]` on 200 held-out tickets; `α = 0.42` maximized empathy at minimal policy-error increase.
- Serving hook: vLLM custom forward hook on layer 14; steering disabled on the safety-refusal override path.
- Validation: empathy 2.1 → 4.0/5; policy accuracy 96.2% → 95.8%; latency +0.3 ms per token (negligible).
- Rollback: feature flag sets `α = 0`; no checkpoint redeploy required.
The team kept system prompts and RAG unchanged; steering handled paraphrase robustness where instructions alone failed. The SFT retrain is still planned for broader skill uplift.
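The coefficient-tuning step can be framed as a gated search: keep only the candidates whose accuracy stays close to the unsteered baseline, then take the one with the best behavior score. The sketch below assumes a hypothetical `evaluate(alpha)` callback that returns `(behavior_score, accuracy)`. The `toy_evaluate` curves are invented for illustration and are not Harbor's measurements.

```python
def pick_alpha(candidates, evaluate, max_accuracy_drop):
    """Return the candidate with the best behavior score among those whose
    accuracy stays within max_accuracy_drop points of the unsteered baseline."""
    _, baseline_accuracy = evaluate(0.0)
    passing = []
    for alpha in candidates:
        behavior, accuracy = evaluate(alpha)
        if baseline_accuracy - accuracy <= max_accuracy_drop:
            passing.append((behavior, alpha))
    if not passing:
        return 0.0  # no candidate clears the gate: leave steering off
    return max(passing)[1]

def toy_evaluate(alpha):
    """Synthetic stand-in for a held-out eval run; these curves are invented."""
    behavior = 2.0 + 4.0 * alpha - 2.0 * alpha * alpha
    accuracy = 96.0 - 4.0 * alpha * alpha
    return behavior, accuracy

grid = [i / 10 for i in range(2, 9)]  # 0.2, 0.3, ..., 0.8
best = pick_alpha(grid, toy_evaluate, max_accuracy_drop=0.8)
strict = pick_alpha(grid, toy_evaluate, max_accuracy_drop=0.0)
```

With these synthetic curves the gate admits coefficients up to 0.4 and rejects 0.5 and above. A zero-tolerance gate falls back to `α = 0`, which matches the rollback state.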
Technique decision table
| Approach | Best when | Trade-off |
|---|---|---|
| System prompt | Fast iteration, low stakes, single-tenant chat | Weak on long context and paraphrase; easy to jailbreak |
| Activation steering | Reversible tone/safety/reasoning axis; A/B without new weights | Needs contrast data and layer tuning; not for new facts |
| Classifier guardrails | Hard block/allow on policy violations | Binary; does not improve generation quality inside allowed zone |
| LoRA / SFT | Durable broad behavior or domain skill | GPU cost; regression risk; slow rollback |
| Knowledge editing (MEMIT) | Specific factual tuple updates in weights | Permanent; locality risk; not for style axes |
| RLHF / DPO | Population-level preference alignment | Expensive; opaque; hard to target one axis surgically |
Common pitfalls
- Steering factual content: vectors extracted from tone contrasts can still nudge the model toward incorrect policy answers; validate factual suites separately.
- Single-layer dogma: the best layer shifts across model versions; re-run sweeps after every base-model upgrade.
- Over-coefficient: a high `α` causes repetitive filler, sycophancy, or gibberish; cap it and monitor perplexity.
- Non-orthogonal stacks: empathy and conciseness vectors may fight; test combinations rather than assuming the directions are orthogonal.
- Steering user tokens: modifying user-side activations can break instruction following; restrict hooks to assistant positions.
- Quantized inference: INT4 kernels may not expose hooks; verify the steering path in the exact serving stack (vLLM, TensorRT-LLM).
- No eval diversity: contrast sets drawn from one demographic skew the vector; build pairs per locale and product line.
- Confusing steering with unlearning: subtracting a “harmful” direction reduces but rarely eliminates a capability; combine with guardrails for high-risk domains.
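For the over-coefficient pitfall, a cap plus a perplexity comparison could look like the sketch below. It assumes per-token natural-log probabilities are available from the serving stack. The function names and the `max_ratio=1.5` alert threshold are hypothetical choices, not recommended values.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def capped_alpha(requested, cap):
    """Clamp a requested coefficient to the range [-cap, cap]."""
    return max(-cap, min(cap, requested))

def steering_alert(baseline_logprobs, steered_logprobs, max_ratio=1.5):
    """Flag when steered perplexity exceeds max_ratio times the baseline."""
    return perplexity(steered_logprobs) > max_ratio * perplexity(baseline_logprobs)

# Toy log-probs: every baseline token at p = 0.5, every steered token at p = 0.25.
baseline_lp = [math.log(0.5)] * 4
steered_lp = [math.log(0.25)] * 4
alert = steering_alert(baseline_lp, steered_lp)
```

A ratio against an unsteered baseline on the same prompts is used here because absolute perplexity varies widely by route and prompt length.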
Production checklist
- Define the target behavior axis with measurable rubrics (human or LLM-as-judge).
- Build contrastive prompt pairs (500+ per axis minimum for stable means).
- Sweep layers and record behavior vs collateral metrics on held-out eval.
- Normalize vectors; grid-search coefficient with safety and factuality gates.
- Hook only assistant-position activations in the serving forward pass.
- Feature-flag `α` per tenant or route; default to 0 for rollback.
- Version vectors with base model hash; invalidate on checkpoint upgrades.
- Monitor empathy/tone/refusal rates and perplexity in production dashboards.
- Re-extract vectors after major base-model or tokenizer changes.
- Document steering as inference config, not weight change, in compliance audits.
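The versioning and feature-flag items can be sketched as a small config check. This is a hypothetical design: `SteeringVector`, `model_fingerprint`, and `effective_alpha` are illustrative names, and a byte string stands in for the checkpoint file that a real deployment would hash.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SteeringVector:
    name: str
    layer: int
    base_model_hash: str  # fingerprint of the checkpoint the vector came from
    values: tuple

def model_fingerprint(checkpoint_bytes: bytes) -> str:
    """SHA-256 hex digest used to tie a vector to one base checkpoint."""
    return hashlib.sha256(checkpoint_bytes).hexdigest()

def effective_alpha(vector, serving_model_hash, flags, tenant):
    """Return the coefficient to apply; 0.0 means steering is off."""
    if vector.base_model_hash != serving_model_hash:
        return 0.0  # vector was extracted from a different checkpoint
    return flags.get(tenant, 0.0)  # unflagged tenants stay at 0

fp_v1 = model_fingerprint(b"base-model-v1")  # stand-in for real checkpoint bytes
fp_v2 = model_fingerprint(b"base-model-v2")
vec = SteeringVector("empathy", 14, fp_v1, (1.0, 0.0))
flags = {"tenant-a": 0.42}

on = effective_alpha(vec, fp_v1, flags, "tenant-a")
unflagged = effective_alpha(vec, fp_v1, flags, "tenant-b")
stale = effective_alpha(vec, fp_v2, flags, "tenant-a")
```

Returning 0 on a hash mismatch makes a forgotten re-extraction fail safe: the model serves unsteered output instead of applying a stale direction.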
Key takeaways
- Activation steering shifts behavior by adding contrastive vectors to hidden states at inference — no weight update required.
- Mid-layer residual-stream hooks usually work best for tone, refusal, and reasoning mode; layer and coefficient need per-model sweeps.
- Harbor Support raised empathy scores from 2.1 to 4.0/5 with a layer-14 vector at α=0.42 while policy accuracy moved only from 96.2% to 95.8%.
- Steering complements prompts and guardrails but does not replace RAG, SFT, or knowledge editing for facts and durable skills.
- Feature-flagged coefficients give instant rollback — a major ops advantage over shipping new fine-tuned checkpoints.
Related reading
- LLM interpretability explained — causal tracing, activation patching, and probing
- LLM guardrails explained — input filters, output policies, and safe agents
- Transformer architecture explained — residual stream, attention, and MLP blocks
- LLM knowledge editing explained — surgical weight updates vs inference-time control