
LLM refusal behavior explained

Harbor Support's warranty-policy bot was tuned aggressively after a red-team campaign found it would draft phishing emails when prompted cleverly. Engineers added keyword blocklists, a high-recall toxicity classifier, and a system prompt that said “refuse any request involving refunds, payments, or account access.” Harmful-request block rate climbed to 99.4%. So did false refusals: 41% of legitimate warranty questions — “My charger failed under warranty, what is the replacement process?” — hit a blanket refusal because the word “refund” appeared in the policy context the model retrieved. Escalations doubled; CSAT cratered.

Refusal behavior is a language model declining to fulfill a request, either explicitly (“I can't help with that”) or implicitly (empty output, endless hedging). It is distinct from errors, timeouts, and tool failures. Production systems need refusals for safety and compliance, but in low-risk domains poorly calibrated refusals destroy utility faster than under-refusal destroys trust. This guide covers refusal taxonomy, the over-refusal versus under-refusal tradeoff, layered enforcement with guardrails and moderation classifiers, calibration metrics and eval sets, user-facing refusal UX, the Harbor Support refactor, a decision table comparing blocklist-only enforcement with calibrated layered refusal, common pitfalls, and a production checklist.

Refusal taxonomy

Not every “no” means the same thing. Tag refusals in logs so you can tune each class independently.

Class	Trigger	Example
Safety / harm	Violence, self-harm, illegal activity instructions	“How do I synthesize…”
Policy / ToS	Product rules, regulated advice, brand boundaries	“Guarantee my portfolio returns 20%”
Capability	Model or agent cannot perform the action	“Delete my account” when no tool exists
Ambiguity	Request unclear; model refuses rather than clarify	“Fix it” with no context
Copyright / PII	Reproducing protected content or leaking secrets	“Paste the full textbook chapter”
Adversarial / injection	Detected jailbreak or hidden instruction	Markdown image tag with override prompt
Over-broad keyword	Blocklist or classifier false positive	“Refund policy” blocked because of “refund”

Safety refusals should be rare in volume but non-negotiable in severity. Capability and ambiguity refusals are often fixable with better tools or a clarifying question — treat them as product bugs, not alignment wins.
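One way to make the taxonomy taggable in logs is a small enum plus a per-class breakdown. This is a minimal sketch, not a reference schema: the names `RefusalClass`, `RefusalEvent`, `class_breakdown`, and `product_bug_share` are illustrative assumptions, and only the class list itself comes from the table above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class RefusalClass(Enum):
    """Refusal classes from the taxonomy table (identifiers are illustrative)."""
    SAFETY = "safety"
    POLICY = "policy"
    CAPABILITY = "capability"
    AMBIGUITY = "ambiguity"
    COPYRIGHT_PII = "copyright_pii"
    ADVERSARIAL = "adversarial"
    OVERBROAD_KEYWORD = "overbroad_keyword"


# Classes the guide treats as product bugs rather than alignment wins.
PRODUCT_BUG_CLASSES = {RefusalClass.CAPABILITY, RefusalClass.AMBIGUITY}


@dataclass(frozen=True)
class RefusalEvent:
    session_id: str
    refusal_class: RefusalClass


def class_breakdown(events):
    """Count refusals per class so each class can be tuned independently."""
    return Counter(e.refusal_class for e in events)


def product_bug_share(events):
    """Fraction of refusals that are capability or ambiguity refusals."""
    if not events:
        return 0.0
    bugs = sum(1 for e in events if e.refusal_class in PRODUCT_BUG_CLASSES)
    return bugs / len(events)
```

A high `product_bug_share` would suggest that most refusals can be fixed with better tools or clarifying questions rather than with safety tuning.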

Over-refusal versus under-refusal

Every deployment sits on a frontier:

Under-refusal — the model complies with harmful, non-compliant, or out-of-scope requests. Risk: regulatory exposure, user harm, brand damage. Measured by attack success rate (ASR) in red-team suites and policy-violation rate in production sampling.
Over-refusal — the model declines benign or in-scope requests. Risk: product abandonment, support load, false sense of safety (“we refuse everything, therefore we are safe”). Measured by false refusal rate on golden helpful sets and user-reported “bot refused wrongly” tickets.

High-stakes domains (medical dosing, minors, weapons) should keep their tolerance for under-refusal near zero, even at the cost of more over-refusal. Customer-support bots, coding assistants, and internal copilots usually need the opposite default: aggressive helpfulness within policy, with surgical safety blocks. Document your frontier position per surface; do not copy OpenAI's consumer chat defaults into a warranty FAQ bot.

Layered enforcement

System prompt and constitution

The base model's default refusal style is largely set by its post-training, such as RLHF or constitution-based training. Your system prompt should narrow scope (“answer Harbor warranty questions only”) without blanket bans on vocabulary. Prefer allowlists of intents over denylists of words.
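An intent allowlist can be sketched as a scope check on the output of an intent classifier. This is a rough illustration under stated assumptions: the intent labels, the `scope_decision` function, and the 0.6 confidence floor are all hypothetical, and the classifier that produces `intent` and `confidence` is out of scope here.

```python
from typing import FrozenSet

# Assumed in-scope intents for a warranty support bot; labels are illustrative.
IN_SCOPE_INTENTS: FrozenSet[str] = frozenset({"warranty", "shipping", "returns"})


def scope_decision(intent: str, confidence: float, min_confidence: float = 0.6) -> str:
    """Decide on the classified intent, never on individual words in the text.

    Returns "answer", "clarify", or "out_of_scope".
    """
    if intent not in IN_SCOPE_INTENTS:
        return "out_of_scope"
    if confidence < min_confidence:
        # In scope but uncertain: ask a clarifying question instead of refusing.
        return "clarify"
    return "answer"
```

Under this scheme a message such as “explain your refund policy” would be answered if the classifier maps it to an in-scope intent, regardless of which words it contains.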

Input classifiers

Run moderation models or regex gates on user input before the main completion. Use tiered thresholds: block at a severity score of 0.95 or above, flag 0.7–0.95 for logging, and pass below 0.7. Never block on a single keyword without considering its context (for example via an embedding of the full message): “refund” in “explain your refund policy” differs from “refund me via this stolen card.”
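The tiered thresholds can be expressed as a small routing function. This is a sketch that assumes the moderation model returns a severity score in the range 0 to 1; `route_input` and the constant names are illustrative, and the cutoffs are the ones quoted above.

```python
BLOCK_AT = 0.95  # block at or above this severity
FLAG_AT = 0.70   # flag for logging between FLAG_AT and BLOCK_AT


def route_input(severity: float) -> str:
    """Map a moderation severity score to "block", "flag", or "pass"."""
    if not 0.0 <= severity <= 1.0:
        raise ValueError(f"severity must be in [0, 1], got {severity}")
    if severity >= BLOCK_AT:
        return "block"
    if severity >= FLAG_AT:
        # Flagged inputs still reach the model; they are logged for review.
        return "flag"
    return "pass"
```

How the boundary values (exactly 0.7 or 0.95) are treated is a design choice; this sketch assigns them to the stricter tier.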

Output classifiers

Post-generation filters catch policy slips when the model complies despite instructions. Stream-safe implementations buffer sentence chunks or run async review on completed paragraphs. Pair with structured output when the answer is machine-parseable.
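Sentence-level buffering for streamed output might look like the following. It is a simplified sketch: `stream_with_review` and the `is_allowed` callback are hypothetical names, the sentence splitter is a naive regex, and a real output classifier would replace the callback.

```python
import re
from typing import Callable, Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_with_review(
    tokens: Iterable[str],
    is_allowed: Callable[[str], bool],
    fallback: str = "[response withheld]",
) -> Iterator[str]:
    """Yield only complete sentences that pass review.

    Stops at the first rejected sentence and yields the fallback instead.
    Sentences are yielded without trailing whitespace.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            if not is_allowed(sentence):
                yield fallback
                return
            yield sentence
        buffer = parts[-1]
    # Review whatever remains once the stream ends.
    if buffer.strip():
        yield buffer if is_allowed(buffer) else fallback
```

The tradeoff is latency: the user sees text one sentence late, in exchange for never seeing a sentence the filter would have rejected.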

Tool and action gates

For agents, the highest-risk refusals are not text but side effects. Refuse at the tool-policy layer: block execute_shell regardless of model willingness. Text refusals are insufficient when prompt injection can redirect tool calls.
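A tool-policy gate can be sketched as a dispatcher that checks an allowlist before any call runs. The tool names other than `execute_shell`, the `ToolDenied` exception, and `dispatch_tool` are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


class ToolDenied(Exception):
    """Raised when a requested tool is not on the allowlist."""


# Hypothetical allowlist for a support agent; execute_shell is deliberately absent.
ALLOWED_TOOLS = frozenset({"lookup_warranty", "create_ticket"})


def dispatch_tool(
    name: str,
    args: Dict[str, Any],
    registry: Dict[str, Callable[..., Any]],
) -> Any:
    """Run a tool only if policy allows it, whatever the model asked for."""
    if name not in ALLOWED_TOOLS:
        raise ToolDenied(f"tool {name!r} is not permitted on this surface")
    if name not in registry:
        raise KeyError(f"tool {name!r} is allowed but not registered")
    return registry[name](**args)
```

Because the check runs outside the model, an injected instruction that convinces the model to call `execute_shell` still fails at dispatch.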

Calibration and evaluation

Refusal tuning without metrics is guesswork. Build three eval sets:

Must-refuse — harmful, out-of-policy, and jailbreak prompts from red-team libraries. Track ASR; target near-zero on tier-1 harms.
Must-answer — golden in-scope questions your product exists to solve. Track false refusal rate; segment by topic.
Edge — dual-use, controversial-but-legitimate queries (security research, medical info, legal generalities). Track whether the model refuses, partially answers with disclaimers, or over-helps.

Report weekly: refusal rate overall, false refusal rate on must-answer, ASR on must-refuse, appeal/overturn rate from human review, and refusal class breakdown. A rising refusal rate with flat ASR usually signals over-refusal creep. Use LLM-as-judge only on bounded rubrics, and use human adjudication on edge sets.
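The two headline metrics reduce to simple ratios over labeled eval results. The sketch below assumes each eval case has already been judged as refused or not; `EvalResult` and the set labels `"must_refuse"` and `"must_answer"` are illustrative names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class EvalResult:
    eval_set: str  # "must_refuse", "must_answer", or "edge"
    refused: bool  # True if the model declined the prompt


def _subset(results: Iterable[EvalResult], eval_set: str) -> List[EvalResult]:
    subset = [r for r in results if r.eval_set == eval_set]
    if not subset:
        # Fail loudly: an empty set must not read as a perfect score.
        raise ValueError(f"no results for eval set {eval_set!r}")
    return subset


def false_refusal_rate(results: Iterable[EvalResult]) -> float:
    """Share of must-answer prompts that the model refused."""
    subset = _subset(results, "must_answer")
    return sum(r.refused for r in subset) / len(subset)


def attack_success_rate(results: Iterable[EvalResult]) -> float:
    """Share of must-refuse prompts that the model complied with."""
    subset = _subset(results, "must_refuse")
    return sum(not r.refused for r in subset) / len(subset)
```

Tracking both numbers from the same run makes it visible when a guardrail change lowers ASR only by raising false refusals.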

User-facing refusal UX

A bare “I cannot assist with that” trains users to adversarially rephrase. Better refusals:

State the boundary briefly — “I can't provide medical diagnoses” not “policy violation detected.”
Offer a safe alternative — “I can explain general warranty coverage steps” when full legal interpretation is out of scope.
Ask one clarifying question when ambiguity, not harm, triggered the refusal.
Provide an escalation path: hand off to a human agent for false refusals, and log the session ID for classifier retraining.

Avoid anthropomorphic guilt (“I'm sorry but…” on every line) and never leak internal policy document text in refusals.
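The refusal elements listed above can be assembled by a small template function. This is only a sketch; `build_refusal` and its parameter names are hypothetical, and the wording of each part would come from your own content guidelines.

```python
from typing import Optional


def build_refusal(
    boundary: str,
    alternative: Optional[str] = None,
    clarifying_question: Optional[str] = None,
    escalation: Optional[str] = None,
) -> str:
    """Compose a refusal from a boundary plus optional follow-ups.

    Order: boundary, safe alternative, clarifying question, escalation.
    No apology boilerplate and no internal policy text is added.
    """
    parts = [boundary]
    for extra in (alternative, clarifying_question, escalation):
        if extra:
            parts.append(extra)
    return " ".join(p.strip() for p in parts)
```

For example, `build_refusal("I can't interpret the legal terms of your warranty.", alternative="I can explain general warranty coverage steps.", escalation="Talk to an agent.")` joins the three sentences into one short message.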

Harbor Support policy bot refactor

Harbor Support rebuilt refusal handling after the warranty false-positive incident:

Removed keyword blocklist on “refund,” “payment,” “account”; replaced with intent classifier trained on 12k labeled support transcripts.
Split system prompt: scope block (warranty, shipping, returns) separate from safety block (harassment, illegal activity); safety rules unchanged, scope rules rewritten as positive allowlist.
Added must-answer eval set of 800 legitimate warranty variants; CI fails if false refusal rate exceeds 3% on GPT-4o and 5% on the smaller fallback model.
Output classifier threshold for financial-advice patterns is lowered only when the response contains imperative verbs (“you should wire…”), not when it contains descriptive policy text.
Refusal logs now include refusal_class and trigger_layer (input_classifier, model, output_classifier) for weekly calibration reviews.
User-facing template: boundary + alternative + “Talk to an agent” link; appeal button feeds false-refusal queue.

False refusal rate on warranty intents fell from 41% to 13% in week one, then to 8% after the classifier was retrained. Harmful-request block rate stayed above 99.1%. Escalations dropped 34%; CSAT recovered within three weeks.
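The CI gate on the must-answer set could be sketched as follows. The 3% and 5% limits and the 800-case set size come from the refactor list above; the model key `"fallback-small"`, the function name, and the integer-percent representation are assumptions made for illustration.

```python
# Maximum false refusal rate per model, in whole percent.
# "fallback-small" is a placeholder label for the smaller fallback model.
MAX_FALSE_REFUSAL_PCT = {"gpt-4o": 3, "fallback-small": 5}


def passes_false_refusal_gate(model: str, refused: int, total: int) -> bool:
    """Return True if the false refusal rate does not exceed the model's limit.

    Uses integer arithmetic so a rate exactly at the limit is not pushed
    over it by floating-point rounding.
    """
    if total <= 0:
        raise ValueError("eval set must not be empty")
    if not 0 <= refused <= total:
        raise ValueError("refused must be between 0 and total")
    limit_pct = MAX_FALSE_REFUSAL_PCT[model]
    return refused * 100 <= limit_pct * total
```

With 800 must-answer cases, this allows at most 24 false refusals on the primary model and 40 on the fallback before the build fails.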

Technique decision table

Scenario	Blocklists / blanket prompt bans	Calibrated layered refusal
High-volume support bot with narrow scope	High false refusal rate	Preferred — intent classifier + must-answer evals
Public open-domain chat	Insufficient alone	Preferred — model refusals + input/output moderation
Agent with dangerous tools	Irrelevant to tool abuse	Tool allowlist required; text refusal is secondary
Prototype / internal-only	Acceptable short-term	Defer until external launch
Regulated advice (medical, legal)	Easy over-refusal on edge cases	Preferred — tiered disclaimers + escalate, not blanket deny
Post-incident emergency lockdown	Fast blunt instrument	Follow with calibrated rollback plan and metrics

Common pitfalls

Keyword blocklists. “Kill” blocks “kill switch documentation” and “skill tree” in gaming contexts.
One global threshold. Support and developer surfaces need different refusal frontiers.
Refusing instead of clarifying. Ambiguity is not malice; ask a question.
No must-answer eval set. You only measure safety ASR and wonder why users leave.
Leaking refusals into RAG. Retrieved policy snippets that say “never discuss refunds” prime the model to refuse refund questions.
Ignoring model version drift. A provider update changes refusal style overnight; re-run golden sets on every model bump.
Cosmetic refusals. Model says no but tool layer still executes — users learn the bypass immediately.
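The keyword-blocklist pitfall is easy to reproduce. The sketch below uses two hypothetical helpers to show that substring matching flags “skill tree”, and that word-boundary matching fixes that case but still flags “kill switch documentation”, which is why context is needed.

```python
import re
from typing import Iterable


def naive_blocklist_hit(text: str, blocklist: Iterable[str]) -> bool:
    """Substring match: flags any text that merely contains a blocked word."""
    lowered = text.lower()
    return any(word.lower() in lowered for word in blocklist)


def word_boundary_hit(text: str, blocklist: Iterable[str]) -> bool:
    """Whole-word match: avoids 'skill' but still ignores context."""
    return any(
        re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
        for word in blocklist
    )
```

Neither function can tell a benign “kill switch” question from a harmful request, so both are at best a cheap pre-filter in front of a context-aware classifier.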

Production checklist

Refusal taxonomy defined; every production refusal logged with class and layer.
Documented frontier position (over- vs under-refusal bias) per product surface.
Must-refuse, must-answer, and edge eval sets with version control.
CI gates on false refusal rate and ASR before prompt/classifier deploys.
System prompt uses scope allowlists; no blanket vocabulary bans without context.
Input and output moderation with tiered thresholds, not binary keyword lists.
Tool allowlists enforced independently of model text refusals.
User-facing refusal template: boundary, safe alternative, escalation path.
Weekly calibration review: refusal rate, class breakdown, appeal queue.
Model version change triggers full refusal eval re-run.
Red-team regression after every guardrail tightening.
Feature flag to relax or tighten refusal layers during incidents.

Key takeaways

Refusal is a product behavior, not just a safety feature. Over-refusal kills utility as surely as under-refusal kills trust.
Classify refusals. Safety, policy, capability, and false-positive blocks need different fixes.
Layer enforcement. Prompt, classifiers, tools, and UX work together; blocklists alone fail.
Measure both sides. ASR on must-refuse and false refusal rate on must-answer.
Helpful refusals exist. Boundaries plus alternatives beat silent denials.