Guide
LLM content moderation and toxicity filters explained
Harbor Support's first public beta shipped with a single line in the system prompt: “Be respectful.” Within forty-eight hours, community screenshots showed the assistant quoting a user's slur back in a troubleshooting reply, and a RAG chunk from an old forum thread reintroduced harassment language the model then paraphrased as policy. Engineering had wired OpenAI's moderation endpoint — but only on the latest user message, never on retrieved context, never on streaming tokens as they left the model, and never with locale-specific blocklists for the game's largest EU market. Trust metrics cratered; moderators burned out on manual triage.
Content moderation for LLM products is not one API call. It is a layered policy engine: define what “harmful” means for your surface, scan ingress and egress at the right granularity, tune thresholds per category, route edge cases to humans, and measure drift. This guide covers moderation taxonomies, blocklists versus classifier models, placement in RAG and agent stacks, streaming egress scans, the Harbor Support chat refactor, a technique decision table, pitfalls, and a production checklist. Pair with guardrails for the broader safety architecture and human-in-the-loop for appeal queues and reviewer workflows.
Moderation taxonomy: what you are actually blocking
Vendors expose different label sets. Before picking tools, write an internal policy matrix that maps product risk to categories. Most production stacks track some subset of:
| Category | Examples | Typical action |
|---|---|---|
| Hate / harassment | Slurs, targeted insults, dehumanization | Block or mask; escalate repeat offenders |
| Violence / self-harm | Graphic gore, suicide instructions | Block; crisis-resource handoff where law requires |
| Sexual / minors | Explicit content, CSAM indicators | Hard block; legal reporting pipeline |
| Illegal activity | Weapons trafficking, fraud playbooks | Block; log for abuse team |
| Spam / manipulation | Phishing, astroturf, prompt-stuffing | Rate-limit, captcha, or shadow-ban |
| Brand / product policy | Competitor attacks, off-topic rage bait | Rewrite, refuse, or route to human |
PII leakage is usually a separate pipeline (see PII detection). Mixing PII redaction with hate-speech classifiers in one opaque score creates false positives on names and addresses. Keep policy packs composable: run PII scan, then toxicity scan, then domain-specific rules.
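One way to keep policy packs composable is to model each scan as a stage that receives the previous stage's output and appends its own flags. The sketch below is illustrative only: `ScanResult`, the stage functions, the email regex, the keyword check standing in for a classifier, and the `RivalGame` rule are all hypothetical names and rules, not part of any specific library.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ScanResult:
    text: str                                   # text handed to the next stage (may be redacted)
    flags: List[str] = field(default_factory=list)


Stage = Callable[[ScanResult], ScanResult]

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def pii_stage(result: ScanResult) -> ScanResult:
    # Redact first so later stages never score raw PII.
    redacted, count = EMAIL_RE.subn("[EMAIL]", result.text)
    return ScanResult(redacted, result.flags + (["pii:email"] if count else []))


def toxicity_stage(result: ScanResult) -> ScanResult:
    # Stand-in for a real classifier call; a toy keyword check is assumed here.
    hit = "idiot" in result.text.lower()
    return ScanResult(result.text, result.flags + (["toxicity:harassment"] if hit else []))


def domain_stage(result: ScanResult) -> ScanResult:
    # Hypothetical product rule: flag mentions of a competitor.
    hit = "rivalgame" in result.text.lower()
    return ScanResult(result.text, result.flags + (["policy:competitor"] if hit else []))


def run_pipeline(text: str, stages: List[Stage]) -> ScanResult:
    result = ScanResult(text)
    for stage in stages:
        result = stage(result)
    return result


outcome = run_pipeline(
    "Mail me at sam@example.com, you idiot",
    [pii_stage, toxicity_stage, domain_stage],
)
print(outcome.text, outcome.flags)
```

Because each stage reports its own flag namespace, a PII hit never inflates a toxicity score, and stages can be reordered or swapped per surface.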
Detection layers: blocklists, classifiers, and LLM judges
Teams stack techniques by latency and precision. Cheap layers run first; expensive judges run only on uncertain bands.
Blocklists and regex
Hash and Aho-Corasick matchers catch known slurs, doxxing patterns, and spam URLs in sub-millisecond time. Maintain per-locale lists — English-only blocklists miss leetspeak variants and diacritic homoglyphs common in multilingual chats. Version lists in git; hot-reload without redeploying the gateway.
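A minimal sketch of per-locale blocklist matching follows. It uses a compiled regex alternation from the standard library as a stand-in for an Aho-Corasick automaton, and the term lists are placeholder values; a production gateway would load versioned lists from files and rebuild the matchers on reload.

```python
import re
from typing import Dict, List

# Hypothetical placeholder terms; real lists would be loaded from versioned files.
BLOCKLISTS: Dict[str, List[str]] = {
    "en": ["badword", "spam.example"],
    "de": ["schimpfwort"],
}


def compile_matchers(lists: Dict[str, List[str]]) -> Dict[str, re.Pattern]:
    matchers = {}
    for locale, terms in lists.items():
        # Longest terms first so a longer term wins over one of its prefixes.
        alternation = "|".join(
            re.escape(term) for term in sorted(terms, key=len, reverse=True)
        )
        # Lookarounds require a non-word character (or string edge) on both sides.
        matchers[locale] = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
    return matchers


MATCHERS = compile_matchers(BLOCKLISTS)


def find_blocked(text: str, locale: str) -> List[str]:
    """Return lowercased blocked terms, checking the locale list and then English."""
    hits: List[str] = []
    for key in dict.fromkeys([locale, "en"]):   # de-duplicates while keeping order
        matcher = MATCHERS.get(key)
        if matcher is not None:
            hits.extend(m.group(0).lower() for m in matcher.finditer(text))
    return hits


print(find_blocked("So ein Schimpfwort! visit spam.example", "de"))
```

Calling `compile_matchers` again on a freshly loaded list and swapping `MATCHERS` is one simple way to approximate hot-reload without a redeploy.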
Classifier models
Small transformer or CNN classifiers (vendor APIs like OpenAI Moderation, open models like Llama Guard, or fine-tuned DistilBERT heads) output per-category probabilities. They generalize beyond exact strings but drift on slang and adversarial paraphrase. Run them on normalized text: Unicode NFKC, zero-width strip, repeated-character collapse.
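The three normalization steps named above can be sketched with the standard library alone. The zero-width character set and the choice to collapse runs of three or more characters down to two are assumptions; tune both to your own traffic.

```python
import re
import unicodedata

# Zero-width and invisible joiner characters mapped to None (deleted by translate).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Any character repeated three or more times in a row.
REPEATS = re.compile(r"(.)\1{2,}")


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # fold fullwidth and compatibility forms
    text = text.translate(ZERO_WIDTH)            # strip invisible separators
    text = REPEATS.sub(r"\1\1", text)            # "sooooo" becomes "soo"
    return text.casefold()                       # aggressive, locale-independent lowercasing


print(normalize("lo\u200bser"), normalize("sooooo baaad"))
```

NFKC folds compatibility characters such as fullwidth letters, but it does not map cross-script homoglyphs (for example Cyrillic lookalikes of Latin letters); those need a separate confusables table.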
LLM-as-judge
A secondary model with a rubric (“Does this violate policy section 4.2?”) catches nuanced harassment and context-dependent insults. Reserve it for appeals, sampled production traffic, and categories where false positives are costly (medical advice refusals vs actual harm). Never use a judge as your only real-time gate without a latency budget: 200–800 ms per call adds up in streaming UX.
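The layering described in this section can be sketched as a cascade in which each layer is a plain callable and the judge is only consulted for scores in the uncertain band. The function name, the threshold defaults, and the string verdicts are assumptions for illustration; on a latency-sensitive path the uncertain band could just as well be routed to human review instead of a judge.

```python
from typing import Callable


def cascade(
    text: str,
    blocklist: Callable[[str], bool],
    classifier: Callable[[str], float],
    judge: Callable[[str], bool],
    t_low: float = 0.2,    # assumed value, tune per category
    t_high: float = 0.8,   # assumed value, tune per category
) -> str:
    # Layer 1: cheapest check, exact known-bad strings.
    if blocklist(text):
        return "block"
    # Layer 2: classifier probability for the category.
    score = classifier(text)
    if score < t_low:
        return "allow"
    if score > t_high:
        return "block"
    # Layer 3: expensive judge, reached only for the uncertain band.
    return "block" if judge(text) else "allow"


# Toy stand-ins for the three layers.
verdict = cascade(
    "you people are the worst",
    blocklist=lambda t: "badword" in t,
    classifier=lambda t: 0.5,
    judge=lambda t: True,
)
print(verdict)
```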
Where to scan: ingress, context, egress, and tools
Harbor Support's incident was a placement failure, not a missing feature. Production moderation touches four surfaces:
- User ingress — latest message plus attachments before model call.
- Assembled context — system prompt, few-shots, and every RAG chunk in the window. Attackers and stale indexes inject policy violations here.
- Model egress — completion tokens, including streaming partials. A slur can appear in token 40; waiting for `finish_reason: stop` ships it to the client.
- Tool and agent outputs — search results, SQL rows, and third-party API bodies re-enter the conversation unmoderated unless scanned.
For streaming chat, implement a rolling buffer egress scanner: hold the last N tokens, run the classifier every K tokens or 50 ms, and truncate with a safe refusal template if a category crosses threshold. Log the truncated span for review without showing it to the user.
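A rolling-buffer egress scanner might look like the generator below. It is a sketch under stated assumptions: `score` stands for whatever classifier you call, the refusal text is a placeholder, only the token-count trigger is implemented (the time-based trigger is omitted), and tokens are held back until the scan covering them has passed, which adds up to `stride` tokens of display latency.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Optional

REFUSAL = "[Response withheld under our content policy.]"  # placeholder copy


def guarded_stream(
    tokens: Iterable[str],
    score: Callable[[str], float],
    window: int = 32,
    stride: int = 16,
    threshold: float = 0.8,
    on_block: Optional[Callable[[str], None]] = None,
) -> Iterator[str]:
    """Release tokens only after a scan has covered them; end with a refusal on violation."""
    if window < stride:
        raise ValueError("window must be at least as large as stride")
    recent: deque = deque(maxlen=window)   # rolling context handed to the classifier
    pending: List[str] = []                # generated but not yet released to the client

    def violates() -> bool:
        return score("".join(recent)) >= threshold

    blocked = False
    for token in tokens:
        recent.append(token)
        pending.append(token)
        if len(pending) >= stride:
            if violates():
                blocked = True
                break
            yield from pending
            pending = []
    # The stream ended with unscanned tokens still held back: scan them too.
    if not blocked and pending and violates():
        blocked = True
    if blocked:
        if on_block is not None:
            on_block("".join(pending))     # log the withheld span for review
        yield REFUSAL
    else:
        yield from pending


withheld: List[str] = []
shown = list(
    guarded_stream(
        ["a ", "b ", "BAD ", "c "],
        score=lambda s: 1.0 if "BAD" in s else 0.0,
        window=4,
        stride=2,
        on_block=withheld.append,
    )
)
print(shown, withheld)
```

The withheld span goes to the `on_block` callback rather than to the client, so reviewers can inspect it without the user ever seeing it.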
Thresholds, modes, and human appeals
Binary block-everything maximizes safety and destroys utility. Mature stacks use three bands per category:
- Auto-allow — score below T_low; pass through.
- Human review — score between T_low and T_high; queue for moderators with conversation snapshot.
- Auto-block — score above T_high; refuse with category-specific copy (not a generic “I can't help with that”).
Tune thresholds per surface: internal admin copilots tolerate different risk than public teen-facing games. Track precision/recall weekly on labeled samples; adversarial users move the distribution faster than your training set. Appeals that overturn blocks feed back into hard negatives for the next fine-tune cycle.
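The three bands and per-surface tuning can be expressed as a small lookup plus a routing function. The surface names and threshold values below are hypothetical; real values should come from your labeled evaluation data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Action(Enum):
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"


@dataclass(frozen=True)
class Band:
    t_low: float
    t_high: float


# Hypothetical per-(surface, category) thresholds.
THRESHOLDS: Dict[Tuple[str, str], Band] = {
    ("public_chat", "harassment"): Band(0.20, 0.70),
    ("admin_copilot", "harassment"): Band(0.40, 0.90),
}
DEFAULT_BAND = Band(0.30, 0.80)


def route(surface: str, category: str, score: float) -> Action:
    band = THRESHOLDS.get((surface, category), DEFAULT_BAND)
    if score < band.t_low:
        return Action.ALLOW
    if score > band.t_high:
        return Action.BLOCK
    return Action.REVIEW    # between T_low and T_high: queue for a moderator


print(route("public_chat", "harassment", 0.75))
print(route("admin_copilot", "harassment", 0.75))
```

The same score lands in different bands on different surfaces, which is the point of keeping thresholds out of the classifier and in configuration.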
Harbor Support chat refactor (worked example)
The refactor shipped in three PRs over nine days and covered four changes:
- Context-wide ingress scan — moderation runs on concatenated user message + top-k RAG chunks; chunks scoring above T_high are dropped from retrieval with a metric counter, not silently passed to the model.
- Streaming egress guard — 32-token rolling window with Llama Guard every 16 tokens; violations trigger `stop` and a templated apology plus ticket escalation link.
- Locale packs — DE/FR/ES blocklists and a weekly slang review ritual with community moderators; classifier thresholds lowered 0.05 on harassment for those locales after eval showed under-blocking.
- Appeal queue — false-positive reports create a HITL task; blocks overturned within 4 hours restored user trust scores.
Result: public slur-in-quote incidents dropped to zero over thirty days; false-positive rate on legitimate bug reports (users pasting crash logs with profanity) fell from 12% to 4% after allowlisting stack-trace patterns.
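The chunk-dropping step from the context-wide scan could be sketched as follows. This is not Harbor Support's actual code: `filter_chunks`, the in-process `Counter` standing in for a metrics client, and the default threshold are assumptions.

```python
from collections import Counter
from typing import Callable, List

METRICS: Counter = Counter()   # stand-in for a real metrics client


def filter_chunks(
    chunks: List[str],
    score: Callable[[str], float],
    t_high: float = 0.8,       # assumed value
) -> List[str]:
    """Drop retrieved chunks that score above t_high and count each drop."""
    kept = []
    for chunk in chunks:
        if score(chunk) > t_high:
            METRICS["rag_chunks_dropped"] += 1   # visible in dashboards, never silent
            continue
        kept.append(chunk)
    return kept


safe_chunks = filter_chunks(
    ["reset your password from the login page", "TOXIC old forum post"],
    score=lambda c: 0.95 if "TOXIC" in c else 0.05,
)
print(safe_chunks, METRICS["rag_chunks_dropped"])
```

A rising drop counter is also a useful signal that the index itself needs cleaning, not just the retrieval path.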
Technique decision table
| Approach | Best for | Weakness |
|---|---|---|
| System-prompt-only (“be nice”) | Prototypes, zero engineering time | No enforcement; trivially bypassed; no metrics |
| Blocklists only | Known spam URLs, exact slur lists | Paraphrase and multilingual evasion |
| Vendor moderation API | Fast launch, maintained labels | Vendor lock-in, latency, opaque drift |
| Self-hosted classifier | Data residency, custom categories | You own retraining and eval debt |
| Full guardrails platform | Multi-step agents, colang-style flows | Heavier ops; still not injection-proof |
| Human moderation only | Small closed betas | Does not scale; reactive not preventive |
Most teams combine blocklists + vendor or open classifier + sampled LLM judge + HITL appeals. Relying on prompt injection defenses without moderation is backwards: injection often aims to bypass policy, not merely exfiltrate data.
Common pitfalls
- Moderating only user text — model and RAG outputs cause most brand-damage incidents.
- Single global threshold — harassment scores differ by language; one cutoff under-blocks in some locales and over-blocks in others.
- No streaming egress scan — harmful tokens reach the UI before post-hoc batch jobs run.
- Opaque refusals — users retry with rephrased attacks; category-specific copy reduces adversarial loops.
- Ignoring moderator ergonomics — queues without context snippets burn out humans and slow appeals.
- Classifier scores in public logs — attackers reverse-engineer thresholds from telemetry.
- Conflating PII and toxicity — masking a username as “toxic” breaks legitimate support threads.
- No adversarial eval set — leetspeak, homoglyphs, and indirect hate slip through until Twitter finds them.
- Moderation after analytics export — toxic completions in warehouse tables become training data for the next model.
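Several of these pitfalls only surface when a labeled adversarial set is run against the filter. Below is a minimal sketch of such a check, assuming a tiny hand-labeled set with placeholder terms and two toy predictors; a real set would be far larger and the recall floor would come from your own policy.

```python
from typing import Callable, Iterable, Tuple

# (text, is_violation) pairs; placeholder terms stand in for real slurs.
RED_TEAM_SET = [
    ("b4dword", True),
    ("badword", True),
    ("bad word choice in the docs", False),
    ("crash log: null pointer", False),
]

# Assumed leetspeak mapping, intentionally small.
LEET = str.maketrans("4310", "aeio")


def naive_predict(text: str) -> bool:
    return "badword" in text


def hardened_predict(text: str) -> bool:
    return "badword" in text.translate(LEET)


def precision_recall(
    samples: Iterable[Tuple[str, bool]],
    predict: Callable[[str], bool],
) -> Tuple[float, float]:
    tp = fp = fn = 0
    for text, is_violation in samples:
        flagged = predict(text)
        if flagged and is_violation:
            tp += 1
        elif flagged:
            fp += 1
        elif is_violation:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def passes_gate(recall: float, floor: float = 0.95) -> bool:
    # A CI job could fail the deploy when this returns False.
    return recall >= floor


print(precision_recall(RED_TEAM_SET, naive_predict))
print(precision_recall(RED_TEAM_SET, hardened_predict))
```

On this toy set the naive matcher misses the leetspeak variant, so its recall falls below the gate while the hardened version passes.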
Production checklist
- Document policy categories and per-surface risk appetite.
- Run blocklists on normalized ingress text per supported locale.
- Scan full assembled context, not only the latest user turn.
- Implement streaming egress moderation with safe truncation templates.
- Define three-band thresholds per category, with human review for scores in the middle band.
- Separate PII pipeline from toxicity classifiers.
- Log block events with hashed user id and category, not raw slurs in plaintext.
- Weekly sample labeled traffic; track precision, recall, and appeal overturn rate.
- Maintain adversarial red-team set in CI; fail deploy on regression.
- Route appeals to HITL with full thread snapshot and one-click overturn.
- Sanitize moderated content before analytics and fine-tuning exports.
- Publish transparency copy: what is blocked, how to appeal, crisis resources where required.
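For the logging item in the checklist, one possible shape of a block event is sketched below. The field names and the use of a keyed HMAC (rather than a bare hash, which is easy to reverse for guessable user ids) are assumptions; the event deliberately carries neither the raw text nor the classifier score.

```python
import hashlib
import hmac
import json
import time


def block_event(user_id: str, category: str, surface: str, secret: bytes) -> str:
    """Build a JSON log line for a block without raw content or scores."""
    # Keyed hash: the same user always maps to the same token, but the id
    # cannot be recovered or brute-forced without the secret.
    user_hash = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    event = {
        "ts": int(time.time()),
        "event": "moderation_block",
        "user": user_hash,
        "category": category,
        "surface": surface,
    }
    return json.dumps(event, sort_keys=True)


# The secret here is a placeholder; load a real one from your secret store.
line = block_event("user-1234", "harassment", "public_chat", secret=b"example-secret")
print(line)
```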
Key takeaways
- Moderation is a pipeline placement problem — ingress, context, egress, and tools.
- Layer blocklists, classifiers, and judges; no single technique survives adversarial use.
- Threshold bands plus human appeals beat binary block-everything for real products.
- Streaming requires token-level egress scans, not post-completion batch jobs.
- Measure drift weekly; slang and attack patterns move faster than model releases.
Related reading
- LLM guardrails explained — input filters, output validators, and action allowlists
- LLM PII detection and redaction explained — separate pipeline for regulated data
- LLM human-in-the-loop explained — reviewer queues and escalation design
- LLM red teaming explained — adversarial eval before launch