
LLM content moderation and toxicity filters explained

Harbor Support's first public beta shipped with a single line in the system prompt: “Be respectful.” Within forty-eight hours, community screenshots showed the assistant quoting a user's slur back in a troubleshooting reply, and a RAG chunk from an old forum thread reintroduced harassment language the model then paraphrased as policy. Engineering had wired OpenAI's moderation endpoint — but only on the latest user message, never on retrieved context, never on streaming tokens as they left the model, and never with locale-specific blocklists for the game's largest EU market. Trust metrics cratered; moderators burned out on manual triage.

Content moderation for LLM products is not one API call. It is a layered policy engine: define what “harmful” means for your surface, scan ingress and egress at the right granularity, tune thresholds per category, route edge cases to humans, and measure drift. This guide covers moderation taxonomies, blocklists versus classifier models, placement in RAG and agent stacks, streaming egress scans, the Harbor Support chat refactor, a technique decision table, pitfalls, and a production checklist. Pair with guardrails for the broader safety architecture and human-in-the-loop for appeal queues and reviewer workflows.

Moderation taxonomy: what you are actually blocking

Vendors expose different label sets. Before picking tools, write an internal policy matrix that maps product risk to categories. Most production stacks track some subset of:

| Category | Examples | Typical action |
| --- | --- | --- |
| Hate / harassment | Slurs, targeted insults, dehumanization | Block or mask; escalate repeat offenders |
| Violence / self-harm | Graphic gore, suicide instructions | Block; crisis-resource handoff where law requires |
| Sexual / minors | Explicit content, CSAM indicators | Hard block; legal reporting pipeline |
| Illegal activity | Weapons trafficking, fraud playbooks | Block; log for abuse team |
| Spam / manipulation | Phishing, astroturf, prompt-stuffing | Rate-limit, captcha, or shadow-ban |
| Brand / product policy | Competitor attacks, off-topic rage bait | Rewrite, refuse, or route to human |

PII leakage is usually a separate pipeline (see PII detection). Mixing PII redaction with hate-speech classifiers in one opaque score creates false positives on names and addresses. Keep policy packs composable: run PII scan, then toxicity scan, then domain-specific rules.
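
A minimal sketch of composable policy packs follows, assuming each pack is a plain function that takes and returns a scan result. The names `ScanResult`, `pii_stage`, `toxicity_stage`, and `domain_stage` are hypothetical, and the keyword checks stand in for real detectors.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ScanResult:
    text: str                                   # text handed to the next stage
    flags: List[str] = field(default_factory=list)


Stage = Callable[[ScanResult], ScanResult]

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def pii_stage(result: ScanResult) -> ScanResult:
    # Redact first so later stages never score raw addresses.
    redacted = EMAIL_RE.sub("[EMAIL]", result.text)
    flags = result.flags + (["pii:email"] if redacted != result.text else [])
    return ScanResult(redacted, flags)


def toxicity_stage(result: ScanResult) -> ScanResult:
    # Stand-in for a real classifier call.
    hit = "idiot" in result.text.lower()
    return ScanResult(result.text, result.flags + (["toxicity:harassment"] if hit else []))


def domain_stage(result: ScanResult) -> ScanResult:
    # Stand-in for product-specific rules ("rivalgame" is a made-up competitor).
    hit = "rivalgame" in result.text.lower()
    return ScanResult(result.text, result.flags + (["brand:competitor"] if hit else []))


def run_pipeline(text: str, stages: List[Stage]) -> ScanResult:
    result = ScanResult(text)
    for stage in stages:
        result = stage(result)
    return result


result = run_pipeline(
    "Mail me at a.b@example.com, you idiot",
    [pii_stage, toxicity_stage, domain_stage],
)
print(result.text)   # Mail me at [EMAIL], you idiot
print(result.flags)  # ['pii:email', 'toxicity:harassment']
```

Because each stage records its own namespaced flag, a false positive can be traced to the pack that raised it instead of to one blended score.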

Detection layers: blocklists, classifiers, and LLM judges

Teams stack techniques by latency and precision. Cheap layers run first; expensive judges run only on uncertain bands.

Blocklists and regex

Hash and Aho-Corasick matchers catch known slurs, doxxing patterns, and spam URLs in sub-millisecond time. Maintain per-locale lists — English-only blocklists miss leetspeak variants and diacritic homoglyphs common in multilingual chats. Version lists in git; hot-reload without redeploying the gateway.
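
As an illustration of per-locale lists, the sketch below compiles one matcher per locale. It uses a standard-library regex alternation as a stand-in for an Aho-Corasick automaton, and the list contents (`badword`, `schimpfwort`, `spam.example`) are placeholders, not real blocklist entries.

```python
import re
from typing import Dict, List, Pattern

# Placeholder terms; in practice these would be loaded from versioned files.
BLOCKLISTS: Dict[str, List[str]] = {
    "en": ["badword", "spam.example"],
    "de": ["schimpfwort"],
}


def compile_locale(terms: List[str]) -> Pattern[str]:
    # Longest terms first so overlapping entries prefer the longer match.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # The lookarounds require word boundaries, so "badwords" does not match.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


MATCHERS: Dict[str, Pattern[str]] = {
    locale: compile_locale(terms) for locale, terms in BLOCKLISTS.items()
}


def blocklist_hits(text: str, locale: str) -> List[str]:
    matcher = MATCHERS.get(locale)
    if matcher is None:
        return []  # unknown locale: no list loaded, so nothing can match
    return [match.group(0).lower() for match in matcher.finditer(text)]


print(blocklist_hits("That is a BadWord, visit spam.example now", "en"))
# ['badword', 'spam.example']
```

Returning an empty list for an unknown locale is a choice made here for brevity; a production gateway might prefer to fall back to a default list or raise an alert.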

Classifier models

Small transformer or CNN classifiers (vendor APIs like OpenAI Moderation, open models like Llama Guard, or fine-tuned DistilBERT heads) output per-category probabilities. They generalize beyond exact strings but drift on slang and adversarial paraphrase. Run them on normalized text: Unicode NFKC, zero-width strip, repeated-character collapse.
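
The three normalization steps named above can be sketched with the standard library alone. The zero-width character set and the choice to collapse runs of three or more characters down to two are assumptions for illustration.

```python
import re
import unicodedata

# Zero-width and invisible characters commonly used to split words (assumed set).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Three or more repeats of the same character.
REPEATS = re.compile(r"(.)\1{2,}")


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms, e.g. fullwidth letters
    text = text.translate(ZERO_WIDTH)           # delete zero-width characters
    return REPEATS.sub(r"\1\1", text)           # "soooo" -> "soo"


print(normalize("st\u200bupid"))  # stupid
print(normalize("Soooo"))         # Soo
```

NFKC does not map look-alike letters across scripts (for example, a Cyrillic letter that resembles a Latin one), so homoglyph handling would need a separate confusables table.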

LLM-as-judge

A secondary model with a rubric (“Does this violate policy section 4.2?”) catches nuanced harassment and context-dependent insults. Reserve it for appeals, sampled production traffic, and categories where false positives are costly (medical advice refusals vs actual harm). Never use a judge as your only real-time gate without a latency budget: 200–800 ms per call adds up in streaming UX.
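
One way to wire the layers together is a cascade in which the judge runs only when the classifier score is uncertain. The sketch below assumes the three detectors are injected as callables; the function name `cascade` and the default cutoffs are hypothetical.

```python
from typing import Callable


def cascade(
    text: str,
    blocklist_hit: Callable[[str], bool],
    classifier_score: Callable[[str], float],
    judge_violates: Callable[[str], bool],
    t_low: float = 0.2,
    t_high: float = 0.8,
) -> str:
    # Layer 1: cheapest check, runs on everything.
    if blocklist_hit(text):
        return "block"
    # Layer 2: classifier settles the confident cases.
    score = classifier_score(text)
    if score < t_low:
        return "allow"
    if score > t_high:
        return "block"
    # Layer 3: the slow judge is called only for the uncertain band.
    return "block" if judge_violates(text) else "allow"
```

With stub detectors, a score of 0.1 or 0.9 never reaches the judge, while a score of 0.5 does, which keeps the judge's latency off the common path.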

Where to scan: ingress, context, egress, and tools

Harbor Support's incident was a placement failure, not a missing feature. Production moderation touches four surfaces:

  • User ingress — latest message plus attachments before model call.
  • Assembled context — system prompt, few-shots, and every RAG chunk in the window. Attackers and stale indexes inject policy violations here.
  • Model egress — completion tokens, including streaming partials. A slur can appear in token 40; waiting for finish_reason: stop ships it to the client.
  • Tool and agent outputs — search results, SQL rows, and third-party API bodies re-enter the conversation unmoderated unless scanned.

For streaming chat, implement a rolling buffer egress scanner: hold the last N tokens, run the classifier every K tokens or 50 ms, and truncate with a safe refusal template if a category crosses threshold. Log the truncated span for review without showing it to the user.
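
A rolling-buffer scanner of this kind might look like the sketch below. It is token-count driven only (the 50 ms timer is omitted), `score` is an assumed callable returning a 0 to 1 risk value, and `REFUSAL` is placeholder copy. Tokens are held back until the window that contains them has been scored, so `stride` should not exceed `window`.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Optional

REFUSAL = "[This reply was stopped by our content policy.]"


def guarded_stream(
    tokens: Iterable[str],
    score: Callable[[str], float],
    window: int = 32,
    stride: int = 16,
    threshold: float = 0.8,
    audit_log: Optional[List[str]] = None,
) -> Iterator[str]:
    recent: Deque[str] = deque(maxlen=window)  # scoring context: last `window` tokens
    held: List[str] = []                       # received but not yet sent to the client

    def flush() -> Optional[List[str]]:
        """Return held tokens if the window is clean, else log them and return None."""
        if score("".join(recent)) > threshold:
            if audit_log is not None:
                audit_log.append("".join(held))  # kept for review, never shown
            return None
        released = list(held)
        held.clear()
        return released

    for token in tokens:
        recent.append(token)
        held.append(token)
        if len(held) >= stride:
            released = flush()
            if released is None:
                yield REFUSAL
                return
            yield from released

    if held:  # final scan for the tail of the stream
        released = flush()
        if released is None:
            yield REFUSAL
            return
        yield from released
```

The trade-off is latency: the client sees text in bursts of `stride` tokens, and anything already released before a violation cannot be recalled, which is why the window includes previously sent tokens as scoring context.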

Thresholds, modes, and human appeals

Binary block-everything maximizes safety and destroys utility. Mature stacks use three bands per category:

  • Auto-allow — score below T_low; pass through.
  • Human review — score between T_low and T_high; queue for moderators with conversation snapshot.
  • Auto-block — score above T_high; refuse with category-specific copy (not a generic “I can't help with that”).
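
The three bands can be expressed as a small lookup, sketched here with made-up threshold values. `THRESHOLDS`, `band`, and `decide` are hypothetical names; the final decision takes the strictest action across categories.

```python
from enum import IntEnum
from typing import Dict, Tuple


class Action(IntEnum):
    ALLOW = 0
    REVIEW = 1
    BLOCK = 2


# (T_low, T_high) per category; values are illustrative, not recommendations.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "harassment": (0.30, 0.85),
    "self_harm": (0.10, 0.60),
}


def band(category: str, score: float) -> Action:
    t_low, t_high = THRESHOLDS[category]
    if score < t_low:
        return Action.ALLOW
    if score > t_high:
        return Action.BLOCK
    return Action.REVIEW  # between T_low and T_high: queue for a moderator


def decide(scores: Dict[str, float]) -> Action:
    # The strictest per-category action wins.
    return max((band(c, s) for c, s in scores.items()), default=Action.ALLOW)


print(decide({"harassment": 0.5, "self_harm": 0.05}).name)  # REVIEW
```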

Tune thresholds per surface: internal admin copilots tolerate different risk than public teen-facing games. Track precision/recall weekly on labeled samples; adversarial users move the distribution faster than your training set. Appeals that overturn blocks feed back into hard negatives for the next fine-tune cycle.
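
For the weekly check, precision and recall at a candidate threshold can be computed directly from labeled samples. The sketch assumes each sample is a `(score, is_violation)` pair, and it reports 0.0 when a ratio is undefined.

```python
from typing import Iterable, Tuple


def precision_recall(
    samples: Iterable[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    tp = fp = fn = 0
    for score, is_violation in samples:
        predicted = score > threshold
        if predicted and is_violation:
            tp += 1
        elif predicted and not is_violation:
            fp += 1
        elif not predicted and is_violation:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


labeled = [(0.9, True), (0.7, False), (0.4, True), (0.1, False)]
print(precision_recall(labeled, 0.5))  # (0.5, 0.5)
```

Running the same function across several thresholds per locale shows where lowering a cutoff buys recall and what it costs in precision.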

Harbor Support chat refactor (worked example)

The refactor shipped in four PRs over nine days:

  1. Context-wide ingress scan: moderation runs on the user message and on each of the top-k RAG chunks; chunks scoring above T_high are dropped from retrieval and counted in a metric, not silently passed to the model.
  2. Streaming egress guard: 32-token rolling window with Llama Guard every 16 tokens; violations trigger a stop, a templated apology, and a ticket escalation link.
  3. Locale packs: DE/FR/ES blocklists and a weekly slang review with community moderators; classifier thresholds lowered by 0.05 on harassment for those locales after evaluation showed under-blocking.
  4. Appeal queue: false-positive reports create a HITL task; blocks overturned within 4 hours restored user trust scores.

Result: public slur-in-quote incidents dropped to zero over thirty days; false-positive rate on legitimate bug reports (users pasting crash logs with profanity) fell from 12% to 4% after allowlisting stack-trace patterns.
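
Two pieces of the refactor lend themselves to a short sketch: dropping high-scoring RAG chunks with a counter, and masking stack-trace lines before scoring. The source does not describe how the allowlist was implemented, so `mask_stack_traces` and its regex are assumptions, as are the metric name and the `score` callable.

```python
import re
from collections import Counter
from typing import Callable, List

metrics: Counter = Counter()  # stand-in for a real metrics client

# Assumed patterns for Python and Java-style stack-trace lines.
STACK_FRAME = re.compile(
    r'^\s*(?:Traceback \(most recent call last\):|File ".*", line \d+.*|at \S+\(.*\))\s*$'
)


def filter_chunks(
    chunks: List[str], score: Callable[[str], float], t_high: float
) -> List[str]:
    kept = []
    for chunk in chunks:
        if score(chunk) > t_high:
            metrics["rag_chunks_dropped"] += 1  # visible drop, not a silent one
        else:
            kept.append(chunk)
    return kept


def mask_stack_traces(text: str) -> str:
    # Remove lines that look like stack frames before toxicity scoring.
    lines = text.splitlines()
    return "\n".join(line for line in lines if not STACK_FRAME.match(line))
```

Text outside the matched lines is still scored, but a pattern allowlist like this can be abused by wrapping abusive text in trace-like lines, so it would need its own adversarial tests.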

Technique decision table

| Approach | Best for | Weakness |
| --- | --- | --- |
| System-prompt-only (“be nice”) | Prototypes, zero engineering time | No enforcement; trivially bypassed; no metrics |
| Blocklists only | Known spam URLs, exact slur lists | Paraphrase and multilingual evasion |
| Vendor moderation API | Fast launch, maintained labels | Vendor lock-in, latency, opaque drift |
| Self-hosted classifier | Data residency, custom categories | You own retraining and eval debt |
| Full guardrails platform | Multi-step agents, colang-style flows | Heavier ops; still not injection-proof |
| Human moderation only | Small closed betas | Does not scale; reactive not preventive |

Most teams combine blocklists + vendor or open classifier + sampled LLM judge + HITL appeals. Relying on prompt injection defenses without moderation is backwards: injection often aims to bypass policy, not merely exfiltrate data.

Common pitfalls

  • Moderating only user text — model and RAG outputs cause most brand-damage incidents.
  • Single global threshold — harassment scores differ by language; one cutoff under-blocks everywhere or over-blocks somewhere.
  • No streaming egress scan — harmful tokens reach the UI before post-hoc batch jobs run.
  • Opaque refusals — users retry with rephrased attacks; category-specific copy reduces adversarial loops.
  • Ignoring moderator ergonomics — queues without context snippets burn out humans and slow appeals.
  • Classifier scores in public logs — attackers reverse-engineer thresholds from telemetry.
  • Conflating PII and toxicity — masking a username as “toxic” breaks legitimate support threads.
  • No adversarial eval set — leetspeak, homoglyphs, and indirect hate slip through until Twitter finds them.
  • Moderation after analytics export — toxic completions in warehouse tables become training data for the next model.
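
An adversarial eval set can be as simple as a list of known evasions run against the detector on every build. In the sketch below, `toy_detector` is a stand-in for the real moderation stack, `badword` is a placeholder term, and `gate` fails only on misses that are not already acknowledged.

```python
import unicodedata
from typing import Callable, Iterable, List

LEET = str.maketrans("4130", "aieo")  # 4->a, 1->i, 3->e, 0->o


def toy_detector(text: str) -> bool:
    folded = unicodedata.normalize("NFKC", text).lower().translate(LEET)
    return "badword" in folded


ADVERSARIAL_CASES = [
    "badword",
    "B4DW0RD",          # leetspeak
    "b\uff41dword",     # fullwidth homoglyph
    "b a d w o r d",    # spacing evasion
]


def missed_cases(cases: Iterable[str], detector: Callable[[str], bool]) -> List[str]:
    return [case for case in cases if not detector(case)]


def gate(cases: Iterable[str], detector: Callable[[str], bool], known_misses: List[str]) -> List[str]:
    missed = missed_cases(cases, detector)
    new = [case for case in missed if case not in known_misses]
    if new:
        raise RuntimeError(f"moderation regression: {len(new)} new missed case(s)")
    return missed


print(missed_cases(ADVERSARIAL_CASES, toy_detector))  # ['b a d w o r d']
```

Tracking known misses separately lets the gate block regressions without pretending that current coverage is complete.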

Production checklist

  • Document policy categories and per-surface risk appetite.
  • Run blocklists on normalized ingress text per supported locale.
  • Scan full assembled context, not only the latest user turn.
  • Implement streaming egress moderation with safe truncation templates.
  • Define three-band thresholds per category, with human review for scores between T_low and T_high.
  • Separate PII pipeline from toxicity classifiers.
  • Log block events with hashed user id and category, not raw slurs in plaintext.
  • Weekly sample labeled traffic; track precision, recall, and appeal overturn rate.
  • Maintain adversarial red-team set in CI; fail deploy on regression.
  • Route appeals to HITL with full thread snapshot and one-click overturn.
  • Sanitize moderated content before analytics and fine-tuning exports.
  • Publish transparency copy: what is blocked, how to appeal, crisis resources where required.
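
For the logging item, a block event might carry a keyed hash of the user id plus the category and surface, and nothing else. This is a sketch under assumptions: the field names are hypothetical, the secret would come from a secrets manager, and the classifier score is left out deliberately so thresholds cannot be inferred from logs.

```python
import hashlib
import hmac
import json
import time


def block_event(user_id: str, category: str, surface: str, secret: bytes) -> str:
    # Keyed hash: stable per user for repeat-offender counts, not reversible without the key.
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    event = {
        "ts": int(time.time()),
        "user": digest[:16],
        "category": category,
        "surface": surface,
        # No raw text and no score are written here.
    }
    return json.dumps(event, sort_keys=True)


line = block_event("user-42", "harassment", "egress", secret=b"rotate-me")
print("user-42" in line)  # False
```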

Key takeaways

  • Moderation is a pipeline placement problem — ingress, context, egress, and tools.
  • Layer blocklists, classifiers, and judges; no single technique survives adversarial use.
  • Threshold bands plus human appeals beat binary block-everything for real products.
  • Streaming requires token-level egress scans, not post-completion batch jobs.
  • Measure drift weekly; slang and attack patterns move faster than model releases.
