Guide

LLM watermarking and detection explained

Harbor Media's community story contest received 4,200 submissions in one week; moderators suspected that 38% were pasted from external chatbots rather than original prose. A naive keyword blocklist caught almost nothing. Engineers deployed a two-layer stack: a generation-time green-list watermark on the in-house drafting assistant (so any text produced inside Harbor carried a verifiable statistical signature), plus a post-hoc detector on pasted uploads using a Binoculars-style cross-model likelihood score. False-positive rate on a 2,000-piece human validation set stayed under 1.2% at the chosen threshold; confirmed synthetic spam dropped 71% without banning legitimate non-native writers. The refactor did not try to “prove” authorship in court — it gave moderators a ranked queue and an appeals path.

LLM watermarking embeds a detectable signal during token sampling; synthetic-text detection tries to classify prose as machine- or human-written after the fact, often without access to the generator. Neither is perfect: paraphrasing, light editing, and translation all erode both. This guide covers green-list logits biasing, soft watermark variants, statistical detectors (DetectGPT, Binoculars), classifier and embedding approaches, attack surfaces, content-provenance standards like C2PA, the Harbor Media moderation refactor, a decision table that compares these techniques with guardrails and human review, evaluation metrics, pitfalls, and a production checklist.

Why authenticity signals matter

Platforms face three overlapping problems that watermarking and detection address differently:

Spam and policy abuse — bulk LLM posts in forums, reviews, and contests that crowd out human contributors.
Misinformation risk — plausible synthetic news or medical advice without accountable sourcing (detection is a weak substitute for fact-checking, but helps triage).
Trust and disclosure — readers and regulators increasingly expect labels when content is model-assisted; some jurisdictions are moving toward mandatory disclosure for certain categories.

Watermarking answers: “Did this model emit this text with our marking scheme?” Detection answers: “Does this text look like typical LLM output?” The second is broader but noisier; the first is precise only when you control generation and the secret key.

Generation-time watermarking: green lists and logits bias

The dominant open recipe (Kirchenbauer et al.) partitions the vocabulary into a green list and red list per context. The split is pseudorandom from a secret key and the preceding tokens, so only someone with the key knows which tokens are green. During sampling, logits for green tokens receive a small positive bias δ; the model still chooses freely but statistically favors green tokens more often than an unwatermarked model would.
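The biasing step can be sketched in a few lines. This is a minimal illustration, assuming a toy vocabulary of integer token IDs and a SHA-256 hash of the key plus the previous token as the seed. The function names (`green_list`, `bias_logits`), the hash choice, and the green fraction `gamma = 0.5` are assumptions made for the example, not the reference implementation.

```python
import hashlib
import math
import random
from typing import List, Set


def green_list(prev_token: int, key: bytes, vocab_size: int, gamma: float = 0.5) -> Set[int]:
    """Pick a pseudorandom gamma-fraction of the vocabulary, seeded by key + previous token."""
    digest = hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def bias_logits(logits: List[float], prev_token: int, key: bytes,
                delta: float = 2.0, gamma: float = 0.5) -> List[float]:
    """Add delta to every green-token logit; red-token logits are left unchanged."""
    green = green_list(prev_token, key, len(logits), gamma)
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With uniform logits over ten tokens and δ = 2.0, the five green tokens together receive e²/(e² + 1), roughly 88%, of the probability mass instead of 50%. On a real model the shift is smaller wherever one token already dominates, which is why low-entropy passages carry less watermark signal.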

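A detector with the key recomputes the green list at each position, counts green-list hits across a passage, and runs a one-sided z-test against the null hypothesis that unwatermarked text lands on green tokens at the base rate γ (the fraction of the vocabulary that is green). Longer texts yield higher confidence. Key properties: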

Low perplexity impact at modest δ — quality degradation is often small compared to the detectability gain.
Key secrecy is the security boundary: the algorithm itself can be public, but anyone who holds the key can recompute each context's green list and wash the watermark by resampling toward red tokens.
Provider-only by default — API vendors can watermark their completions; third parties cannot verify without cooperation.
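The hypothesis test reduces to one formula. Under the null hypothesis that each of T tokens is green with probability gamma, the green count g has mean gamma·T and variance T·gamma·(1 − gamma), so z = (g − gamma·T) / sqrt(T·gamma·(1 − gamma)). The sketch below assumes the green hits have already been counted with the key; the function names are illustrative.

```python
import math


def watermark_z_score(green_hits: int, total_tokens: int, gamma: float = 0.5) -> float:
    """One-proportion z-score for the observed green-token count."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_hits - expected) / std


def one_sided_p_value(z: float) -> float:
    """Probability of a z-score at least this large under the null (normal approximation)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A passage with 70% green tokens gives z of about 1.79 at 20 tokens (14 hits) but about 5.66 at 200 tokens (140 hits), which shows concretely why longer texts yield higher confidence.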

The δ-bias scheme above is the soft watermark; the hard variant forbids red-list tokens outright, which is easier to detect but degrades quality on low-entropy text. Other variants include syntax-aware marks that skip code blocks and multi-key rotation, so that a leaked key compromises only the text marked under that key. Hugging Face and several inference stacks expose experimental watermark hooks; production deployments should version the scheme and log which key ID was active per completion.
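One way to combine rotation with per-completion logging is to derive each period's key from a master secret and record only the key ID. The sketch below is an assumption about how such a scheme might look: the HMAC-based derivation, the `CompletionRecord` fields, and the `greenlist-v1` version string are all hypothetical.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass

SCHEME_VERSION = "greenlist-v1"  # hypothetical version label


def derive_key(master_secret: bytes, key_id: str) -> bytes:
    """Derive the watermark key for one rotation period (assumed HMAC-SHA256 derivation)."""
    return hmac.new(master_secret, key_id.encode("utf-8"), hashlib.sha256).digest()


@dataclass(frozen=True)
class CompletionRecord:
    completion_id: str
    scheme_version: str
    key_id: str   # identifies the key; the key itself is never logged
    delta: float


def log_line(record: CompletionRecord) -> str:
    """Serialize the audit record as one JSON line."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because the log stores the key ID rather than the key, a verifier can later re-derive the right key for any historical completion, and a leaked period key exposes only that period.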

Post-hoc detection without a watermark key

When text arrives from unknown sources — pasted essays, scraped articles, competitor APIs — you need detectors that do not rely on a shared secret.

Likelihood and curvature methods

DetectGPT (and successors) observe that LLM-generated passages often sit near local maxima of the model's log-likelihood: small perturbations (mask-and-resample nearby words) reduce likelihood more for machine text than for human text. The curvature statistic aggregates these drops into a single score. It is model-dependent (you need a strong scorer LM) and much slower than single-pass scoring, because every passage needs one forward pass for each of its perturbed copies.
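The aggregation step can be sketched independently of any particular model. In the sketch below, `log_likelihood` is a hypothetical callable wrapping whichever scorer LM you use, and `perturbations` is assumed to have been produced elsewhere by a mask-and-resample model; only the statistic itself is shown.

```python
from statistics import mean, stdev
from typing import Callable, Sequence


def perturbation_discrepancy(
    text: str,
    perturbations: Sequence[str],
    log_likelihood: Callable[[str], float],
    normalize: bool = True,
) -> float:
    """Original log-likelihood minus the mean over perturbed copies.

    Larger values mean the text sits closer to a local likelihood peak,
    which this family of methods treats as evidence of machine generation.
    """
    if len(perturbations) < 2:
        raise ValueError("need at least two perturbations")
    original = log_likelihood(text)
    perturbed = [log_likelihood(p) for p in perturbations]
    gap = original - mean(perturbed)
    if not normalize:
        return gap
    spread = stdev(perturbed)  # sample standard deviation of perturbed scores
    return gap / spread if spread > 0 else 0.0
```

The cost is visible in the signature: one scorer call for the original plus one per perturbation, so a budget of 50 perturbations means roughly 51 forward passes per passage.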

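Binoculars compares two closely related models, a performer and an observer. Its score is the ratio of one model's log-perplexity on the text to the cross-perplexity between the two models' next-token predictions, which normalizes for passages that are intrinsically surprising (an unusual prompt, rare names). In the original formulation, lower scores indicate machine text, so a pipeline that wants higher to mean more synthetic must invert or calibrate the score. Binoculars often beats raw perplexity thresholds on short passages and is a common choice for moderation queues.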
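A toy version of the ratio, assuming you already have each model's full next-token probability distribution at every position. Which model supplies the perplexity term and which supplies the weighting distribution is an assumption in this sketch; check the reference implementation before relying on the role assignment.

```python
import math
from typing import Sequence


def binoculars_style_score(
    token_ids: Sequence[int],
    performer_probs: Sequence[Sequence[float]],
    observer_probs: Sequence[Sequence[float]],
) -> float:
    """Log-perplexity divided by cross-perplexity; lower suggests machine text.

    performer_probs[i][v] is the performer's probability of vocab item v at
    position i (all entries must be > 0); observer_probs has the same layout.
    """
    n = len(token_ids)
    # Mean negative log-probability of the tokens that actually occurred.
    log_ppl = -sum(math.log(performer_probs[i][t]) for i, t in enumerate(token_ids)) / n
    # Mean cross-entropy between the observer's and performer's distributions.
    cross_ppl = -sum(
        sum(q * math.log(p) for q, p in zip(observer_probs[i], performer_probs[i]))
        for i in range(n)
    ) / n
    return log_ppl / cross_ppl
```

The denominator depends only on the two models' distributions, not on which tokens were chosen, so text that follows the models' top predictions scores low while text that keeps picking unlikely tokens scores high.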

Supervised classifiers

Fine-tuned RoBERTa- or DeBERTa-style classifiers on human vs synthetic corpora can be fast at inference. They overfit to generator families seen in training and degrade when a new model (e.g. a reasoning-native release) shifts style. Retrain on fresh synthetic data from your threat models, stratified by domain and length.

Embeddings and ensemble scores

Some pipelines embed passages and compare to centroids of known human or synthetic clusters, or stack detector outputs with a lightweight meta-classifier. Ensembles reduce single-method false positives but add operational complexity.
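Both ideas fit in a short sketch: a centroid margin over embeddings, and a logistic meta-score that stacks detector outputs. The embedding vectors, feature order, weights, and bias are placeholders; in practice the weights would be fit on labeled data rather than set by hand.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a cluster of embeddings."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def centroid_margin(embedding: Sequence[float],
                    human_centroid: Sequence[float],
                    synthetic_centroid: Sequence[float]) -> float:
    """Positive when the passage is closer to the synthetic cluster."""
    return cosine(embedding, synthetic_centroid) - cosine(embedding, human_centroid)


def stacked_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic meta-classifier over individual detector outputs."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A typical feature vector might hold the centroid margin, a likelihood-based score, and a classifier probability, each logged separately so that an appeal can show which detector drove the decision.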

Attacks, limitations, and what detection cannot do

Assume motivated adversaries. Common evasions:

Paraphrase — another LLM rewrite strips green-list statistics and flattens curvature signals; this is the hardest attack for both watermark verification and DetectGPT-style methods.
Human edit passes — even 15–20% token substitution by a human co-author can drop detection AUC materially; policy must define “substantially machine-generated.”
Translation round-trips — generate in English, publish in Spanish via machine translation; detectors trained on one language lag.
Adversarial suffixes — appended gibberish optimized against a detector (similar in spirit to jailbreak suffix attacks on red-team evals).
False positives on formulaic human text — legal boilerplate, ESL learners with repetitive structure, and technical docs can score “synthetic.” Never auto-ban on detector score alone.

Detection is not proof of factual correctness or intent. A human can publish harmful truth; a model can publish harmless fiction. Use scores for queue priority, disclosure badges, and rate limits — not sole grounds for account termination without appeal.

Content provenance: C2PA and signed metadata

Cryptographic content provenance (C2PA, Content Credentials) attaches signed manifests to images, audio, and, increasingly, text. A manifest records which tool created or edited an asset, when, and under what policy. Unlike statistical watermarks, provenance can survive format changes if clients preserve manifests, but adoption is patchy on the open web, and screenshots strip metadata.

Practical split: use watermarking on your own generator for first-party assistant output; use detectors on untrusted uploads; adopt C2PA where your CMS and CDN preserve credentials (common for images and video pipelines). The three layers complement rather than replace each other.
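For plain text kept inside your own systems, the underlying idea (a signed record bound to a content hash) can be illustrated with a JSON sidecar. This sketch is deliberately not C2PA: real manifests use a standardized container and certificate-based signatures, whereas this stand-in uses a shared-secret HMAC, and every field name here is hypothetical.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def _canonical(payload: Dict[str, Any]) -> bytes:
    """Stable byte encoding so signer and verifier hash identical input."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_sidecar(text: str, tool: str, created_at: str, secret: bytes) -> Dict[str, Any]:
    """Build a sidecar that binds tool and timestamp to the text's SHA-256 hash."""
    payload = {
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "tool": tool,
        "created_at": created_at,
    }
    signature = hmac.new(secret, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_sidecar(text: str, sidecar: Dict[str, Any], secret: bytes) -> bool:
    """True only if the signature is valid and the text still matches the hash."""
    payload = sidecar["payload"]
    expected = hmac.new(secret, _canonical(payload), hashlib.sha256).hexdigest()
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    signature_ok = hmac.compare_digest(expected, sidecar["signature"])
    return signature_ok and payload["content_sha256"] == content_hash
```

Any edit to the text invalidates the sidecar, which is the opposite failure mode from a statistical watermark: the watermark degrades gradually under editing, while a hash-bound record fails completely on the first changed character.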

Harbor Media refactor: worked example

Before the refactor, Harbor Media's contest pipeline ran a single RoBERTa detector with a 4.8% false-positive rate on human stories, and moderators were burning out week after week. After the refactor:

In-house drafting assistant — green-list watermark with δ = 2.0, key rotated monthly; exported drafts embed invisible provenance in JSON sidecar for internal audit.
Upload path — Binoculars score (Llama-class performer + smaller observer) plus length-normalized threshold; scores 0.85+ auto-flag for review, 0.95+ soft-reject with appeal form.
Human band — 0.40–0.85 routed to senior moderators with side-by-side human baseline examples; < 0.40 auto-accept unless other abuse signals fire.
Appeals — rejected authors submit revision; second pass ignores prior score if > 30% tokens changed.
Metrics — track precision/recall weekly on 200 fresh human-labeled samples; retrain classifier quarterly on new generator outputs from synthetic data pipelines.

Spam volume fell 71%; appeals upheld 18% of soft-rejects (mostly ESL writers); in-house watermarked assistant usage rose because creators trusted disclosure would not penalize platform-native drafts.
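The score bands and the appeal rule above translate into a small routing function. This sketch assumes the score is oriented so that higher means more likely synthetic (as the thresholds imply), measures changed tokens with a whitespace split and `difflib`, and sends low-scoring items with other abuse signals to review; those three choices and the route labels are assumptions, not Harbor's documented behavior.

```python
import difflib


def route(score: float, other_abuse_signals: bool = False) -> str:
    """Map a detector score to a moderation action using the bands above."""
    if score >= 0.95:
        return "soft_reject_with_appeal"
    if score >= 0.85:
        return "flag_for_review"
    if score >= 0.40:
        return "senior_moderator_review"
    # Below 0.40: accept unless some other abuse signal fired (assumed fallback).
    return "flag_for_review" if other_abuse_signals else "auto_accept"


def changed_token_fraction(original: str, revised: str) -> float:
    """Share of tokens not preserved in order between original and revision."""
    a, b = original.split(), revised.split()
    if not a and not b:
        return 0.0
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(a), len(b))


def ignore_prior_score(original: str, revised: str) -> bool:
    """Appeal rule: rescore from scratch when more than 30% of tokens changed."""
    return changed_token_fraction(original, revised) > 0.30
```

Keeping the thresholds in one function makes them easy to version alongside the detector, so that a logged decision can always be replayed against the rules that were active at the time.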

Technique decision table

Approach	Best when	Weakness
Green-list watermark	You control the generator; need verifiable first-party marks	Useless on external paste; key leak breaks scheme
Binoculars / DetectGPT	Untrusted uploads; no API cooperation; moderate length text	Paraphrase evasion; compute cost; domain shift
Supervised classifier	High volume, fixed latency budget, known generator mix	Stale after new model releases; false positives on formulaic humans
C2PA provenance	Media assets in supported toolchain; regulatory disclosure	Stripped by reposts; limited plain-text adoption
Human review	High-stakes decisions, appeals, edge cases	Does not scale to millions of posts
Guardrails only	Blocking harmful content, not authorship	No synthetic vs human signal

Evaluation metrics

AUC-ROC on balanced human vs synthetic holdout (report per length bucket: < 100, 100–500, 500+ tokens).
False-positive rate at operational threshold — optimize for human harm, not headline accuracy.
Detection rate after paraphrase attack — run a standard paraphrase model on synthetic set; expect large drops.
Watermark detectability (z-score) at fixed false-positive rate on non-watermarked human text.
Perplexity delta from watermark bias on your quality eval set.
Appeal uphold rate — operational proxy for moderator trust.
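The first two metrics can be computed with the standard library alone. The sketch below assumes scores where higher means more likely synthetic and a flag rule of `score >= threshold`; the function names are illustrative.

```python
from typing import Sequence


def false_positive_rate(human_scores: Sequence[float], threshold: float) -> float:
    """Share of human-written samples that would be flagged at this threshold."""
    return sum(s >= threshold for s in human_scores) / len(human_scores)


def detection_rate(synthetic_scores: Sequence[float], threshold: float) -> float:
    """Share of synthetic samples flagged at this threshold (recall)."""
    return sum(s >= threshold for s in synthetic_scores) / len(synthetic_scores)


def auc_roc(human_scores: Sequence[float], synthetic_scores: Sequence[float]) -> float:
    """Probability that a random synthetic sample outscores a random human one (ties count half)."""
    wins = 0.0
    for s in synthetic_scores:
        for h in human_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(human_scores) * len(synthetic_scores))


def threshold_for_fpr(human_scores: Sequence[float], target_fpr: float) -> float:
    """Lowest observed score usable as a threshold with empirical FPR <= target."""
    for candidate in sorted(set(human_scores)):
        if false_positive_rate(human_scores, candidate) <= target_fpr:
            return candidate
    return float("inf")  # no observed score meets the target; flag nothing
```

Running `detection_rate` on the same synthetic set before and after a paraphrase pass, at the threshold returned by `threshold_for_fpr`, gives the post-attack number from the list above. Computing everything separately per length bucket keeps short-text weakness from being averaged away.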

Common pitfalls

Auto-banning on detector score guarantees PR crises and discriminates against non-native writers; always route scores above the threshold to human review.
Training detectors only on GPT-4 outputs — open-weight models with different style will sail through.
Ignoring text length — scores are unreliable below ~50 tokens; skip or merge short comments.
Publishing watermark keys or bias parameters — removes security margin immediately.
Conflating synthetic detection with plagiarism — original LLM prose is still synthetic; copy-paste from humans is a different signal.
No appeal path — detectors err; contests and marketplaces need revision workflows.
Skipping locale stratification — calibrate thresholds per language or accept higher FPR in low-resource locales.
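The length and locale pitfalls can both be handled by a gate in front of the detector. In this sketch the 50-token floor comes from the guidance above, while the per-locale thresholds and the stricter fallback for uncalibrated locales are hypothetical values chosen only for illustration.

```python
from typing import Dict, Optional

MIN_TOKENS = 50  # below this, scores are treated as unreliable
LOCALE_THRESHOLDS: Dict[str, float] = {"en": 0.85, "es": 0.90}  # hypothetical calibration
UNCALIBRATED_THRESHOLD = 0.95  # assumed stricter fallback to limit false positives


def threshold_for(token_count: int, locale: str) -> Optional[float]:
    """Return the flagging threshold, or None when the text is too short to score."""
    if token_count < MIN_TOKENS:
        return None
    return LOCALE_THRESHOLDS.get(locale, UNCALIBRATED_THRESHOLD)


def should_flag(score: float, token_count: int, locale: str) -> bool:
    """Flag for human review only; this never bans or deletes on its own."""
    threshold = threshold_for(token_count, locale)
    return threshold is not None and score >= threshold
```

Returning `None` for short texts forces the caller to make an explicit choice (skip, or merge with the author's other comments) instead of silently trusting a noisy score.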

Production checklist

Define policy: what counts as machine-generated vs assisted vs human.
Watermark first-party generators with rotated keys and versioned schemes.
Choose post-hoc detector (Binoculars, classifier, or ensemble) per threat model.
Calibrate thresholds on stratified human validation set; target FPR < 2%.
Route borderline scores to human review; never hard-delete on score alone.
Log scores, detector version, and watermark key ID for appeals.
Red-team with paraphrase, translation, and adversarial suffix attacks monthly.
Refresh synthetic training corpora when major new models launch.
Disclose labeling policy to users; offer assisted-draft badges for in-app generation.
Pair detection with content-quality and abuse guardrails, not as replacement.

Key takeaways

Green-list watermarking biases token sampling toward a secret per-context vocabulary split; verification requires the key.
Post-hoc detectors (Binoculars, DetectGPT, classifiers) score untrusted text but lose to paraphrase and shift when new models appear.
C2PA provenance complements statistical methods for media pipelines where credentials are preserved.
Operational success depends on calibrated thresholds, human appeals, and never treating detection as ground truth.
Harbor Media cut synthetic spam 71% with watermarked in-app drafts plus Binoculars scoring on uploads and a human review band.