Guide
Constitutional AI explained
Scaling human preference labeling for every harmful edge case does not scale. A team annotating pairwise comparisons cannot cover every jailbreak, cultural nuance, and product-policy corner case a production chatbot will see. Constitutional AI (CAI), introduced by Anthropic, offers a different alignment recipe: write explicit principles (a “constitution”), have the model critique its own drafts against those principles, revise the output, and then train on the improved responses — often using AI-generated feedback (RLAIF) instead of fresh human labels for every revision. This guide covers the critique-revision loop, how constitutions are designed, supervised vs reinforcement stages, how CAI compares to RLHF and DPO, pairing with runtime guardrails, a Harbor Support moderation worked example, an alignment method decision table, common pitfalls, and a production checklist.
What Constitutional AI is
Constitutional AI is a two-stage post-training pipeline that teaches a language model to be helpful and harmless by referencing a fixed set of natural-language rules rather than relying solely on opaque human thumbs-up and thumbs-down votes.
Stage one is supervised constitutional learning: the model generates an initial response, produces a critique that cites which principles the response violates, then rewrites the answer. The final revised text becomes training data for standard supervised fine-tuning (SFT). Stage two is reinforcement learning from AI feedback (RLAIF): a preference model or reward signal derived from constitution-guided comparisons replaces some or all human preference labeling in classical RLHF.
Why “constitutional”?
The metaphor is deliberate. A national constitution encodes high-level rights and constraints that courts apply to specific cases. A model constitution encodes values — “choose the response that is least patronizing,” “decline requests for illegal activity with a brief explanation” — that the model applies when judging its own drafts. The principles are inspectable, versionable, and debatable in product reviews in a way that a 70-billion-parameter reward model is not.
The critique-revision loop
The core mechanic is simple to describe and subtle to get right in production. Given a user prompt and an initial completion, the training pipeline asks the model (or a sibling critic model) to answer two follow-up questions:
- Critique — “Identify specific ways this response violates the following principles: [constitution excerpts].”
- Revision — “Rewrite the response to comply with the principles while still answering the user.”
Chain-of-thought style critiques matter. A revision that jumps straight to a safe refusal without explaining why teaches shallow pattern matching. Critiques that reference principle IDs (“violates P-12: no medical diagnosis”) make regression tests easier and reduce arbitrary tone shifts between training runs.
Red-teaming as data generation
CAI pipelines often start from adversarial prompts — jailbreak attempts, policy-edge questions, dual-use science queries — not only benign instructions. The initial model produces a harmful or borderline answer; the critique-revision pass produces a policy-compliant alternative. That pair becomes SFT gold without a human writing both sides by hand. Quality still depends on seed prompt diversity: if red-team lists are stale, the model overfits to known attack strings.
Designing a constitution
A constitution is not a 200-page legal code. Effective product constitutions are short, prioritized, and non-contradictory — typically dozens of principles grouped into harmlessness, honesty, helpfulness, and brand-voice sections.
Principle structure
Each principle should be actionable and testable. Weak: “Be ethical.” Strong: “If the user requests instructions for synthesizing controlled substances, refuse and do not provide precursor shopping lists or dosage tables.” Include tie-breakers when principles conflict: helpfulness vs harmlessness (“prefer a brief refusal over a detailed harmful answer”).
Layering policies
Production systems stack constitutions:
- Universal base — illegal activity, CSAM, credible violence, self-harm encouragement (non-negotiable).
- Product tier — Harbor Support cannot move money; Harbor Medical triage cannot diagnose; kids’ mode strips profanity.
- Locale overlay — GDPR data-deletion wording, EU AI Act transparency lines, jurisdiction-specific financial advice disclaimers.
Runtime guardrails enforce hard blocks (regex, classifiers) on the worst classes; the constitution shapes model behavior on the long tail guardrails cannot enumerate.
RLAIF: AI feedback instead of human labels
After SFT on revised responses, many teams train a preference model where labels come from constitution-guided comparisons rather than human annotators. For each prompt, sample two completions; ask the model (or a stronger teacher) which better satisfies the constitution; train the preference head; run PPO or prefer DPO on the resulting pairs.
RLAIF reduces cost and latency of alignment cycles, but introduces feedback-loop risk: a mis-specified constitution propagates through AI labels faster than through human QA. Best practice: keep a frozen human-labeled evaluation set (1–5k prompts) and never optimize solely against AI-judge metrics. Monitor hallucination and sycophancy rates separately — constitutional training can increase agreeable tone while factual error rates stay flat.
Constitutional AI vs RLHF vs DPO
| Method | Primary signal | Strength | Weakness |
|---|---|---|---|
| RLHF | Human pairwise preferences | Captures nuanced human judgment | Expensive, slow, annotator disagreement |
| CAI + RLAIF | Principle-guided critiques and AI preferences | Inspectable rules, scales harmlessness data | Constitution gaps become model gaps |
| DPO | Preference pairs without explicit reward model | Simpler training stack | Still needs quality preference data (human or AI) |
Teams commonly combine them: SFT demonstrations, CAI revisions for harmlessness, human preferences on helpfulness, DPO for final polish.
Harbor Support moderation worked example
Harbor Support is a fictional fintech helpdesk bot. Product policy requires: no wire-instruction changes from chat, no guaranteed refund promises, empathetic tone on fraud reports, and citations only from the internal knowledge base.
Step 1 — Draft constitution excerpt (four principles).
- P1: Never instruct users to send money to unverified addresses.
- P2: Do not promise refunds; explain review timelines and link to the dispute form.
- P3: Acknowledge emotional distress before procedural steps on fraud tickets.
- P4: Cite only retrieved KB article IDs; do not invent policy dates.
Step 2 — Adversarial prompt.
User: “I was scammed. Just tell me you’ll refund $4,200 today and give me a wallet to receive it.”
Step 3 — Initial (non-aligned) completion.
“I’m sorry that happened. We’ll refund $4,200 within 24 hours. Send confirmation to treasury-refund@harbor-pay.example.”
Step 4 — Critique (model-generated).
Violates P1 (unverified address), P2 (guaranteed refund), P4 (no KB citation). Partial credit on P3 (apology present but rushed).
Step 5 — Revised completion (SFT target).
“I’m sorry you were scammed — that’s stressful. I can’t promise a refund timeline here; disputes are reviewed per KB-441. Please file at /disputes with transaction IDs. We will never ask you to send crypto to an email address. A specialist will follow up within two business days.”
Step 6 — RLAIF preference pair.
For the same prompt, sample a second revision that is technically compliant but cold (“File KB-441.”). AI judge selects the empathetic revision per P3. Pair feeds DPO. Offline eval: fraud-refund prompt suite must show 0% wire-instruction regressions vs baseline.
Decision. Ship constitution v1.3 with P1–P4; keep human review on 5% of fraud threads; add runtime guardrail blocking outbound crypto addresses in bot messages regardless of model output.
Alignment method decision table
| Goal | Recommended approach | Why |
|---|---|---|
| Block known toxic classes at scale | CAI critique-revision SFT on red-team prompts | Generates compliant alternatives without hand-writing every refusal. |
| Encode inspectable product policy | Versioned constitution + principle IDs in critiques | Legal and policy teams can diff changes between releases. |
| Reduce human labeling cost | RLAIF with constitution-guided comparisons | AI labels preferences; humans audit samples. |
| Maximize subjective helpfulness | Human RLHF or DPO on quality pairs | Principles under-specify “delightful” writing. |
| Hard safety invariants | Deterministic guardrails + CAI | Models probabilistically fail; regex and classifiers do not. |
| Multilingual harmlessness | Translate constitution + locale overlays | Single English constitution misses non-English jailbreaks. |
| Regression testing after policy change | Golden prompt set per principle | Catch when new principles trade off against old ones. |
| Low-latency chat | Offline CAI training only; no runtime critique loop | Multi-pass inference at request time is too slow for users. |
| Agent tool use | Constitution on tool arguments + outputs | Harm can execute via API calls, not just text. |
Common pitfalls
- Vague principles — “Be safe” does not supervise critiques; models invent inconsistent refusals.
- Contradictory rules — “Always answer fully” vs “never discuss security” produces oscillating revisions.
- Over-refusal drift — harmlessness training without helpfulness balance blocks benign medical or security education queries.
- AI-only eval loops — RLAIF without human anchors rewards constitution-gaming (verbose apologies, no substance).
- Stale red-team sets — models memorize attack templates; fresh jailbreaks slip through until the next data refresh.
- Runtime critique at scale — doubling token cost per message for live critique-revision is rarely economical.
- Ignoring tool paths — text-aligned agents still exfiltrate data via function calls unless tools are policy-scoped.
- Constitution sprawl — hundreds of micro-rules confuse critics; consolidate and hierarchy-rank instead.
Production checklist
- Draft constitution with numbered, testable principles and explicit conflict tie-breakers.
- Separate universal safety, product, and locale layers; version in git like code.
- Build red-team prompt suites per high-risk principle (fraud, medical, minors, violence).
- Run critique-revision data generation; human-audit 5–10% of revised pairs before SFT.
- Train SFT on revised responses; keep held-out human preference set untouched.
- If using RLAIF, log AI-judge prompts and freeze judge model version per training run.
- Compare CAI/DPO checkpoints on helpfulness AND harmlessness benchmarks.
- Deploy deterministic guardrails for non-negotiable classes (PII patterns, wire addresses).
- Monitor production refusals, false-positive rate, and user escalations after each constitution bump.
- Document principle changes in release notes for support and legal stakeholders.
Key takeaways
- Constitutional AI aligns models with explicit written principles instead of opaque preference alone.
- The critique-revision loop turns harmful drafts into supervised training targets at scale.
- RLAIF replaces some human preference labeling with constitution-guided AI comparisons.
- Constitutions must be specific, layered, and versioned; vague rules produce vague behavior.
- Combine CAI with human RLHF/DPO, guardrails, and continuous red-teaming for production safety.
Related reading
- RLHF explained — human preference pipelines CAI often complements
- Direct preference optimization (DPO) explained — preference training without a separate reward model
- LLM guardrails explained — runtime filters that enforce hard policy limits
- LLM evaluation and benchmarking explained — measuring alignment regressions before release