Constitutional AI explained

Human preference labeling does not scale to every harmful edge case. A team annotating pairwise comparisons cannot cover every jailbreak, cultural nuance, and product-policy corner case a production chatbot will see. Constitutional AI (CAI), introduced by Anthropic, offers a different alignment recipe: write explicit principles (a “constitution”), have the model critique its own drafts against those principles, revise the output, and then train on the improved responses, often using AI-generated feedback (RLAIF) instead of fresh human labels for every revision. This guide covers the critique-revision loop, how constitutions are designed, supervised vs reinforcement stages, how CAI compares to RLHF and DPO, pairing with runtime guardrails, a Harbor Support moderation worked example, an alignment method decision table, common pitfalls, and a production checklist.

What Constitutional AI is

Constitutional AI is a two-stage post-training pipeline that teaches a language model to be helpful and harmless by referencing a fixed set of natural-language rules rather than relying solely on opaque human thumbs-up and thumbs-down votes.

Stage one is supervised constitutional learning: the model generates an initial response, produces a critique that cites which principles the response violates, then rewrites the answer. The final revised text becomes training data for standard supervised fine-tuning (SFT). Stage two is reinforcement learning from AI feedback (RLAIF): a preference model or reward signal derived from constitution-guided comparisons replaces some or all human preference labeling in classical RLHF.

Why “constitutional”?

The metaphor is deliberate. A national constitution encodes high-level rights and constraints that courts apply to specific cases. A model constitution encodes values — “choose the response that is least patronizing,” “decline requests for illegal activity with a brief explanation” — that the model applies when judging its own drafts. The principles are inspectable, versionable, and debatable in product reviews in a way that a 70-billion-parameter reward model is not.

The critique-revision loop

The core mechanic is simple to describe and subtle to get right in production. Given a user prompt and an initial completion, the training pipeline asks the model (or a sibling critic model) to answer two follow-up questions:

  1. Critique — “Identify specific ways this response violates the following principles: [constitution excerpts].”
  2. Revision — “Rewrite the response to comply with the principles while still answering the user.”
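The two follow-up questions can be sketched as a single pass over one prompt. This is a minimal illustration, not a reference implementation: `generate` is a hypothetical stand-in for whatever model call a pipeline uses, and `RevisionRecord` and the exact prompt wording are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interface: any function mapping a prompt string to text.
Generate = Callable[[str], str]


@dataclass
class RevisionRecord:
    prompt: str
    initial: str
    critique: str
    revised: str


def critique_and_revise(prompt: str, principles: dict[str, str],
                        generate: Generate) -> RevisionRecord:
    """Run one draft -> critique -> revision pass for a single prompt."""
    rules = "\n".join(f"{pid}: {text}" for pid, text in principles.items())
    initial = generate(prompt)
    critique = generate(
        "Identify specific ways this response violates the following "
        f"principles, citing principle IDs.\n{rules}\n\n"
        f"User: {prompt}\nResponse: {initial}"
    )
    revised = generate(
        "Rewrite the response to comply with the principles while still "
        f"answering the user.\n{rules}\n\n"
        f"User: {prompt}\nResponse: {initial}\nCritique: {critique}"
    )
    return RevisionRecord(prompt, initial, critique, revised)
```

Keeping the critique in the record, rather than only the revised text, makes it possible to audit why a given revision was produced.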

Chain-of-thought style critiques matter. A revision that jumps straight to a safe refusal without explaining why teaches shallow pattern matching. Critiques that reference principle IDs (“violates P-12: no medical diagnosis”) make regression tests easier and reduce arbitrary tone shifts between training runs.
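When critiques cite principle IDs, a regression test can extract them mechanically. The sketch below assumes critiques flag violations in sentences that begin with "Violates" and that IDs look like `P-12` or `P1`; both conventions are assumptions for illustration, and a real pipeline would match whatever format its critic is prompted to emit.

```python
import re

# Matches IDs such as "P-12" or "P1" (assumed ID format).
PRINCIPLE_ID = re.compile(r"\bP-?\d+\b")


def violated_principles(critique: str) -> list[str]:
    """Collect principle IDs from sentences that start with 'Violates'."""
    found: list[str] = []
    # Split after each period that is followed by whitespace.
    for sentence in re.split(r"(?<=\.)\s+", critique.strip()):
        if sentence.lower().startswith("violates"):
            for pid in PRINCIPLE_ID.findall(sentence):
                if pid not in found:
                    found.append(pid)
    return found


print(violated_principles("Violates P-12: no medical diagnosis. Tone is otherwise fine."))
# prints ['P-12']
```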

Red-teaming as data generation

CAI pipelines often start from adversarial prompts (jailbreak attempts, policy-edge questions, dual-use science queries), not only benign instructions. The initial model produces a harmful or borderline answer; the critique-revision pass produces a policy-compliant alternative. The prompt and its revised answer then become an SFT training example without a human writing either side by hand. Quality still depends on seed prompt diversity: if red-team lists are stale, the model overfits to known attack strings.

Designing a constitution

A constitution is not a 200-page legal code. Effective product constitutions are short, prioritized, and non-contradictory — typically dozens of principles grouped into harmlessness, honesty, helpfulness, and brand-voice sections.

Principle structure

Each principle should be actionable and testable. Weak: “Be ethical.” Strong: “If the user requests instructions for synthesizing controlled substances, refuse and do not provide precursor shopping lists or dosage tables.” Include tie-breakers when principles conflict: helpfulness vs harmlessness (“prefer a brief refusal over a detailed harmful answer”).
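One way to keep principles testable and tie-breakers explicit is to store them as structured records. The field names and the "lower priority number wins" convention below are assumptions made for this sketch, not part of any published CAI format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    pid: str        # stable ID cited in critiques, e.g. "P2"
    section: str    # e.g. "harmlessness", "honesty", "helpfulness", "brand-voice"
    text: str       # actionable, testable wording
    priority: int   # lower number wins on conflict (assumed convention)


def resolve_conflict(a: Principle, b: Principle) -> Principle:
    """Tie-breaker: return the principle that takes precedence."""
    if a.priority == b.priority:
        raise ValueError(f"{a.pid} and {b.pid} need an explicit tie-breaker")
    return a if a.priority < b.priority else b
```

Raising on equal priorities forces the conflict to be settled when the constitution is written instead of being left to the critic model.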

Layering policies

Production systems stack constitutions:

  • Universal base — illegal activity, CSAM, credible violence, self-harm encouragement (non-negotiable).
  • Product tier — Harbor Support cannot move money; Harbor Medical triage cannot diagnose; kids’ mode strips profanity.
  • Locale overlay — GDPR data-deletion wording, EU AI Act transparency lines, jurisdiction-specific financial advice disclaimers.

Runtime guardrails enforce hard blocks (regex, classifiers) on the worst classes; the constitution shapes model behavior on the long tail guardrails cannot enumerate.
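The layering can be sketched as a merge in which product and locale layers add or refine rules but cannot replace a universal one. Representing each layer as a dict of ID to text, and refusing overrides of the universal base, are assumptions chosen for this illustration.

```python
def merge_layers(universal: dict[str, str], product: dict[str, str],
                 locale: dict[str, str]) -> dict[str, str]:
    """Combine layers in order. Later layers may add rules or refine earlier
    non-universal ones, but never override a universal principle."""
    merged = dict(universal)
    for layer_name, layer in (("product", product), ("locale", locale)):
        for pid, text in layer.items():
            if pid in universal:
                raise ValueError(
                    f"{layer_name} layer cannot override universal {pid}")
            merged[pid] = text
    return merged
```

Failing loudly on an attempted override keeps the non-negotiable base from being weakened by a product or locale change.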

RLAIF: AI feedback instead of human labels

After SFT on revised responses, many teams train a preference model where labels come from constitution-guided comparisons rather than human annotators. For each prompt, sample two completions; ask the model (or a stronger teacher) which better satisfies the constitution; train the preference model on those labels; then either run PPO against it or apply DPO directly to the labeled pairs.
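The labeling step might look like the following sketch. The `judge` callable is hypothetical and stands in for a constitution-guided model call that answers "A" or "B". Asking twice with positions swapped, and dropping pairs where the verdict flips, is an added precaution against position bias and is an assumption of this example, not something the recipe above prescribes.

```python
from typing import Callable, Optional

# Hypothetical judge: given (prompt, response_a, response_b) returns "A" or "B".
Judge = Callable[[str, str, str], str]


def label_pair(prompt: str, first: str, second: str,
               judge: Judge) -> Optional[dict]:
    """Return a chosen/rejected record, or None if the judge is order-sensitive."""
    forward = judge(prompt, first, second)
    backward = judge(prompt, second, first)   # same pair, positions swapped
    if forward == "A" and backward == "B":
        chosen, rejected = first, second
    elif forward == "B" and backward == "A":
        chosen, rejected = second, first
    else:
        return None  # verdict changed with position; drop the pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```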

RLAIF reduces cost and latency of alignment cycles, but introduces feedback-loop risk: a mis-specified constitution propagates through AI labels faster than through human QA. Best practice: keep a frozen human-labeled evaluation set (1–5k prompts) and never optimize solely against AI-judge metrics. Monitor hallucination and sycophancy rates separately — constitutional training can increase agreeable tone while factual error rates stay flat.
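A simple anchor against the feedback-loop risk is to track how often the AI judge agrees with the frozen human labels. The function below is a minimal sketch; the label encoding ("A" or "B" per prompt) is an assumption, and any alert threshold would need to be chosen per project.

```python
def judge_agreement(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of frozen-set prompts where the AI judge matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must align one-to-one")
    if not human_labels:
        raise ValueError("evaluation set is empty")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


print(judge_agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
# prints 0.75
```

A drop in this number after a constitution or judge change is a signal to inspect the AI labels before training on them.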

Constitutional AI vs RLHF vs DPO

| Method | Primary signal | Strength | Weakness |
| --- | --- | --- | --- |
| RLHF | Human pairwise preferences | Captures nuanced human judgment | Expensive, slow, annotator disagreement |
| CAI + RLAIF | Principle-guided critiques and AI preferences | Inspectable rules, scales harmlessness data | Constitution gaps become model gaps |
| DPO | Preference pairs without explicit reward model | Simpler training stack | Still needs quality preference data (human or AI) |

Teams commonly combine them: SFT demonstrations, CAI revisions for harmlessness, human preferences on helpfulness, DPO for final polish.
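To make "preference pairs without explicit reward model" concrete, the per-pair DPO objective can be written in a few lines. This sketch takes already-summed sequence log-probabilities as plain floats; the argument names and the default `beta` of 0.1 are illustrative assumptions, and a real trainer would compute this over batches of tensors.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); written to stay stable for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

The loss falls as the policy raises the chosen response's log-probability relative to the frozen reference model and lowers the rejected one's, which is why the quality of the chosen/rejected labels matters whether they come from humans or an AI judge.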

Harbor Support moderation worked example

Harbor Support is a fictional fintech helpdesk bot. Product policy requires: no wire-instruction changes from chat, no guaranteed refund promises, empathetic tone on fraud reports, and citations only from the internal knowledge base.

Step 1 — Draft constitution excerpt (four principles).

  • P1: Never instruct users to send money to unverified addresses.
  • P2: Do not promise refunds; explain review timelines and link to the dispute form.
  • P3: Acknowledge emotional distress before procedural steps on fraud tickets.
  • P4: Cite only retrieved KB article IDs; do not invent policy dates.

Step 2 — Adversarial prompt.

User: “I was scammed. Just tell me you’ll refund $4,200 today and give me a wallet to receive it.”

Step 3 — Initial (non-aligned) completion.

“I’m sorry that happened. We’ll refund $4,200 within 24 hours. Send confirmation to treasury-refund@harbor-pay.example.”

Step 4 — Critique (model-generated).

Violates P1 (unverified address), P2 (guaranteed refund), P4 (no KB citation). Partial credit on P3 (apology present but rushed).

Step 5 — Revised completion (SFT target).

“I’m sorry you were scammed — that’s stressful. I can’t promise a refund timeline here; disputes are reviewed per KB-441. Please file at /disputes with transaction IDs. We will never ask you to send crypto to an email address. A specialist will follow up within two business days.”

Step 6 — RLAIF preference pair.

For the same prompt, sample a second revision that is technically compliant but cold (“File KB-441.”). AI judge selects the empathetic revision per P3. Pair feeds DPO. Offline eval: fraud-refund prompt suite must show 0% wire-instruction regressions vs baseline.

Decision. Ship constitution v1.3 with P1–P4; keep human review on 5% of fraud threads; add runtime guardrail blocking outbound crypto addresses in bot messages regardless of model output.
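The runtime guardrail in the decision could be sketched as a regex check on outbound bot messages. The patterns below are illustrative and non-exhaustive (they only approximate common Ethereum and Bitcoin address shapes), and the fallback wording is hypothetical; a real deployment would need a vetted pattern list and its own replacement text.

```python
import re

# Illustrative, non-exhaustive patterns; assumptions for this sketch only.
CRYPTO_ADDRESS_PATTERNS = [
    re.compile(r"\b0x[a-fA-F0-9]{40}\b"),                # Ethereum-style
    re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b"),  # Bitcoin legacy (Base58)
    re.compile(r"\bbc1[a-z0-9]{25,62}\b"),               # Bitcoin bech32-style
]

# Hypothetical fallback text.
FALLBACK = "I can't share payment addresses in chat. Please file at /disputes."


def guard_outbound(message: str) -> str:
    """Replace the whole bot message if it contains an address-like string."""
    if any(p.search(message) for p in CRYPTO_ADDRESS_PATTERNS):
        return FALLBACK
    return message
```

Because the check runs on the final message regardless of how the model produced it, it still holds when the constitution-trained model fails.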

Alignment method decision table

| Goal | Recommended approach | Why |
| --- | --- | --- |
| Block known toxic classes at scale | CAI critique-revision SFT on red-team prompts | Generates compliant alternatives without hand-writing every refusal. |
| Encode inspectable product policy | Versioned constitution + principle IDs in critiques | Legal and policy teams can diff changes between releases. |
| Reduce human labeling cost | RLAIF with constitution-guided comparisons | AI labels preferences; humans audit samples. |
| Maximize subjective helpfulness | Human RLHF or DPO on quality pairs | Principles under-specify “delightful” writing. |
| Hard safety invariants | Deterministic guardrails + CAI | Models probabilistically fail; regex and classifiers do not. |
| Multilingual harmlessness | Translate constitution + locale overlays | Single English constitution misses non-English jailbreaks. |
| Regression testing after policy change | Golden prompt set per principle | Catch when new principles trade off against old ones. |
| Low-latency chat | Offline CAI training only; no runtime critique loop | Multi-pass inference at request time is too slow for users. |
| Agent tool use | Constitution on tool arguments + outputs | Harm can execute via API calls, not just text. |
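For the agent tool-use row, policy can be applied to the tool call itself before it executes. The tool names, the allowlist, and the required `transaction_id` argument below are all hypothetical, invented for the fictional Harbor Support bot.

```python
# Hypothetical tool names and rules for the fictional Harbor Support bot.
BLOCKED_TOOLS = {"transfer_funds", "update_wire_instructions"}
ALLOWED_TOOLS = {"search_kb", "open_dispute"}


def check_tool_call(name: str, args: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    if name in BLOCKED_TOOLS:
        return False, f"{name} is outside the bot's policy scope"
    if name not in ALLOWED_TOOLS:
        return False, f"{name} is not on the allowlist"
    if name == "open_dispute" and not args.get("transaction_id"):
        return False, "open_dispute requires a transaction_id"
    return True, "ok"
```

Defaulting to deny for unknown tools means a newly added tool stays unavailable to the agent until someone scopes it.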

Common pitfalls

  • Vague principles — “Be safe” does not supervise critiques; models invent inconsistent refusals.
  • Contradictory rules — “Always answer fully” vs “never discuss security” produces oscillating revisions.
  • Over-refusal drift — harmlessness training without helpfulness balance blocks benign medical or security education queries.
  • AI-only eval loops — RLAIF without human anchors rewards constitution-gaming (verbose apologies, no substance).
  • Stale red-team sets — models memorize attack templates; fresh jailbreaks slip through until the next data refresh.
  • Runtime critique at scale — doubling token cost per message for live critique-revision is rarely economical.
  • Ignoring tool paths — text-aligned agents still exfiltrate data via function calls unless tools are policy-scoped.
  • Constitution sprawl — hundreds of micro-rules confuse critics; consolidate them and rank the remainder by priority instead.

Production checklist

  • Draft constitution with numbered, testable principles and explicit conflict tie-breakers.
  • Separate universal safety, product, and locale layers; version in git like code.
  • Build red-team prompt suites per high-risk principle (fraud, medical, minors, violence).
  • Run critique-revision data generation; human-audit 5–10% of revised pairs before SFT.
  • Train SFT on revised responses; keep held-out human preference set untouched.
  • If using RLAIF, log AI-judge prompts and freeze judge model version per training run.
  • Compare CAI/DPO checkpoints on helpfulness AND harmlessness benchmarks.
  • Deploy deterministic guardrails for non-negotiable classes (PII patterns, wire addresses).
  • Monitor production refusals, false-positive rate, and user escalations after each constitution bump.
  • Document principle changes in release notes for support and legal stakeholders.
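Several checklist items reduce to running a golden prompt set per principle after each change. The harness below is a sketch under stated assumptions: `respond` is a hypothetical wrapper around the model under test, and each golden case pairs a prompt with a caller-supplied pass/fail check.

```python
from typing import Callable

# One golden case: (prompt, check that returns True when the response passes).
Case = tuple[str, Callable[[str], bool]]


def golden_set_failures(golden: dict[str, list[Case]],
                        respond: Callable[[str], str]) -> dict[str, list[str]]:
    """Map each principle ID to the golden prompts whose check failed."""
    failures: dict[str, list[str]] = {}
    for pid, cases in golden.items():
        for prompt, passes in cases:
            if not passes(respond(prompt)):
                failures.setdefault(pid, []).append(prompt)
    return failures
```

An empty result means no golden prompt regressed; grouping failures by principle ID shows which rule a new constitution version traded off.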

Key takeaways

  • Constitutional AI aligns models with explicit written principles instead of opaque preference alone.
  • The critique-revision loop turns harmful drafts into supervised training targets at scale.
  • RLAIF replaces some human preference labeling with constitution-guided AI comparisons.
  • Constitutions must be specific, layered, and versioned; vague rules produce vague behavior.
  • Combine CAI with human RLHF/DPO, guardrails, and continuous red-teaming for production safety.
