
LLM multi-agent debate explained

Harbor Analytics' overnight fraud cascade flagged a merchant for “likely synthetic identity” with 0.91 confidence. A single GPT-4-class reviewer read the transaction graphs and KYC fields once, cited two weak signals, and recommended an account freeze. Compliance reopened the case: the “synthetic” label came from a name transliteration mismatch, not fraud. The team replaced the one-shot reviewer with a multi-agent debate pipeline inspired by Du et al.’s 2023 work on improving factuality and reasoning through multiagent debate: a proponent argued fraud, a critic argued benign explanations, and a moderator scored which side cited verifiable evidence from the case file. After two rounds (openings, then rebuttals), the moderator downgraded the case to manual review instead of auto-freeze. On held-out fraud tickets, false-positive auto-freezes dropped 34% while recall fell by 1.2 percentage points.

Multi-agent debate is an inference-time pattern where multiple LLM agents take assigned roles, exchange arguments over several rounds, and either converge on an answer or defer to a judge model. Unlike parallel draft-and-merge approaches, debate forces explicit disagreement: each agent sees others' claims and must rebut or concede. This guide covers debate protocols (symmetric vs adversarial, round structure, stopping rules), moderator and judge design, pairing with tools and RAG, the Harbor Analytics refactor, a technique decision table vs mixture of agents and chain-of-verification, pitfalls, and a production checklist.

What multi-agent debate is

In a standard single-model call, one generation pass produces an answer, and nothing in that pass challenges errors that look locally plausible. Debate introduces social pressure in the transcript: agents must respond to counterarguments, which surfaces missing evidence and arithmetic mistakes that a lone model skips.

A minimal debate loop has three phases per round:

Opening statements — each agent states its position and supporting evidence (from prompt context, retrieved documents, or tool outputs).
Rebuttals — agents read all prior statements and attack weak claims, cite contradictory facts, or concede points.
Round summary (optional) — a lightweight summarizer compresses the round into disputed facts vs agreed facts before the next round.

After R rounds, a moderator or LLM judge reads the full debate transcript and emits the final label, answer, or confidence score. Research variants include symmetric debate (all agents argue their own sampled answer) and assigned-role debate (one must defend “fraud,” one must defend “legitimate,” regardless of prior belief).
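The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Turn` record, the agent and judge signatures, and the stub functions are all assumptions made for the example, and a real agent would wrap a model API call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Turn:
    round_no: int
    phase: str    # "opening", "rebuttal", or "summary"
    speaker: str
    text: str


# Hypothetical signatures chosen for this sketch.
Agent = Callable[[str, str, str, List[Turn]], str]  # (role, phase, task, visible turns)
Judge = Callable[[str, List[Turn]], str]
Summarizer = Callable[[List[Turn]], str]


def run_debate(task: str, agents: Dict[str, Agent], judge: Judge,
               rounds: int = 2,
               summarize: Optional[Summarizer] = None) -> Tuple[str, List[Turn]]:
    """Run R rounds of openings and rebuttals, then one judge pass."""
    transcript: List[Turn] = []
    for round_no in range(1, rounds + 1):
        for phase in ("opening", "rebuttal"):
            # Snapshot first: agents in the same phase speak in parallel,
            # so none of them sees a statement made during that phase.
            visible = list(transcript)
            for role, agent in agents.items():
                text = agent(role, phase, task, visible)
                transcript.append(Turn(round_no, phase, role, text))
        if summarize is not None:
            round_turns = [t for t in transcript if t.round_no == round_no]
            transcript.append(
                Turn(round_no, "summary", "summarizer", summarize(round_turns)))
    return judge(task, transcript), transcript


# Stubs standing in for real model calls.
def stub_agent(role: str, phase: str, task: str, visible: List[Turn]) -> str:
    return f"{role} {phase} after reading {len(visible)} turns"


def stub_judge(task: str, transcript: List[Turn]) -> str:
    return f"verdict based on {len(transcript)} turns"


verdict, transcript = run_debate(
    "Is merchant M-1 fraudulent?",
    {"proponent": stub_agent, "critic": stub_agent},
    stub_judge,
    rounds=2,
)
```

Passing a `summarize` callable appends one summary turn per round, which later rounds then see alongside the raw statements.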

Debate protocols and round design

Protocol choice matters more than adding a fourth debater. Common patterns:

Symmetric multi-sample debate

Each agent independently solves the task, then debates which solution is correct. Works well for math and logic where answers are discrete. Similar to self-consistency but replaces majority vote with argumentation — useful when the majority is wrong but one minority derivation is correct.
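A toy comparison shows why argumentation can beat a vote. In the sketch below, a mechanical step checker stands in for the rebuttal phase (in a real debate, other agents would do the challenging), and the task, the sampled derivations, and all function names are invented for illustration.

```python
import operator
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Step = Tuple[int, str, int, int]     # (left, operator, right, claimed result)
Candidate = Tuple[int, List[Step]]   # (final answer, derivation steps)
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def majority_vote(answers: Sequence[int]) -> int:
    """Self-consistency baseline: the most frequent answer wins."""
    return Counter(answers).most_common(1)[0][0]


def derivation_holds(steps: List[Step]) -> bool:
    """Stand-in for a rebuttal: recompute every step and compare."""
    return all(OPS[op](left, right) == claimed
               for left, op, right, claimed in steps)


def select_after_challenge(candidates: List[Candidate]) -> Optional[int]:
    """Vote only among answers whose derivations survive the challenge."""
    surviving = [answer for answer, steps in candidates
                 if derivation_holds(steps)]
    return majority_vote(surviving) if surviving else None


# Task: 17 * 6 + 8. Two samples share the same slip (17 * 6 = 112).
candidates = [
    (120, [(17, "*", 6, 112), (112, "+", 8, 120)]),
    (120, [(17, "*", 6, 112), (112, "+", 8, 120)]),
    (110, [(17, "*", 6, 102), (102, "+", 8, 110)]),
]
vote_answer = majority_vote([answer for answer, _ in candidates])
debate_answer = select_after_challenge(candidates)
```

Here the vote returns the majority's wrong answer (120), while the challenge step keeps only the minority derivation that checks out (110).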

Assigned adversarial roles

One agent must argue the prosecution case, one the defense, even if both models would initially agree. Prevents premature consensus on the wrong conclusion. Harbor Analytics used this for fraud: the critic was prompted to find benign explanations for every signal the proponent cited.
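Assigned roles are mostly a prompting concern. A minimal sketch of a role-specific prompt builder follows; the instruction wording and the `build_prompt` helper are illustrative assumptions, not Harbor Analytics' actual prompts.

```python
from typing import List

# Illustrative instruction text; tune the wording per task.
ROLE_INSTRUCTIONS = {
    "proponent": (
        "Argue the strongest case that the merchant is fraudulent. "
        "Cite a feature ID for every claim."
    ),
    "critic": (
        "Argue that the merchant is legitimate. Give a benign explanation "
        "for each signal the proponent cited. Cite a feature ID for every claim."
    ),
}


def build_prompt(role: str, case_summary: str,
                 opponent_statements: List[str]) -> str:
    """Assemble the prompt for one debater in one phase."""
    if role not in ROLE_INSTRUCTIONS:
        raise ValueError(f"unknown role: {role}")
    parts = [
        ROLE_INSTRUCTIONS[role],
        # The stance is assigned, not chosen, to prevent early consensus.
        "Hold this position even if you privately disagree.",
        "",
        "Case file:",
        case_summary,
    ]
    if opponent_statements:
        parts.append("")
        parts.append("Opponent statements to rebut or concede:")
        parts.extend(f"- {statement}" for statement in opponent_statements)
    return "\n".join(parts)


critic_prompt = build_prompt(
    "critic",
    "F-003: name mismatch between KYC record and card holder",
    ["F-003 shows a name mismatch typical of synthetic identities."],
)
```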

Moderator-in-the-loop

A stronger model interrupts long tangents, asks clarifying questions, and ends rounds when no new verifiable facts appear. Reduces token burn from repetitive rebuttals.

Tool-grounded debate

Debaters may only cite claims backed by tool results (SQL queries, calculator, policy search). The moderator rejects arguments without tool citations. Pairs naturally with agentic RAG when each side runs its own retrieval pass.
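The rejection rule can be enforced mechanically before the moderator reads anything. The sketch below assumes each claim carries the IDs of the tool calls that back it; the `Claim` type and the ID format are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Claim:
    speaker: str
    text: str
    tool_call_ids: Tuple[str, ...] = ()


def split_grounded(claims: Iterable[Claim],
                   logged_tool_calls: Set[str]) -> Tuple[List[Claim], List[Claim]]:
    """Separate claims backed by logged tool calls from everything else."""
    accepted: List[Claim] = []
    rejected: List[Claim] = []
    for claim in claims:
        # A claim is grounded only if it cites at least one tool call
        # and every cited ID actually appears in the tool log.
        grounded = (bool(claim.tool_call_ids)
                    and set(claim.tool_call_ids) <= logged_tool_calls)
        (accepted if grounded else rejected).append(claim)
    return accepted, rejected


tool_log = {"sql-1", "policy-7"}
claims = [
    Claim("proponent", "Chargeback rate tripled in 30 days.", ("sql-1",)),
    Claim("proponent", "This pattern is always fraud."),
    Claim("critic", "Policy allows transliterated names.", ("policy-7",)),
    Claim("critic", "A refund batch explains the spike.", ("sql-9",)),
]
accepted, rejected = split_grounded(claims, tool_log)
```

Only the accepted claims would be passed to the moderator; the rejected ones can be logged as unsupported.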

Typical production settings use 2–3 debaters and 2–4 rounds. Diminishing returns appear quickly; round three often repeats round two unless new tools are invoked between rounds.

Stopping rules and consensus

Debate without termination criteria runs until the context window fills. Practical stop conditions:

Agreement — all agents emit the same final answer (extracted via structured JSON). Fast but risky if they converge on a shared hallucination.
No-new-evidence — end when a round introduces zero new tool calls or citation IDs. Harbor Analytics used this plus a hard cap of three rounds.
Judge early exit — moderator scores confidence after each round; exit when confidence exceeds threshold or falls below escalation floor.
Budget cap — max tokens or dollars per case; degrade to single-shot judge if debate budget exhausts.
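The four stop conditions combine naturally into one check that runs after every round. The following is a sketch under stated assumptions: the `RoundState` fields, the threshold values, the token budget, and the precedence order are illustrative choices, with only the three-round cap taken from the Harbor Analytics setup.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RoundState:
    round_no: int
    final_answers: List[str]           # one extracted answer per agent
    new_evidence_ids: Set[str]         # tool calls or citation IDs first seen this round
    judge_confidence: Optional[float]  # None if the judge did not score this round
    tokens_used: int


def stop_reason(state: RoundState,
                max_rounds: int = 3,
                exit_above: float = 0.9,
                escalate_below: float = 0.3,
                token_budget: int = 20_000) -> Optional[str]:
    """Return why the debate should stop, or None to run another round."""
    # Hard limits take precedence over everything else.
    if state.tokens_used >= token_budget:
        return "budget_cap"
    if state.round_no >= max_rounds:
        return "round_cap"
    if state.judge_confidence is not None:
        if state.judge_confidence >= exit_above:
            return "judge_confident"
        if state.judge_confidence <= escalate_below:
            return "judge_escalate"
    if not state.new_evidence_ids:
        return "no_new_evidence"
    if len(set(state.final_answers)) == 1:
        return "agreement"
    return None
```

Returning the reason rather than a boolean lets the caller treat each exit differently, for example by falling back to a single-shot judge on `budget_cap` or by routing to a human on `judge_escalate`.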

For high-stakes decisions, treat “unanimous agreement” as insufficient. Require the judge to list residual uncertainties and route low-margin outcomes to humans. Debate can improve calibration; it does not eliminate tail risk.

Moderator and judge prompts

The moderator is not a passive transcript reader. Effective judge prompts include:

The original task and rubric (fraud vs legitimate, correct numeric answer, policy compliance).
Evidence hierarchy — e.g., “Tool-verified facts beat rhetorical claims; retrieved policy text beats model memory.”
Instructions to output structured verdicts: winner, confidence, key_disputes[], unsupported_claims[].
Explicit ban on introducing new facts not raised in debate unless the judge runs a verification tool pass.
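A structured verdict is only useful if it is validated before anything acts on it. One possible parser is sketched below; the field names come from the list above, while the `Verdict` class, the allowed winner labels, and the 0 to 1 confidence range are assumptions.

```python
import json
from dataclasses import dataclass
from typing import List, Sequence

REQUIRED_FIELDS = ("winner", "confidence", "key_disputes", "unsupported_claims")


@dataclass
class Verdict:
    winner: str
    confidence: float
    key_disputes: List[str]
    unsupported_claims: List[str]


def parse_verdict(raw: str,
                  allowed_winners: Sequence[str] = ("proponent", "critic")) -> Verdict:
    """Parse the judge's JSON output and reject malformed verdicts."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["winner"] not in allowed_winners:
        raise ValueError(f"unexpected winner: {data['winner']!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Verdict(
        winner=data["winner"],
        confidence=confidence,
        key_disputes=list(data["key_disputes"]),
        unsupported_claims=list(data["unsupported_claims"]),
    )


verdict = parse_verdict(
    '{"winner": "critic", "confidence": 0.82, '
    '"key_disputes": ["name mismatch"], "unsupported_claims": []}'
)
```

A verdict that fails to parse is best treated like a low-confidence one and sent to a human, rather than retried silently.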

Use a model at least as capable as the debaters for the judge role. A small judge model often rubber-stamps whichever debater sounds more confident, reintroducing the single-model failure mode.

Harbor Analytics refactor (worked example)

Before: One-shot fraud reviewer on GPT-4o with top-12 RAG features; 19% false-positive rate on auto-freeze queue; auditors could not see why the model preferred freeze over review.

After (assigned-role debate):

Shared feature bundle: transaction graph summary, KYC fields, merchant history (same retrieval for both sides).
Proponent (GPT-4o): “Argue strongest fraud case; cite feature IDs.”
Critic (Claude Sonnet): “Argue legitimate; explain each proponent signal benignly; cite feature IDs.”
Round 1 openings, Round 2 rebuttals (parallel within round).
Moderator (GPT-4o): structured JSON verdict + recommended action (auto-freeze / manual review / clear).
If confidence < 0.7 or unsupported_claims non-empty, force manual review regardless of winner.
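The override in the last step is simple enough to state as code. This sketch assumes the moderator's recommendation arrives as one of three string labels; the function name and label spellings are hypothetical, and the 0.7 floor is the value quoted above.

```python
from typing import Sequence

ACTIONS = ("auto_freeze", "manual_review", "clear")


def final_action(recommended: str, confidence: float,
                 unsupported_claims: Sequence[str],
                 min_confidence: float = 0.7) -> str:
    """Apply the manual-review override to the moderator's recommendation."""
    if recommended not in ACTIONS:
        raise ValueError(f"unknown action: {recommended}")
    # Low confidence or any unsupported claim forces a human look,
    # regardless of which side the moderator declared the winner.
    if confidence < min_confidence or len(unsupported_claims) > 0:
        return "manual_review"
    return recommended
```

For example, `final_action("auto_freeze", 0.93, ["F-019 does not exist"])` returns `"manual_review"`: a confident verdict is still overridden when a cited claim could not be supported.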

Guardrails: Debaters could not invent feature IDs; a post-step validator checked every cited ID against the bundle. Latency p95 rose from 3.1s to 9.8s (within the 12s SLA). Cost per scored case rose from $0.011 to $0.041, roughly 3.7x, but still a small fraction of the $6.50 cost of a human fraud analyst touch.
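A post-step validator of this kind can be very small. The sketch below assumes feature IDs look like `F-012`; that format, the regular expression, and the function name are invented for illustration, since the source does not specify how Harbor Analytics encoded its IDs.

```python
import re
from typing import Iterable, List

# Hypothetical ID format: "F-" followed by exactly three digits.
FEATURE_ID = re.compile(r"\bF-\d{3}\b")


def invented_ids(statement: str, bundle_ids: Iterable[str]) -> List[str]:
    """Return cited feature IDs that are absent from the shared bundle."""
    cited = set(FEATURE_ID.findall(statement))
    return sorted(cited - set(bundle_ids))


bundle = {f"F-{n:03d}" for n in range(1, 13)}  # F-001 through F-012
statement = "Velocity spike (F-003) and address mismatch (F-019) both point to fraud."
bad_ids = invented_ids(statement, bundle)
```

Any non-empty result can be appended to the verdict's `unsupported_claims`, which in turn triggers the manual-review override.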

Results: False-positive auto-freezes down 34%; recall down 1.2pp (within noise on monthly eval); audit satisfaction up because transcripts showed both sides' reasoning.

Technique decision table

Technique	Prefer when	Avoid when
Multi-agent debate	Binary or multi-class decisions; adversarial fact-checking; errors hide in confident tone	Customer-facing chat needing warm tone; sub-5s latency; creative writing
Mixture of agents	Synthesize diverse drafts into one polished answer; prose quality matters	Need explicit challenge of a dominant wrong hypothesis
Self-consistency	Extractable numeric or multiple-choice answers; same model, cheap samples	Errors are correlated across samples; need role diversity
Chain-of-verification	Single draft with planned fact-check questions; lower token cost than debate	Initial draft bias steers all verification questions
Tree of thought	Search over reasoning paths with scoring	Dispute is about evidence interpretation, not branch exploration
Single model + RAG	High retrieval quality; low ambiguity; cost-sensitive	Systematic false confidence on edge cases

Common pitfalls

Debate theater — agents trade rhetoric without citing verifiable evidence. Require citation IDs or tool outputs per claim.
Shared blind spots — homogeneous debaters agree on wrong facts. Use different model families or tool access per role.
Consensus on hallucination — stopping at unanimous agreement without judge scrutiny. Always run a judge pass with evidence rules.
Unbounded rounds — agents repeat the same rebuttal. Cap rounds and use no-new-evidence stopping.
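Leaking the answer in the prompt: debaters that see the ground-truth label or an internal risk score argue toward it instead of the evidence. Giving debaters the label is tolerable in offline training runs but invalidates any accuracy measurement; in production, roles must not see internal risk scores.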
Ignoring tone externalities — debate transcripts are adversarial; never show raw debate to end users. Publish only the judge's sanitized output.
Weak judge — a small model picks the louder debater. Match judge capability to task stakes.
No transcript logging — compliance and debugging require full round-by-round logs with model versions.
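For the logging pitfall, one workable format is a JSON Lines file with one record per turn. The field names below are assumptions, as is the choice of JSON Lines; the point is only that every record carries the round number and the model version.

```python
import io
import json
from datetime import datetime, timezone
from typing import Dict, TextIO


def log_turn(stream: TextIO, case_id: str, round_no: int, role: str,
             model_version: str, text: str) -> Dict[str, object]:
    """Append one debate turn to a JSON Lines stream and return the record."""
    record = {
        "case_id": case_id,
        "round": round_no,
        "role": role,
        "model_version": model_version,
        "text": text,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# An in-memory stream for the demo; production code would open a file
# or send the record to a log pipeline instead.
log = io.StringIO()
log_turn(log, "case-42", 1, "proponent", "model-a-2024-05", "Opening statement.")
log_turn(log, "case-42", 1, "critic", "model-b-2024-06", "Opening statement.")
lines = log.getvalue().splitlines()
```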

Production checklist

Define decision rubric and acceptable error trade-offs (precision vs recall).
Choose protocol: symmetric, assigned roles, or tool-grounded debate.
Assign 2–3 debaters with intentional diversity (model family, prompt stance).
Write opening and rebuttal prompts with citation or tool requirements.
Implement round caps and no-new-evidence stopping.
Deploy a capable moderator with structured verdict schema.
Validate cited evidence IDs against source bundles post-debate.
Route low-confidence and disputed cases to human review.
Log full transcripts, model IDs, and judge outputs for audit.
Ablate rounds and debater count on held-out set; measure FP/FN vs cost and latency.
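The final checklist item reduces to computing the same two metrics for every configuration and laying them next to cost. In the sketch below the predictions and labels are synthetic, the two cost figures are the per-case costs quoted in the worked example, and false-positive rate is defined as false positives over all actual negatives, which is an assumption about the metric intended.

```python
from typing import Dict, Sequence, Tuple


def fp_rate_and_recall(predicted: Sequence[int],
                       actual: Sequence[int]) -> Tuple[float, float]:
    """Compute (false-positive rate, recall) for binary labels, 1 = fraud."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return fp_rate, recall


# Synthetic held-out labels and per-configuration predictions.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
runs = {
    "single_shot": ([1, 1, 1, 1, 1, 0, 0, 0], 0.011),
    "debate_2_rounds": ([1, 1, 0, 1, 0, 0, 0, 0], 0.041),
}

report: Dict[str, Dict[str, float]] = {}
for name, (predictions, cost_per_case) in runs.items():
    fp_rate, recall = fp_rate_and_recall(predictions, labels)
    report[name] = {
        "fp_rate": fp_rate,
        "recall": recall,
        "cost_per_case": cost_per_case,
    }
```

Adding latency to each entry and repeating the loop over debater counts gives the full ablation grid the checklist asks for.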

Key takeaways

Multi-agent debate forces explicit rebuttal — agents must respond to counterarguments instead of publishing one confident draft.
Assigned adversarial roles (proponent vs critic) reduce premature consensus on the wrong conclusion.
Harbor Analytics cut false-positive auto-freezes by 34% with a two-round debate plus structured moderator verdict.
Stopping rules and evidence-grounded citations matter more than adding debaters — debate theater wastes tokens without improving accuracy.
Use mixture of agents for prose synthesis; use debate for high-stakes classification, fraud review, and disputed fact patterns.