LLM mixture of agents explained

Harbor Support’s policy Q&A bot handled straightforward fee and limit questions well, but complex cases (for example, “Can a joint account holder dispute an international wire after the 45-day window if fraud was reported to local police but not to Harbor?”) triggered confident wrong answers from a single GPT-4-class model. Retrieval helped, yet one generator still collapsed nuanced exceptions into a binary yes/no. The team rebuilt the answer path as a mixture of agents (MoA): three heterogeneous proposer models drafted independent answers in parallel, a stronger aggregator model synthesized the best reasoning, and a second layer repeated the pattern on the merged draft. Offline eval on 240 held-out policy tickets improved exact-policy citation accuracy from 71% to 86% and cut escalations to human agents from 22% to 18%, at roughly 4.2× token cost versus single-shot generation. The win came from diverse drafts plus explicit synthesis, not from running the same model five times.

Mixture of agents is an inference-time architecture where multiple LLM “proposers” generate candidate responses to the same prompt, then one or more “aggregator” models read all proposals and produce a refined answer — optionally stacked across layers like a neural network. Popularized by Together AI’s 2024 MoA paper on open models, the pattern exploits complementarity: a smaller fast model may catch a clause a larger model misses, and the aggregator can resolve contradictions without hand-written voting rules. This guide covers the proposer-aggregator loop, layer depth and diversity tuning, the Harbor Support refactor, a technique decision table vs self-consistency and multi-agent debate, pitfalls, and a production checklist — building on our AI agents and tool use explainer.

How mixture of agents works

MoA treats answer generation as a collective refinement pipeline rather than a single forward pass:

  1. Input — user question plus any retrieved context (RAG chunks, tool outputs, policy excerpts).
  2. Proposer layer — N LLMs (often different families, sizes, or system prompts) each produce a full draft answer independently, with temperature > 0 for diversity.
  3. Aggregator layer — one model receives the original question and all N proposals, instructed to synthesize the strongest reasoning, resolve conflicts, and emit a single improved answer.
  4. Optional deeper layers — the aggregator’s output becomes input to another proposer round (multiple aggregators propose; a meta-aggregator merges) for hard tasks.
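
The four steps above can be sketched as a small orchestration function. This is a minimal sketch rather than a reference implementation: `Model`, `run_layer`, and `mixture_of_agents` are hypothetical names, and each model is assumed to be a plain callable from prompt text to reply text so the example runs without any API client. The stub replies are made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any callable mapping a prompt string to a reply string.
Model = Callable[[str], str]
Layer = Tuple[Sequence[Model], Model]  # (proposers, aggregator)


def run_layer(question: str, context: str, prior: List[str],
              proposers: Sequence[Model], aggregator: Model) -> str:
    """One proposer-aggregator round; `prior` holds the draft from the previous layer."""
    base = f"Question: {question}\nContext: {context}\n"
    if prior:
        base += "Previous draft(s):\n" + "\n".join(prior) + "\n"
    # Step 2: each proposer drafts independently from the same input.
    proposals = [propose(base + "Write a full draft answer.") for propose in proposers]
    # Step 3: the aggregator sees the question plus every labeled proposal.
    labeled = "\n".join(
        f"Proposal {chr(ord('A') + i)}: {text}" for i, text in enumerate(proposals)
    )
    return aggregator(base + labeled + "\nSynthesize one improved answer.")


def mixture_of_agents(question: str, context: str, layers: Sequence[Layer]) -> str:
    """Apply each (proposers, aggregator) layer in order; needs at least one layer."""
    draft: List[str] = []
    for proposers, aggregator in layers:
        # Step 4: the previous aggregator output feeds the next round.
        draft = [run_layer(question, context, draft, proposers, aggregator)]
    return draft[0]


# Stub models with made-up replies so the sketch runs offline.
def literal(prompt: str) -> str:
    return "Quote the policy section verbatim."


def friendly(prompt: str) -> str:
    return "Explain the policy in plain language."


def merge(prompt: str) -> str:
    return f"merged {prompt.count('Proposal ')} proposals"


answer = mixture_of_agents("Dispute window?", "policy text", [([literal, friendly], merge)])
```

In a real deployment each callable would wrap an LLM API call, and the proposer calls inside `run_layer` would run concurrently instead of in a list comprehension.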

Unlike mixture of experts (MoE) inside a single pretrained model, MoA is an orchestration pattern at serving time. You can add or swap proposers without retraining. Latency grows with layer count, but proposers within a layer run concurrently, so wall-clock time is often dominated by the slowest proposer plus one aggregator call per layer.
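
The latency claim can be turned into a back-of-the-envelope estimate. The sketch below assumes perfect parallelism inside a layer and ignores network and queueing overhead; `estimate_wall_clock` is a hypothetical helper and the latencies are illustrative numbers, not measurements.

```python
from typing import List, Tuple

# Each layer is (proposer latencies in seconds, aggregator latency in seconds).
LayerLatency = Tuple[List[float], float]


def estimate_wall_clock(layers: List[LayerLatency]) -> float:
    """Sum, over layers, the slowest proposer plus that layer's aggregator call."""
    total = 0.0
    for proposer_latencies, aggregator_latency in layers:
        # Proposers run concurrently, so the layer waits only for the slowest one.
        total += max(proposer_latencies) + aggregator_latency
    return total


# Illustrative latencies only (assumed values).
one_layer = estimate_wall_clock([([1.2, 3.5, 2.0], 2.5)])
two_layers = estimate_wall_clock([([1.2, 3.5, 2.0], 2.5), ([1.0, 1.5], 2.0)])
print(one_layer, two_layers)
```

Adding a fourth proposer that is faster than the slowest existing one leaves the estimate unchanged, while adding a layer always adds at least one aggregator call.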

Proposer diversity: the main lever

MoA gains come primarily from uncorrelated errors. Running the same model five times with different seeds helps less than combining models trained on different data, architectures, or instruction styles. Practical diversity axes:

  • Model family — e.g., Claude + GPT + Llama proposers; aggregators are often a frontier model strong at synthesis.
  • Size tier — an 8B model may quote numeric tables literally while a 70B model paraphrases; the aggregator can prefer the literal cite.
  • System prompt stance — one proposer “conservative, cite only retrieved text,” another “explain implications for the customer,” a third “list edge cases and exceptions.”
  • Tool access — one proposer with calculator or policy-search tools, others text-only; aggregator merges tool-grounded facts with narrative clarity.

Harbor Support used three proposers: a retrieval-heavy agentic RAG agent (tools + two retrieval passes), a fast small model for literal policy quotes, and a frontier model for customer-facing tone. The aggregator prompt explicitly asked: “Prefer verbatim policy citations from Proposal B when numbers or dates conflict; use Proposal C for clarity; discard unsupported claims.”
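
One way to keep these diversity axes visible is to declare proposers as data and count distinct values per axis before deploying. This is a sketch under stated assumptions: `ProposerSpec` and `diversity_report` are hypothetical names, and the family and tool labels are placeholders, not Harbor Support's actual configuration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ProposerSpec:
    label: str
    family: str                   # model family (placeholder strings below)
    size_tier: str                # e.g. "small", "large", "frontier"
    stance: str                   # system prompt stance
    tools: Tuple[str, ...] = ()   # tool names; empty tuple means text-only


def diversity_report(specs: List[ProposerSpec]) -> Dict[str, int]:
    """Count distinct values per axis; a count of 1 means that axis adds no diversity."""
    return {
        "family": len({s.family for s in specs}),
        "size_tier": len({s.size_tier for s in specs}),
        "stance": len({s.stance for s in specs}),
        "tools": len({s.tools for s in specs}),
    }


# Placeholder specs loosely shaped like a three-proposer setup (assumed values).
mixed = [
    ProposerSpec("A", "family-1", "large", "retrieval-heavy", ("policy_search",)),
    ProposerSpec("B", "family-2", "small", "literal policy quotes"),
    ProposerSpec("C", "family-3", "frontier", "customer-facing tone"),
]
# Five identical specs: the homogeneous case.
same = [ProposerSpec(str(i), "family-1", "large", "default") for i in range(5)]

print(diversity_report(mixed))
print(diversity_report(same))
```

Distinct labels do not guarantee uncorrelated errors, so a report like this is only a pre-flight check; the eval set still decides whether the mix helps.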

Aggregator prompts and conflict resolution

The aggregator is not a simple majority vote. Effective prompts include:

  • The original user question unchanged.
  • All proposer outputs labeled Proposal A/B/C (with model metadata for debugging).
  • Instructions to identify agreements, contradictions, and gaps before writing the final answer.
  • Output format constraints — e.g., JSON with fields citations[], answer, confidence, unresolved_flags[] for downstream routing.
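
A prompt builder and reply parser for the elements listed above might look like the following. It is a minimal sketch: the function names and prompt wording are assumptions, model metadata is left out of the prompt and assumed to be logged separately, and the JSON keys are the ones named in the last bullet.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("citations", "answer", "confidence", "unresolved_flags")


def build_aggregator_prompt(question: str, proposals: List[str]) -> str:
    """Assemble the question, labeled proposals, and output instructions."""
    labeled = [f"Proposal {chr(ord('A') + i)}:\n{text}" for i, text in enumerate(proposals)]
    parts = [
        f"User question (unchanged):\n{question}",
        *labeled,
        "Identify agreements, contradictions, and gaps across the proposals "
        "before deciding on the final answer.",
        "Reply with a single JSON object using the keys: " + ", ".join(REQUIRED_FIELDS) + ".",
    ]
    return "\n\n".join(parts)


def parse_aggregator_reply(raw: str) -> Dict[str, Any]:
    """Parse the reply and fail loudly if a required field is missing."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"aggregator reply missing fields: {missing}")
    return data


prompt = build_aggregator_prompt("Can I dispute this wire?", ["Draft one.", "Draft two."])
reply = parse_aggregator_reply(
    '{"citations": [], "answer": "See policy.", "confidence": 0.9, "unresolved_flags": []}'
)
```

Treating a malformed reply as an error, rather than passing free text downstream, keeps the routing logic simple: anything that fails to parse can be retried or escalated.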

When proposers disagree on a material fact, the aggregator should either (a) choose the proposal best supported by retrieved sources, (b) trigger a re-retrieval tool call if wired, or (c) emit a low-confidence flag for human review. Silent averaging of contradictions produces the worst of both worlds: fluent but wrong.
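
Options (b) and (c) can be enforced in code after the aggregator replies, while option (a) stays inside the aggregator prompt. The sketch below assumes the aggregator returns a dict with numeric `confidence` in [0, 1] and a list `unresolved_flags`; the `route` function, the route names, and the 0.7 threshold are hypothetical.

```python
from typing import Any, Dict


def route(reply: Dict[str, Any], min_confidence: float = 0.7,
          can_re_retrieve: bool = False) -> str:
    """Decide what happens to an aggregator reply instead of silently sending it."""
    if reply["unresolved_flags"]:
        # A material contradiction survived synthesis: fetch more evidence if a
        # retrieval tool is wired in, otherwise hand off to a person.
        return "re_retrieve" if can_re_retrieve else "human_review"
    if reply["confidence"] < min_confidence:
        return "human_review"
    return "send_answer"


clean = {"answer": "...", "citations": ["s1"], "confidence": 0.92, "unresolved_flags": []}
conflicted = {"answer": "...", "citations": [], "confidence": 0.92,
              "unresolved_flags": ["dates conflict between proposals"]}
unsure = {"answer": "...", "citations": ["s1"], "confidence": 0.4, "unresolved_flags": []}

print(route(clean), route(conflicted), route(conflicted, can_re_retrieve=True), route(unsure))
```

Checking flags before confidence means a fluent, high-confidence answer that still carries an unresolved contradiction never reaches the customer unreviewed.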

For regulated domains, log the full proposal bundle per request. Auditors and LLM-as-judge evaluators need to see which proposer introduced an error and whether the aggregator fixed or amplified it.

Layer depth, width, and cost

Width = proposers per layer; depth = number of proposer-aggregator rounds. Together’s published stacks used 2–3 layers with 4–6 proposers, beating single-model baselines on AlpacaEval and MT-Bench with open-weights models. Diminishing returns appear quickly:

  • One layer (3 proposers + 1 aggregator) captures most of the benefit on many enterprise Q&A tasks.
  • Two layers help when first-pass proposals are noisy but contain complementary partial solutions — legal, medical, and multi-policy synthesis.
  • Three+ layers rarely justify cost unless eval proves a measurable jump and latency SLAs allow 15–30+ seconds end-to-end.

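Token math: if each proposer emits 400 tokens and the aggregator reads 1,200 tokens of proposals plus writes 400, one layer with three proposers produces 1,600 output tokens, or 4× a single 400-token answer. On the input side, each of the three proposers reads the prompt, and the aggregator reads the prompt again plus the 1,200 proposal tokens, so input is roughly 4× the prompt plus 1,200. Budget caps per layer (max tokens per proposer, early stop if two proposers agree verbatim on citations) keep spend predictable.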
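
The same arithmetic as a small calculator. This sketch assumes the aggregator re-reads the full question and context, which adds one more copy of the prompt to the input side. `layer_tokens` is a hypothetical helper and the 2,000-token prompt is an illustrative value; the 400-token outputs come from the paragraph above.

```python
from typing import Tuple


def layer_tokens(prompt_tokens: int, n_proposers: int,
                 proposer_output: int, aggregator_output: int) -> Tuple[int, int]:
    """Return (input_tokens, output_tokens) for one proposer-aggregator layer."""
    proposals = n_proposers * proposer_output       # tokens written by proposers
    proposer_input = n_proposers * prompt_tokens    # each proposer reads the prompt
    aggregator_input = prompt_tokens + proposals    # prompt again plus all drafts
    return proposer_input + aggregator_input, proposals + aggregator_output


# Assumed 2,000-token prompt (question plus retrieved chunks); 400-token drafts.
moa_in, moa_out = layer_tokens(prompt_tokens=2000, n_proposers=3,
                               proposer_output=400, aggregator_output=400)
single_in, single_out = 2000, 400

print(moa_in, moa_out)                              # input and output token totals
print(moa_in / single_in, moa_out / single_out)     # multipliers vs one call
```

Because input and output tokens are usually priced differently, the dollar multiplier depends on the price ratio and on how long the retrieved context is.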

Harbor Support refactor (worked example)

Before: Single GPT-4o call with top-8 RAG chunks; 71% citation accuracy on compliance-reviewed tickets; 22% routed to tier-2 humans for policy ambiguity.

After (two-layer MoA):

  1. Shared retrieval pass feeds all layer-1 proposers the same chunk bundle.
  2. Proposer 1: agentic RAG sub-agent (rewrite query, re-fetch if grade < 0.7).
  3. Proposer 2: Llama-3.1-8B with “quote policy section IDs only” prompt.
  4. Proposer 3: GPT-4o with customer-tone system prompt, no tools.
  5. Aggregator 1: Claude Sonnet synthesizes unified draft + citation list.
  6. Layer 2: two proposers (GPT-4o mini, Gemini Flash) critique the draft for missing exceptions; Aggregator 2 (GPT-4o) produces final customer reply.

Guardrails: If Aggregator 1’s unresolved_flags field is non-empty, skip layer 2 and route to human. If any proposer latency exceeds 8s, proceed with the proposals already available (degraded MoA rather than a full timeout). Cache retrieval embeddings across proposers to avoid triple index lookups.
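
The timeout and skip-layer guardrails could be implemented along these lines with `asyncio`. This is a sketch, not Harbor Support's code: `gather_with_deadline` and `next_step` are hypothetical names, proposers are assumed to be async callables, and the demo uses a 0.2s deadline with sleeping stubs so it finishes quickly (the text above uses 8s).

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

AsyncModel = Callable[[str], Awaitable[str]]


async def gather_with_deadline(proposers: Dict[str, AsyncModel], prompt: str,
                               timeout_s: float = 8.0) -> Dict[str, str]:
    """Run proposers concurrently; keep whatever finished before the deadline."""
    if not proposers:
        return {}
    tasks = {asyncio.create_task(fn(prompt)): name for name, fn in proposers.items()}
    done, pending = await asyncio.wait(list(tasks), timeout=timeout_s)
    for task in pending:
        task.cancel()  # degraded MoA: drop stragglers instead of failing the request
    # Keep only proposers that completed without raising.
    return {tasks[t]: t.result() for t in done if t.exception() is None}


def next_step(aggregator_1_reply: Dict[str, Any]) -> str:
    """Skip layer 2 and escalate when the first aggregator left anything unresolved."""
    return "route_to_human" if aggregator_1_reply.get("unresolved_flags") else "run_layer_2"


# Stub proposers that only sleep, standing in for real API calls.
async def fast(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return "fast draft"


async def slow(prompt: str) -> str:
    await asyncio.sleep(1.0)
    return "slow draft"


available = asyncio.run(gather_with_deadline({"B": fast, "C": slow}, "q", timeout_s=0.2))
print(available)
```

A production version would probably also require a minimum number of surviving proposals before calling the aggregator, since a single draft gives it nothing to synthesize.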

Results: 86% citation accuracy, 18% tier-2 escalation rate, p95 latency 11.4s (SLA 15s). Cost per answered ticket rose from $0.009 to $0.038 — acceptable where human tier-2 costs $4.20 per touch.
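
The "acceptable" judgment follows from simple per-ticket arithmetic on the figures above. The sketch assumes one tier-2 touch per escalated ticket and that both escalation rates were measured on comparable ticket populations; the variable names are illustrative.

```python
# Figures quoted in the before/after results.
cost_before, cost_after = 0.009, 0.038   # model cost per answered ticket, dollars
esc_before, esc_after = 0.22, 0.18       # tier-2 escalation rate
tier2_cost = 4.20                        # dollars per human tier-2 touch

# Extra model spend per ticket.
extra_model_cost = cost_after - cost_before
# Expected human cost avoided per ticket (assumes one touch per escalation).
escalation_savings = (esc_before - esc_after) * tier2_cost
net_saving = escalation_savings - extra_model_cost
cost_ratio = cost_after / cost_before

print(f"extra model cost per ticket: ${extra_model_cost:.3f}")
print(f"escalation savings per ticket: ${escalation_savings:.3f}")
print(f"net saving per ticket: ${net_saving:.3f}")
print(f"model cost ratio: {cost_ratio:.1f}x")
```

Under these assumptions the avoided human cost per ticket is several times the added model cost, which is why the higher per-ticket spend still pays off.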

Technique decision table

| Technique | Prefer when | Avoid when |
| --- | --- | --- |
| Mixture of agents | Hard synthesis; diverse model fleet; quality worth 3–5× tokens | Sub-second chat; trivial FAQs; single model already >95% accurate |
| Self-consistency | Same model, reasoning tasks with verifiable answer extraction | Need stylistic or citation diversity; proposals too correlated |
| Multi-agent debate | Adversarial fact-checking; math proofs; explicit rebuttal rounds | Customer-facing tone must stay cooperative; latency budget tight |
| Tree of thought | Search over reasoning branches with scoring | Final answer is prose synthesis, not a single extracted value |
| Single model + RAG | High retrieval quality; simple questions; cost-sensitive | Systematic blind spots on multi-clause policy or multi-hop evidence |
| Model routing / cascade | Predictable difficulty tiers; cheap model handles 80% | Hard cases need multiple perspectives, not one bigger model |

Common pitfalls

  • Homogeneous proposers — five samples from one model behave like expensive self-consistency without MoA’s synthesis benefit.
  • Aggregator overload — stuffing ten long proposals exceeds context or dilutes attention; cap proposer length and count.
  • Error amplification — a persuasive but wrong proposer can poison synthesis; include source-grounding instructions and citation requirements.
  • No eval per layer — without ablation (1 vs 2 vs 3 proposers), teams pay 4× cost for 2% gain.
  • Ignoring latency SLOs — sequential aggregators across layers add up; parallelize proposers aggressively.
  • Secret proposer variance — changing proposer models without versioned prompts breaks regression tests; pin model IDs and prompts.
  • Skipping human path — MoA reduces errors but does not eliminate them; keep confidence-based escalation.
  • Privacy / data leakage — sending customer data to many third-party APIs multiplies the data processing agreement (DPA) surface; document which proposers see PII.

Production checklist

  • Define quality metric and baseline single-model score on held-out set.
  • Select 2–4 proposers with intentional diversity (family, size, prompt, tools).
  • Write aggregator prompt with conflict-resolution rules and output schema.
  • Implement parallel proposer calls with per-proposer timeout and fallback.
  • Share retrieval context across proposers; avoid redundant index queries.
  • Log proposals, aggregator inputs, and final outputs for audit and eval.
  • Ablate width and depth; stop when marginal accuracy gain < cost threshold.
  • Set token and dollar budget per request; degrade to fewer proposers under load.
  • Wire low-confidence and unresolved_flags to human or specialist agent.
  • Monitor p95 latency, cost per ticket, and escalation rate in production dashboards.
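
The ablation stop rule in the checklist can be written as a short selection function. This is a sketch under assumptions: `pick_config` and the threshold are hypothetical, the first two rows reuse the Harbor Support figures from this guide, and the three-layer row is an invented placeholder included only to show the rule stopping.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (name, accuracy, cost per request in dollars)


def pick_config(results: List[Config], min_gain_per_dollar: float) -> str:
    """Walk configs in increasing cost order; stop at the first step not worth its cost."""
    ordered = sorted(results, key=lambda r: r[2])
    chosen = ordered[0]
    for name, accuracy, cost in ordered[1:]:
        extra_cost = cost - chosen[2]
        gain = accuracy - chosen[1]
        # Stop when the marginal accuracy gain per extra dollar falls below the threshold.
        if extra_cost <= 0 or gain / extra_cost < min_gain_per_dollar:
            break
        chosen = (name, accuracy, cost)
    return chosen[0]


ablation = [
    ("single model", 0.71, 0.009),     # baseline figures from this guide
    ("two-layer MoA", 0.86, 0.038),    # MoA figures from this guide
    ("three-layer MoA", 0.87, 0.060),  # invented placeholder, not a measured result
]
# Assumed threshold: at least 2 accuracy points per extra cent (2.0 per dollar).
best = pick_config(ablation, min_gain_per_dollar=2.0)
print(best)
```

The threshold is a business decision; a team with cheaper escalations or tighter budgets would set it higher and might stay on the single-model baseline.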

Key takeaways

  • Mixture of agents runs diverse LLM proposers in parallel, then an aggregator synthesizes their drafts — optionally across multiple layers.
  • Harbor Support improved policy citation accuracy from 71% to 86% with a two-layer MoA stack at roughly 4× token cost.
  • Diversity matters more than raw proposer count — heterogeneous models and prompts beat five samples from one model.
  • The aggregator prompt must explicitly handle contradictions, citations, and confidence — not silently blend wrong facts.
  • Ablate layers and width on your eval set; MoA is a quality knob, not a default for every chat endpoint.
