LLM agent canary deployment and shadow traffic systems explained

Harbor Support shipped a new refund-handling agent on a Friday afternoon: updated system prompt, a cheaper model tier, and a rewritten issue_credit tool schema. The deploy script flipped 100% of production traffic in one step. By Monday, refund approval accuracy had fallen from 94% to 71%, duplicate credits hit 3.2% of approved runs, and mean time to resolution stretched 18 minutes. Rollback took four hours because checkpoint state from the new version was incompatible with the old binary. After Harbor rebuilt release plumbing around shadow replay, staged canary splits, and automated promotion gates, the share of deploys causing user-visible regressions dropped from 12% to 0.8% and rollback time fell to under 90 seconds.

Canary deployment routes a small slice of real traffic to a new agent version while the stable version serves everyone else. Shadow traffic runs the candidate against the same inputs as production but discards its outputs — you compare metrics without risking user-facing side effects. Together they form the release layer between offline agent evaluation and full cutover. This guide covers version identity, traffic routing, shadow execution architecture, promotion and rollback gates, metric selection, Harbor’s refactor, a technique decision table, pitfalls, and a production checklist.

Why agents need a different rollout model

Stateless API swaps are easy: flip a load balancer, watch error rates, roll back. Agents are stateful, tool-using, and non-deterministic. A new prompt can pass every offline golden test yet fail on live edge cases. Tool schema changes break idempotent replay. Model swaps alter token budgets and truncation behavior downstream. Multi-step runs span minutes — a mid-run deploy can leave half the conversation on v2 and half on v3.

Big-bang releases optimize for shipping speed until the first bad Friday. Canary and shadow systems trade a little latency and infrastructure cost for bounded blast radius and evidence-based promotion. They pair naturally with feature flags (per-tenant or per-task-class toggles) and distributed tracing so you can attribute metric deltas to a specific agent build.

Version identity and the release artifact

Every deployable agent build needs an immutable release fingerprint propagated through routing, logs, and traces:

Model route — provider, model ID, temperature, max tokens, and fallback ladder slot.
Prompt bundle hash — system prompt, tool descriptions, few-shot examples, and policy snippets versioned together.
Tool manifest revision — JSON Schema per tool, approval gates, and sandbox profile ID.
Middleware stack version — ordered hooks for guardrails, PII scrubbing, rate limits, and cost caps.
Runtime binary — orchestrator code that executes the agent loop; must be backward compatible with in-flight checkpoints.

Store the fingerprint on every span and run record. Promotion decisions compare agent_release=v2026.06.12-a3f9 against agent_release=v2026.06.05-b1c2 — not “the new model.” In multi-tenant setups, bind releases per tenant via tenant context so a canary for Tenant A never leaks into Tenant B.
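
One way to make the fingerprint concrete is to hash the five components into a single release ID. The sketch below is a minimal illustration, not Harbor's actual registry: the class name, field names, and example values are all assumptions, and only the v<date>-<short hash> shape is taken from the release IDs quoted above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseFingerprint:
    """Immutable identity of one agent build. Field names are illustrative."""
    model_route: str         # provider, model ID, sampling settings, fallback slot
    prompt_bundle_hash: str  # system prompt, tool descriptions, few-shots, policy
    tool_manifest_rev: str   # per-tool JSON Schemas, approval gates, sandbox profile
    middleware_version: str  # ordered guardrail, PII, rate-limit, and cost hooks
    runtime_version: str     # orchestrator code that runs the agent loop

    def digest(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def release_id(self, build_date: str) -> str:
        # Same shape as the IDs in the text: v<date>-<short hash>.
        return f"v{build_date}-{self.digest()[:4]}"


def span_attributes(fp: ReleaseFingerprint, build_date: str, tenant_id: str) -> dict:
    """Attributes to attach to every span and run record."""
    return {
        "agent_release": fp.release_id(build_date),
        "agent_release_digest": fp.digest(),
        "tenant_id": tenant_id,
    }


# Hypothetical builds: the candidate changes model, prompt, tools, and runtime.
stable = ReleaseFingerprint("provider-x/model-large", "p-1", "tools-7", "mw-3", "rt-2.4")
candidate = ReleaseFingerprint("provider-x/model-small", "p-2", "tools-8", "mw-3", "rt-2.5")
attrs = span_attributes(candidate, "2026.06.12", tenant_id="tenant-a")
```

Because the dataclass is frozen and the digest covers every field, changing any single component (for example, only the tool manifest) yields a different release ID, which is what lets a dashboard attribute a metric delta to one build.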

Shadow traffic architecture

Shadow mode executes the candidate agent on a copy of production inputs without committing side effects:

Tap ingress — after authentication and policy checks, fork the normalized request (user message, session state pointer, tool context) to a shadow queue.
Sandbox tool layer — shadow runs use stubbed or read-only tool adapters: issue_credit returns a synthetic success without hitting the ledger; database tools run against snapshots or read replicas.
Parallel execution — stable version serves the user; shadow version runs asynchronously with its own timeout budget (often 1.5× stable latency cap).
Diff capture — record trajectory divergence: tool call sequences, final structured outputs, token cost, guardrail triggers, and human-review escalations.
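
The four steps above can be sketched in a few functions. This is a simplified, synchronous illustration under stated assumptions: StubLedger, fork_for_shadow, run_shadow, diff_trajectories, and toy_candidate are hypothetical names, and a real system would run the shadow asynchronously from a queue with its own timeout.

```python
import copy
from typing import Callable, Dict, List


class StubLedger:
    """Shadow adapter for issue_credit: records the intent, writes nothing."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def issue_credit(self, order_id: str, amount: float) -> Dict:
        self.calls.append(("issue_credit", order_id, amount))
        return {"status": "ok", "synthetic": True}


def fork_for_shadow(request: Dict) -> Dict:
    # Deep copy so the shadow run can never mutate what the stable run sees.
    return copy.deepcopy(request)


def run_shadow(agent: Callable[[Dict, StubLedger], Dict], request: Dict) -> Dict:
    ledger = StubLedger()
    output = agent(fork_for_shadow(request), ledger)
    return {"output": output, "tool_calls": ledger.calls}


def diff_trajectories(stable_calls: List[tuple], shadow_calls: List[tuple]) -> Dict:
    stable_tools = [call[0] for call in stable_calls]
    shadow_tools = [call[0] for call in shadow_calls]
    return {
        "same_tool_sequence": stable_tools == shadow_tools,
        "same_arguments": stable_calls == shadow_calls,
        "stable_steps": len(stable_calls),
        "shadow_steps": len(shadow_calls),
    }


def toy_candidate(request: Dict, ledger: StubLedger) -> Dict:
    # Stand-in for the candidate agent loop: refunds the full order amount.
    request["seen_by"] = "candidate"  # mutation stays inside the forked copy
    result = ledger.issue_credit(request["order_id"], request["amount"])
    return {"decision": "refund", "tool_status": result["status"]}


request = {"order_id": "o-123", "amount": 40.0, "message": "Item arrived broken"}
shadow = run_shadow(toy_candidate, request)
stable_calls = [("issue_credit", "o-123", 25.0)]  # what the stable run actually did
diff = diff_trajectories(stable_calls, shadow["tool_calls"])
```

Here both versions call the same tool, but the candidate would have refunded a different amount. That is the kind of divergence worth surfacing before any real credit is issued.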

What shadow mode catches that offline eval misses

Live tool latency causing different replanning paths.
Truncated observations from oversized API responses.
Rate-limit interactions between concurrent tenant workloads.
Prompt injection attempts present only in production logs.
Checkpoint resume behavior after mid-run restarts.

Shadow is not free: you pay duplicate inference and queue depth. Cap shadow sampling at 5–20% of traffic for high-volume agents, or run 100% shadow for low-QPS critical paths during a release window. Drop shadow events under backpressure before dropping stable traffic.
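
A minimal sketch of sampling plus load shedding, assuming an in-process bounded buffer. ShadowSampler and its parameters are hypothetical names; a production system would more likely use a message queue with its own depth limits.

```python
import random
from collections import deque
from typing import Any, Deque, Optional


class ShadowSampler:
    """Samples requests into a bounded shadow buffer and sheds when it is full."""

    def __init__(self, sample_rate: float, max_depth: int, seed: Optional[int] = None):
        self.sample_rate = sample_rate  # 0.05 to 0.20 for high-volume agents
        self.max_depth = max_depth      # backpressure limit for pending shadow work
        self.pending: Deque[Any] = deque()
        self.dropped = 0
        self._rng = random.Random(seed)

    def offer(self, request: Any) -> bool:
        """Return True if the request was queued for shadow execution."""
        if self._rng.random() >= self.sample_rate:
            return False  # not part of the sample
        if len(self.pending) >= self.max_depth:
            self.dropped += 1  # shed shadow work; the stable path is untouched
            return False
        self.pending.append(request)
        return True


# Release window on a low-QPS critical path: 100% shadow, tiny buffer for the demo.
release_window = ShadowSampler(sample_rate=1.0, max_depth=2)
accepted = [release_window.offer({"run": i}) for i in range(3)]
```

The offer call never blocks and never raises, so the stable request path cannot be slowed down by a saturated shadow pool. The dropped counter is worth exporting as a metric, since heavy shedding weakens the comparison.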

Canary traffic splits

Once shadow metrics look acceptable, promote to live canary — real users, real side effects, small slice:

Routing dimensions

Split key	Use when	Risk
Random session ID hash	Default for homogeneous traffic	Low; sticky per session
Tenant ID allowlist	Design partners or internal dogfood	Low; explicit opt-in
Task class / intent	High-risk tools isolated	Medium; uneven load
Geography / region	Regulatory or latency testing	Medium; skewed demographics
User cohort (new vs returning)	Onboarding flows	High; metric comparability suffers
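
The first two rows of the table (session hash and tenant allowlist) can be combined in one routing function. The sketch below is an assumption-laden illustration: bucket, route, and the salt value are hypothetical, and the bucket count of 10,000 is an arbitrary choice that gives 0.01% granularity.

```python
import hashlib
from typing import Optional, Set


def bucket(session_id: str, salt: str = "") -> int:
    """Map a session to a stable bucket in [0, 10000)."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000


def route(
    session_id: str,
    tenant_id: str,
    canary_pct: float,
    tenant_allowlist: Optional[Set[str]] = None,
    salt: str = "refund-agent",
) -> str:
    """Return "canary" or "stable" for this session."""
    # With an allowlist set, only listed tenants are eligible for the canary.
    if tenant_allowlist is not None and tenant_id not in tenant_allowlist:
        return "stable"
    # The same session always lands in the same bucket, so routing is sticky.
    return "canary" if bucket(session_id, salt) < canary_pct * 100 else "stable"


sessions = [f"s{i}" for i in range(1000)]
at_5 = {s for s in sessions if route(s, "tenant-a", 5) == "canary"}
at_25 = {s for s in sessions if route(s, "tenant-a", 25) == "canary"}
```

Comparing the bucket against a threshold, rather than re-rolling at each stage, means every session in the 5% slice stays in the canary when the ramp moves to 25%, so no session flips back to stable mid-ramp.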

Staged ramp schedule

A typical promotion ladder for a financial agent like Harbor’s:

Stage 0 — 100% shadow, 0% live canary (24–72 h).
Stage 1 — 1% live canary, internal tenants only.
Stage 2 — 5% random session hash, all tenants.
Stage 3 — 25% with automated gate checks every 4 h.
Stage 4 — 100% on the candidate, which becomes the new stable; keep the previous version warm for 48 h for rollback.

Never ramp on Fridays or before holidays unless on-call coverage is explicit. Freeze ramps when upstream model providers announce maintenance.
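
The ladder and the freeze rules can be encoded so that the split controller refuses an unsafe ramp instead of relying on memory. This is a sketch under assumptions: the function names and the holiday set are hypothetical, "before holidays" is read as "the day before a listed holiday", and the human-approval threshold comes from the promotion-gate rule stated later in this guide.

```python
from datetime import date, timedelta
from typing import Set

RAMP_PERCENTAGES = [0, 1, 5, 25, 100]  # stage 0 is shadow-only (0% live canary)


def next_stage(current_pct: int) -> int:
    """Return the next live-canary percentage, staying at 100 once reached."""
    idx = RAMP_PERCENTAGES.index(current_pct)
    return RAMP_PERCENTAGES[min(idx + 1, len(RAMP_PERCENTAGES) - 1)]


def needs_human_approval(target_pct: int) -> bool:
    # Stages above 25% require a human sign-off.
    return target_pct > 25


def ramp_allowed(
    today: date,
    holidays: Set[date],
    provider_maintenance: bool,
    oncall_confirmed: bool = False,
) -> bool:
    # Announced upstream maintenance freezes ramps unconditionally.
    if provider_maintenance:
        return False
    # Monday is 0 in date.weekday(), so Friday is 4.
    risky_day = today.weekday() == 4 or (today + timedelta(days=1)) in holidays
    return oncall_confirmed or not risky_day


friday = date(2026, 6, 12)
wednesday = date(2026, 6, 10)
```

For example, ramp_allowed(friday, set(), provider_maintenance=False) returns False, while the same call with oncall_confirmed=True returns True.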

Promotion gates and rollback triggers

Automated gates compare canary vs stable over a minimum sample size (e.g. 500 completed runs or 10,000 tool calls). Human approval is required for stages above 25%.
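
The minimum-sample rule and a basic significance check can be sketched as follows. The guide does not say which statistical test Harbor uses, so the pooled two-proportion z-test here is an assumption chosen for illustration, and the function names are hypothetical. The statistic is

$$z = \frac{\hat{p}_s - \hat{p}_c}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_s} + \frac{1}{n_c}\right)}}, \qquad \hat{p} = \frac{x_s + x_c}{n_s + n_c},$$

where $x$ counts successful runs, $n$ counts completed runs, and the subscripts $s$ and $c$ denote stable and canary.

```python
import math


def has_min_sample(
    completed_runs: int,
    tool_calls: int,
    min_runs: int = 500,
    min_tool_calls: int = 10_000,
) -> bool:
    """Gate checks only count once the canary has produced enough evidence."""
    return completed_runs >= min_runs or tool_calls >= min_tool_calls


def two_proportion_z(success_s: int, n_s: int, success_c: int, n_c: int) -> float:
    """Pooled z statistic for stable success rate minus canary success rate."""
    pooled = (success_s + success_c) / (n_s + n_c)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_s + 1 / n_c))
    if std_err == 0:
        return 0.0  # both groups all-success or all-failure: no measurable gap
    return (success_s / n_s - success_c / n_c) / std_err


# Stable at 94% over 1,000 runs versus a canary at 71% over 500 runs.
z = two_proportion_z(940, 1000, 355, 500)
significant_drop = z > 1.96  # rough two-sided 5% threshold
```

A positive z means the canary is doing worse than stable. Below the minimum sample, a large z is still worth a soft hold, but it should not drive an automated promotion.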

Hard rollback triggers (auto-revert)

Error rate > stable + 2σ for 15 consecutive minutes.
Any guardrail bypass or unauthorized tool invocation on canary.
Duplicate side-effect detection (same idempotency key, two writes).
p99 latency > 2× stable budget.
Cost per successful task > 1.4× stable median.

Soft hold triggers (pause ramp, alert owner)

Task success rate down > 3 percentage points (not stat-sig yet).
Human escalation rate up > 20% vs stable.
Shadow/canary trajectory divergence > 15% on golden intents.
User thumbs-down rate up (if collected).
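
Both trigger lists can be expressed as one evaluation function that returns a decision plus the reasons behind it. This is a sketch, not Harbor's gate automator: the dataclasses and field names are hypothetical, the inputs are assumed to be pre-aggregated by a metrics backend, and the optional thumbs-down signal is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Decision(Enum):
    CONTINUE = "continue"
    HOLD = "hold"          # pause the ramp and alert the owner
    ROLLBACK = "rollback"  # flip routing back to stable


@dataclass
class Metrics:
    task_success_rate: float  # fraction in [0, 1]
    escalation_rate: float    # fraction of runs handed to a human
    p99_latency_s: float
    cost_per_success: float   # for stable, read this as the median


@dataclass
class CanarySignals:
    error_breach_minutes: int    # consecutive minutes above stable + 2 sigma
    guardrail_bypasses: int      # includes unauthorized tool invocations
    duplicate_side_effects: int  # same idempotency key written twice
    golden_divergence: float     # fraction of golden intents that diverged


def evaluate(
    canary: Metrics, stable: Metrics, signals: CanarySignals, latency_budget_s: float
) -> Tuple[Decision, List[str]]:
    hard = []
    if signals.error_breach_minutes >= 15:
        hard.append("error rate above stable + 2 sigma for 15 min")
    if signals.guardrail_bypasses > 0:
        hard.append("guardrail bypass or unauthorized tool call")
    if signals.duplicate_side_effects > 0:
        hard.append("duplicate side effect")
    if canary.p99_latency_s > 2 * latency_budget_s:
        hard.append("p99 latency above 2x budget")
    if canary.cost_per_success > 1.4 * stable.cost_per_success:
        hard.append("cost per success above 1.4x stable")
    if hard:
        return Decision.ROLLBACK, hard

    soft = []
    if stable.task_success_rate - canary.task_success_rate > 0.03:
        soft.append("task success down more than 3 points")
    if canary.escalation_rate > 1.2 * stable.escalation_rate:
        soft.append("escalation rate up more than 20%")
    if signals.golden_divergence > 0.15:
        soft.append("golden-intent divergence above 15%")
    return (Decision.HOLD, soft) if soft else (Decision.CONTINUE, [])


stable = Metrics(0.94, 0.05, 8.0, 0.10)
clean = CanarySignals(0, 0, 0, 0.05)
decision, reasons = evaluate(Metrics(0.71, 0.05, 9.0, 0.11), stable, clean, 10.0)
```

Hard triggers are checked first and short-circuit the soft ones, so a canary that both duplicates a credit and loses accuracy is rolled back rather than merely paused.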

Rollback must be routing-only when possible: flip the traffic split flag, do not redeploy. In-flight runs started on canary should either complete on canary (if safe) or resume on stable from the last compatible checkpoint. Harbor’s Monday rollback took four hours because v2 checkpoints referenced a tool schema that the stable binary could not parse; the fix added forward-compatible state envelopes with schema version fields.
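
A state envelope can be as small as a schema version wrapped around the checkpoint payload. The sketch below assumes JSON checkpoints and integer schema versions. The function names and envelope keys are hypothetical, not Harbor's actual format.

```python
import json
from typing import Dict, List, Optional, Set


def wrap_checkpoint(state: Dict, schema_version: int, release: str) -> str:
    """Serialize a checkpoint inside a versioned envelope."""
    envelope = {"schema_version": schema_version, "agent_release": release, "state": state}
    return json.dumps(envelope)


def readable_by(raw: str, supported_versions: Set[int]) -> bool:
    """Check the envelope header without interpreting the payload."""
    envelope = json.loads(raw)
    return envelope.get("schema_version") in supported_versions


def last_compatible(checkpoints: List[str], supported_versions: Set[int]) -> Optional[Dict]:
    """Given checkpoints ordered oldest to newest, return the newest readable one."""
    for raw in reversed(checkpoints):
        if readable_by(raw, supported_versions):
            return json.loads(raw)
    return None  # nothing readable: restart the run or let it finish on canary


# A run whose second checkpoint was written with a newer tool schema.
history = [
    wrap_checkpoint({"step": 1}, schema_version=1, release="stable"),
    wrap_checkpoint({"step": 2, "pending_tool": "issue_credit"}, schema_version=2, release="canary"),
]
resume_point = last_compatible(history, supported_versions={1})
```

The stable binary only needs to read the envelope header to decide whether it can resume, so an unknown payload shape leads to a clean fallback to step 1 instead of a parse failure in the middle of a rollback.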

Metrics that matter for agent canaries

HTTP error rate is necessary but insufficient. Track agent-specific signals:

Task completion rate — run reached terminal success state without human takeover.
Tool success rate — per-tool error and timeout rates; catch schema mismatches early.
Trajectory length — step count and token usage; spikes often mean replanning loops.
Side-effect correctness — sample audit of writes (refund amount, ticket status) against policy.
Escalation rate — HITL queue volume; rising escalations signal quality regression before hard failures.
User-visible latency — time to first token and time to task resolution.

Wire these into the same dashboards as offline eval suites so promotion criteria are defined before deploy day, not argued about during an incident.
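
Several of these signals fall out of a simple aggregation over run records tagged with the release fingerprint. The record shape below is an assumption for illustration. Field names such as human_takeover and tool_errors are hypothetical and would map to whatever the tracing backend stores.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    release: str
    succeeded: bool       # reached a terminal success state
    human_takeover: bool  # a human finished the task
    escalated: bool       # entered the HITL queue at any point
    steps: int
    tool_calls: int
    tool_errors: int


def summarize(runs: List[RunRecord], release: str) -> Dict[str, float]:
    """Aggregate agent-level metrics for one release."""
    mine = [r for r in runs if r.release == release]
    if not mine:
        raise ValueError(f"no runs recorded for {release}")
    total_calls = sum(r.tool_calls for r in mine)
    total_errors = sum(r.tool_errors for r in mine)
    completed = sum(r.succeeded and not r.human_takeover for r in mine)
    return {
        "runs": len(mine),
        "task_completion_rate": completed / len(mine),
        "tool_success_rate": (total_calls - total_errors) / total_calls if total_calls else 1.0,
        "escalation_rate": sum(r.escalated for r in mine) / len(mine),
        "mean_steps": sum(r.steps for r in mine) / len(mine),
    }


runs = [
    RunRecord("canary", True, False, False, 4, 3, 0),
    RunRecord("canary", True, False, False, 6, 4, 1),
    RunRecord("canary", True, True, True, 9, 2, 0),   # success only via takeover
    RunRecord("canary", False, False, True, 5, 1, 0),
    RunRecord("stable", True, False, False, 4, 3, 0),
]
canary_summary = summarize(runs, "canary")
```

Note that a run rescued by a human counts as a success for the user but not for the agent, which is why task completion here excludes takeovers.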

Harbor Support refactor walkthrough

Harbor’s platform team replaced big-bang deploys with a four-layer release pipeline:

Release registry — immutable artifacts with fingerprint; CI blocks promote if offline eval regression suite fails.
Shadow worker pool — dedicated queue with read-only tool adapters; 10% sample by default, 100% for 48 h before major model changes.
Split controller — session-hash routing with tenant overrides; integrates with existing feature-flag service.
Gate automator — Prometheus queries + custom SQL audits on refund ledger; auto-rollback on duplicate-credit detection.
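
The duplicate-credit audit in the gate automator could look like the query below. The guide does not describe Harbor's ledger schema, so the refund_ledger table and its columns are hypothetical, and SQLite stands in for whatever database holds the real ledger.

```python
import sqlite3
from typing import List, Tuple

# Any idempotency key written more than once by a release is a duplicate credit.
DUPLICATE_CREDITS_SQL = """
SELECT idempotency_key, COUNT(*) AS writes
FROM refund_ledger
WHERE agent_release = ?
GROUP BY idempotency_key
HAVING COUNT(*) > 1
ORDER BY idempotency_key
"""


def find_duplicate_credits(conn: sqlite3.Connection, release: str) -> List[Tuple[str, int]]:
    return conn.execute(DUPLICATE_CREDITS_SQL, (release,)).fetchall()


# In-memory stand-in for the ledger, with one duplicated key on the canary.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refund_ledger (idempotency_key TEXT, agent_release TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO refund_ledger VALUES (?, ?, ?)",
    [
        ("k1", "canary", 20.0),
        ("k2", "canary", 35.0),
        ("k2", "canary", 35.0),
        ("k3", "stable", 10.0),
    ],
)
duplicates = find_duplicate_credits(conn, "canary")
should_rollback = len(duplicates) > 0
```

A unique constraint on the idempotency key would prevent the duplicate write in the first place. The audit still earns its place as a detector for tools or ledgers where that constraint is missing.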

Outcomes: bad deploy rate 12% → 0.8%; rollback time 4 h → under 90 s; shadow cost capped at 8% of total inference spend via sampling and off-peak replay of stored traces.

Technique decision table

Approach	Best for	Weak when
Big-bang deploy	Early prototypes, zero side-effect agents	Any write tool or regulated workflow
Shadow only	High-risk tools; pre-release model swaps	Cannot validate real latency or user satisfaction
Canary only (no shadow)	Low-QPS agents with cheap rollback	First sight of divergence is user-facing
Shadow then canary	Production agents with financial or PII tools	Higher infra cost and pipeline complexity
Tenant allowlist canary	B2B with design-partner tenants	Sample may not represent full fleet
Blue-green (100% swap)	Stateless chat with no tools	Long-running workflows and checkpoint mismatch

Common pitfalls

Shadow tools that lie — stubs returning instant success hide timeout-driven replans; inject realistic latency from stable traces.
Non-sticky canary routing — same session hitting v2 then v3 corrupts memory; hash on session ID.
Promoting on volume, not quality — 10,000 runs with 71% accuracy is worse than 500 runs at 94%; set minimum quality floors.
Ignoring checkpoint compatibility — rollback bricks in-flight runs; version state envelopes from day one.
Canary during prompt injection spike — abnormal traffic poisons comparisons; gate on traffic anomaly detection.
Shadow without tenant isolation — cross-tenant trace leakage in shared shadow pools; namespace queues per tenant tier.
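
For the first pitfall, a stub can replay latencies sampled from stable traces so that timeout-driven replanning still shows up in shadow runs. This is a sketch under assumptions: ReplayLatencyStub and its parameters are hypothetical, the latency values are invented, and the sleep function is injectable so the example runs without actually waiting.

```python
import random
import time
from typing import Callable, Dict, List, Optional


class ReplayLatencyStub:
    """Stubbed write tool that behaves like the real one in time, not in effect."""

    def __init__(
        self,
        observed_latencies_s: List[float],
        timeout_s: float,
        sleep: Callable[[float], None] = time.sleep,
        rng: Optional[random.Random] = None,
    ):
        self.observed = observed_latencies_s  # sampled from stable traces
        self.timeout_s = timeout_s
        self._sleep = sleep
        self._rng = rng or random.Random()

    def call(self, **kwargs) -> Dict:
        latency = self._rng.choice(self.observed)
        if latency > self.timeout_s:
            # Reproduce the timeout the agent would have hit in production.
            self._sleep(self.timeout_s)
            raise TimeoutError(f"stubbed tool exceeded {self.timeout_s}s")
        self._sleep(latency)
        return {"status": "ok", "synthetic": True}


# Record the requested sleeps instead of waiting, for a fast demonstration.
slept: List[float] = []
fast_stub = ReplayLatencyStub([0.2], timeout_s=1.0, sleep=slept.append)
fast_result = fast_stub.call(order_id="o-123", amount=40.0)

slow_stub = ReplayLatencyStub([5.0], timeout_s=1.0, sleep=slept.append)
try:
    slow_stub.call(order_id="o-124", amount=15.0)
    timed_out = False
except TimeoutError:
    timed_out = True
```

With the default time.sleep, the stub holds the shadow worker for the sampled duration, which is the point: the candidate sees the same slow tool it would see in production and has the same chance to replan.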

Production checklist

Define immutable release fingerprint (model + prompt + tools + runtime).
Block promote if offline eval regression suite fails in CI.
Implement shadow queue with read-only or stubbed write tools.
Sample shadow traffic; shed shadow before stable under load.
Configure session-sticky canary routing with tenant overrides.
Write hard rollback triggers (error rate, duplicate writes, guardrail bypass).
Write soft hold triggers (escalation rate, success rate delta).
Ensure rollback is routing flip, not redeploy, with warm previous version.
Version checkpoint envelopes for forward-compatible resume.
Dashboard canary vs stable on task success, tool errors, cost, latency.
Document ramp schedule and freeze windows; require human sign-off above 25%.
Run game-day drill: inject bad canary, verify auto-rollback under 2 minutes.

Key takeaways

Shadow traffic tests candidates on real inputs without user risk — essential before live canary.
Canary splits bound blast radius — ramp 1% → 5% → 25% → 100% with automated gates.
Agent rollouts need agent metrics — task success, tool errors, escalations, not just HTTP 5xx.
Rollback must be routing-fast and checkpoint-safe — Harbor cut rollback from 4 hours to under 90 seconds.
Pair with eval, tracing, and feature flags — release plumbing is not optional for tool-using agents.