LLM agent canary deployment and shadow traffic systems explained
Harbor Support shipped a new refund-handling agent on a Friday afternoon: updated system prompt, a cheaper model tier, and a rewritten issue_credit tool schema. The deploy script flipped 100% of production traffic in one step. By Monday, refund approval accuracy had fallen from 94% to 71%, duplicate credits hit 3.2% of approved runs, and mean time to resolution stretched 18 minutes. Rollback took four hours because checkpoint state from the new version was incompatible with the old binary. After Harbor rebuilt release plumbing around shadow replay, staged canary splits, and automated promotion gates, the share of deploys causing user-visible regressions dropped from 12% to 0.8% and rollback time fell to under 90 seconds.
Canary deployment routes a small slice of real traffic to a new agent version while the stable version serves everyone else. Shadow traffic runs the candidate against the same inputs as production but discards its outputs — you compare metrics without risking user-facing side effects. Together they form the release layer between offline agent evaluation and full cutover. This guide covers version identity, traffic routing, shadow execution architecture, promotion and rollback gates, metric selection, Harbor’s refactor, a technique decision table, pitfalls, and a production checklist.
Why agents need a different rollout model
Stateless API swaps are easy: flip a load balancer, watch error rates, roll back. Agents are stateful, tool-using, and non-deterministic. A new prompt can pass every offline golden test yet fail on live edge cases. Tool schema changes break idempotent replay. Model swaps alter token budgets and truncation behavior downstream. Multi-step runs span minutes — a mid-run deploy can leave half the conversation on v2 and half on v3.
Big-bang releases optimize for shipping speed until the first bad Friday. Canary and shadow systems trade a little latency and infrastructure cost for bounded blast radius and evidence-based promotion. They pair naturally with feature flags (per-tenant or per-task-class toggles) and distributed tracing so you can attribute metric deltas to a specific agent build.
Version identity and the release artifact
Every deployable agent build needs an immutable release fingerprint propagated through routing, logs, and traces:
- Model route — provider, model ID, temperature, max tokens, and fallback ladder slot.
- Prompt bundle hash — system prompt, tool descriptions, few-shot examples, and policy snippets versioned together.
- Tool manifest revision — JSON Schema per tool, approval gates, and sandbox profile ID.
- Middleware stack version — ordered hooks for guardrails, PII scrubbing, rate limits, and cost caps.
- Runtime binary — orchestrator code that executes the agent loop; must be backward compatible with in-flight checkpoints.
Store the fingerprint on every span and run record. Promotion decisions compare agent_release=v2026.06.12-a3f9 against agent_release=v2026.06.05-b1c2, not “the new model.” In multi-tenant setups, bind releases per tenant via tenant context so a canary for Tenant A never leaks into Tenant B.
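One way to make the fingerprint concrete is to hash the five components together. The sketch below is illustrative only: the `AgentRelease` class, its field names, and the 12-character digest are assumptions, not a format the guide prescribes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentRelease:
    """Hypothetical release artifact; field names are assumptions."""
    model_route: str         # provider, model ID, sampling settings, fallback slot
    prompt_bundle_hash: str  # system prompt, tool descriptions, few-shots, policies
    tool_manifest_rev: str   # revision of the per-tool JSON Schemas
    middleware_version: str  # ordered guardrail, PII, rate-limit, cost-cap hooks
    runtime_version: str     # orchestrator code that runs the agent loop

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so identical
        # components always produce the same digest.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def span_attributes(release: AgentRelease, tenant_id: str) -> dict:
    # Attributes to attach to every span and run record.
    return {"agent_release": release.fingerprint(), "tenant_id": tenant_id}
```

Because the dataclass is frozen and the digest covers every field, changing any single component (for example the prompt bundle) yields a different `agent_release` value in traces.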
Shadow traffic architecture
Shadow mode executes the candidate agent on a copy of production inputs without committing side effects:
- Tap ingress: after authentication and policy checks, fork the normalized request (user message, session state pointer, tool context) to a shadow queue.
- Sandbox tool layer: shadow runs use stubbed or read-only tool adapters. issue_credit returns a synthetic success without hitting the ledger; database tools run against snapshots or read replicas.
- Parallel execution: stable version serves the user; shadow version runs asynchronously with its own timeout budget (often 1.5× stable latency cap).
- Diff capture: record trajectory divergence: tool call sequences, final structured outputs, token cost, guardrail triggers, and human-review escalations.
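The four pieces above can be sketched in a few lines. This is a simplified, synchronous illustration: a real shadow run would go through a queue and execute asynchronously, and `Trajectory`, `handle_request`, and `stub_issue_credit` are hypothetical names, not part of any specific framework.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    tool_calls: list = field(default_factory=list)  # ordered tool names
    output: dict = field(default_factory=dict)      # final structured output


def stub_issue_credit(args: dict) -> dict:
    # Shadow adapter: synthetic success, never touches the ledger.
    return {"status": "ok", "synthetic": True}


def diff_trajectories(stable: Trajectory, shadow: Trajectory) -> dict:
    # Minimal divergence record; token cost, guardrail triggers, and
    # escalations could be compared the same way.
    return {
        "tool_sequence_match": stable.tool_calls == shadow.tool_calls,
        "output_match": stable.output == shadow.output,
    }


def handle_request(request, stable_agent, shadow_agent,
                   live_tools, shadow_tools, diff_log):
    # The stable version serves the user with real tools.
    stable = stable_agent(request, live_tools)
    try:
        # The candidate sees a copy of the input and only stubbed tools.
        shadow = shadow_agent(dict(request), shadow_tools)
        diff_log.append(diff_trajectories(stable, shadow))
    except Exception as exc:
        # A shadow failure is recorded, never surfaced to the user.
        diff_log.append({"shadow_error": repr(exc)})
    return stable.output
```

The key property is that only `stable.output` is ever returned, so a crash or a bad decision in the candidate cannot reach the user or the ledger.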
What shadow mode catches that offline eval misses
- Live tool latency causing different replanning paths.
- Truncated observations from oversized API responses.
- Rate-limit interactions between concurrent tenant workloads.
- Prompt injection attempts present only in production logs.
- Checkpoint resume behavior after mid-run restarts.
Shadow is not free: you pay duplicate inference and queue depth. Cap shadow sampling at 5–20% of traffic for high-volume agents, or run 100% shadow for low-QPS critical paths during a release window. Drop shadow events under backpressure before dropping stable traffic.
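A minimal sketch of sampling plus load shedding, assuming a hash-based sampler and an in-process bounded queue. `in_shadow_sample` and `ShadowQueue` are hypothetical helpers; a production system would more likely use a message broker with its own backpressure controls.

```python
import hashlib
import queue


def in_shadow_sample(session_id: str, sample_rate: float) -> bool:
    # Deterministic sampling: map the session ID to one of 10,000 buckets
    # so the same session is always in or out of the sample.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(sample_rate * 10_000)


class ShadowQueue:
    """Bounded queue that drops shadow work instead of blocking ingress."""

    def __init__(self, max_depth: int):
        self._q = queue.Queue(maxsize=max_depth)
        self.dropped = 0

    def offer(self, request: dict) -> bool:
        try:
            self._q.put_nowait(request)  # never blocks the stable path
            return True
        except queue.Full:
            self.dropped += 1            # shed shadow load under backpressure
            return False

    def depth(self) -> int:
        return self._q.qsize()
```

Tracking `dropped` matters: a shadow comparison built on a heavily shed sample may no longer represent production traffic.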
Canary traffic splits
Once shadow metrics look acceptable, promote to live canary — real users, real side effects, small slice:
Routing dimensions
| Split key | Use when | Risk |
|---|---|---|
| Random session ID hash | Default for homogeneous traffic | Low; sticky per session |
| Tenant ID allowlist | Design partners or internal dogfood | Low; explicit opt-in |
| Task class / intent | High-risk tools isolated | Medium; uneven load |
| Geography / region | Regulatory or latency testing | Medium; skewed demographics |
| User cohort (new vs returning) | Onboarding flows | High; metric comparability suffers |
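The first two rows of the table (session-hash split and tenant allowlist) can be combined in one routing function. The sketch below is an assumption-laden illustration: `route`, its parameters, and the salt value are hypothetical, and `canary_percent` is expressed in percent with 0.01% resolution.

```python
import hashlib
from typing import Optional


def route(session_id: str, tenant_id: str, canary_percent: float,
          tenant_allowlist: Optional[set] = None,
          salt: str = "release-salt") -> str:
    # Tenants outside the allowlist (when one is set) always get stable.
    if tenant_allowlist is not None and tenant_id not in tenant_allowlist:
        return "stable"
    # Hash the session ID so every request in a session lands on the
    # same version (sticky routing).
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < round(canary_percent * 100) else "stable"
```

With a fixed salt, raising the percentage only adds sessions to the canary: a session routed to canary at 1% stays on canary at 5%, which keeps in-flight conversations from flipping versions during a ramp.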
Staged ramp schedule
A typical promotion ladder for a financial agent like Harbor’s:
- Stage 0 — 100% shadow, 0% live canary (24–72 h).
- Stage 1 — 1% live canary, internal tenants only.
- Stage 2 — 5% random session hash, all tenants.
- Stage 3 — 25% with automated gate checks every 4 h.
- Stage 4 — 100% stable; keep previous version warm for 48 h rollback.
Never ramp on Fridays or before holidays unless on-call coverage is explicit. Freeze ramps when upstream model providers announce maintenance.
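The ladder and the freeze rules can be written down as data plus a small guard function, so the schedule is reviewed like any other config. This is a sketch under assumptions: `Stage`, `LADDER`, and `ramp_allowed` are hypothetical names, "before holidays" is read as the day immediately preceding a holiday, and the human-approval rule is the "above 25%" threshold from the gates section.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    canary_percent: int
    internal_only: bool = False


LADDER = [
    Stage("stage-0-shadow", 0),                # 100% shadow, no live canary
    Stage("stage-1", 1, internal_only=True),   # internal tenants only
    Stage("stage-2", 5),
    Stage("stage-3", 25),
    Stage("stage-4", 100),
]


def next_stage(current: Stage) -> Optional[Stage]:
    idx = LADDER.index(current)
    return LADDER[idx + 1] if idx + 1 < len(LADDER) else None


def needs_human_approval(stage: Stage) -> bool:
    return stage.canary_percent > 25


def ramp_allowed(today: date, holidays: set, on_call_confirmed: bool,
                 provider_maintenance: bool) -> bool:
    # Provider maintenance freezes ramps unconditionally.
    if provider_maintenance:
        return False
    # Fridays and the day before a holiday need explicit on-call coverage.
    risky_day = today.weekday() == 4 or (today + timedelta(days=1)) in holidays
    return on_call_confirmed or not risky_day
```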
Promotion gates and rollback triggers
Automated gates compare canary vs stable over a minimum sample size (e.g. 500 completed runs or 10,000 tool calls). Human approval is required for stages above 25%.
Hard rollback triggers (auto-revert)
- Error rate > stable + 2σ for 15 consecutive minutes.
- Any guardrail bypass or unauthorized tool invocation on canary.
- Duplicate side-effect detection (same idempotency key, two writes).
- p99 latency > 2× stable budget.
- Cost per successful task > 1.4× stable median.
Soft hold triggers (pause ramp, alert owner)
- Task success rate down > 3 percentage points (not stat-sig yet).
- Human escalation rate up > 20% vs stable.
- Shadow/canary trajectory divergence > 15% on golden intents.
- User thumbs-down rate up (if collected).
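The hard and soft triggers above map naturally onto a single gate function. The sketch below makes several assumptions: `WindowStats` and `evaluate_gate` are hypothetical names, the stats passed in already cover a window in which the breach has persisted (the 15-minute rule is left to the caller), `stable.cost_per_success` stands in for the stable median, and the thumbs-down and trajectory-divergence triggers are omitted for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    HOLD = "hold"
    ROLLBACK = "rollback"


@dataclass
class WindowStats:
    completed_runs: int
    error_rate: float         # fraction of runs that errored
    p99_latency_s: float
    cost_per_success: float
    task_success_rate: float  # fraction, e.g. 0.94
    escalation_rate: float    # fraction of runs escalated to a human
    guardrail_bypasses: int = 0
    duplicate_writes: int = 0


def evaluate_gate(canary: WindowStats, stable: WindowStats,
                  error_sigma: float, latency_budget_s: float,
                  min_runs: int = 500):
    reasons = []
    # Hard triggers fire regardless of sample size.
    if canary.guardrail_bypasses > 0:
        reasons.append("guardrail_bypass")
    if canary.duplicate_writes > 0:
        reasons.append("duplicate_write")
    if canary.error_rate > stable.error_rate + 2 * error_sigma:
        reasons.append("error_rate")
    if canary.p99_latency_s > 2 * latency_budget_s:
        reasons.append("p99_latency")
    if canary.cost_per_success > 1.4 * stable.cost_per_success:
        reasons.append("cost")
    if reasons:
        return Decision.ROLLBACK, reasons
    # Soft triggers pause the ramp and alert the owner.
    if canary.completed_runs < min_runs:
        reasons.append("insufficient_sample")
    if stable.task_success_rate - canary.task_success_rate > 0.03:
        reasons.append("task_success_drop")
    if canary.escalation_rate > 1.2 * stable.escalation_rate:
        reasons.append("escalation_rise")
    if reasons:
        return Decision.HOLD, reasons
    return Decision.CONTINUE, []
```

Checking hard triggers before the sample-size floor is a deliberate choice in this sketch: a single duplicate write or guardrail bypass should revert the canary even if only a handful of runs have completed.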
Rollback must be routing-only when possible: flip the traffic split flag, do not redeploy. In-flight runs started on canary should either complete on canary (if safe) or resume on stable from the last compatible checkpoint. Harbor’s rollback took four hours because v2 checkpoints referenced a tool schema that stable could not parse; the fix added forward-compatible state envelopes with schema version fields.
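A versioned envelope can be as simple as a JSON wrapper whose metadata is readable without parsing the state payload. The sketch below is an assumption about what such an envelope might look like, not Harbor's actual format; `wrap_checkpoint`, `last_compatible_checkpoint`, and `rollback_plan` are hypothetical names.

```python
import json
from typing import Optional


def wrap_checkpoint(state: dict, schema_version: int, release: str) -> str:
    # Version metadata lives outside the payload so a reader can decide
    # compatibility before touching the state itself.
    return json.dumps({
        "schema_version": schema_version,
        "agent_release": release,
        "state": state,
    })


def last_compatible_checkpoint(envelopes: list,
                               supported_versions: set) -> Optional[dict]:
    # Walk from newest to oldest; return the first one stable can parse.
    for raw in reversed(envelopes):
        envelope = json.loads(raw)
        if envelope.get("schema_version") in supported_versions:
            return envelope
    return None


def rollback_plan(envelopes: list, stable_supported: set) -> dict:
    checkpoint = last_compatible_checkpoint(envelopes, stable_supported)
    if checkpoint is None:
        # Nothing stable can read: let the run finish on canary if safe.
        return {"action": "complete_on_canary", "state": None}
    return {"action": "resume_on_stable", "state": checkpoint["state"]}
```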
Metrics that matter for agent canaries
HTTP error rate is necessary but insufficient. Track agent-specific signals:
- Task completion rate — run reached terminal success state without human takeover.
- Tool success rate — per-tool error and timeout rates; catch schema mismatches early.
- Trajectory length — step count and token usage; spikes often mean replanning loops.
- Side-effect correctness — sample audit of writes (refund amount, ticket status) against policy.
- Escalation rate — HITL queue volume; rising escalations signal quality regression before hard failures.
- User-visible latency — time to first token and time to task resolution.
Wire these into the same dashboards as offline eval suites so promotion criteria are defined before deploy day, not argued about during an incident.
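As an illustration of how several of these signals fall out of plain run records, the sketch below groups runs by release and computes completion, escalation, trajectory length, and per-tool error rates. The record shape and the `summarize_runs` name are assumptions made for the example.

```python
from collections import defaultdict


def summarize_runs(runs: list) -> dict:
    # Assumed record shape:
    # {"release": str, "status": "success" | "failed" | "escalated",
    #  "steps": int, "tool_calls": [{"tool": str, "ok": bool}, ...]}
    by_release = defaultdict(list)
    for run in runs:
        by_release[run["release"]].append(run)

    summary = {}
    for release, group in by_release.items():
        tool_total = defaultdict(int)
        tool_errors = defaultdict(int)
        for run in group:
            for call in run["tool_calls"]:
                tool_total[call["tool"]] += 1
                if not call["ok"]:
                    tool_errors[call["tool"]] += 1
        n = len(group)
        summary[release] = {
            "task_completion_rate": sum(r["status"] == "success" for r in group) / n,
            "escalation_rate": sum(r["status"] == "escalated" for r in group) / n,
            "mean_steps": sum(r["steps"] for r in group) / n,
            "tool_error_rate": {
                tool: tool_errors[tool] / total
                for tool, total in tool_total.items()
            },
        }
    return summary
```

Keying every metric by release is what lets a dashboard show canary and stable side by side rather than a blended fleet average.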
Harbor Support refactor walkthrough
Harbor’s platform team replaced big-bang deploys with a four-layer release pipeline:
- Release registry — immutable artifacts with fingerprint; CI blocks promote if offline eval regression suite fails.
- Shadow worker pool — dedicated queue with read-only tool adapters; 10% sample by default, 100% for 48 h before major model changes.
- Split controller — session-hash routing with tenant overrides; integrates with existing feature-flag service.
- Gate automator — Prometheus queries + custom SQL audits on refund ledger; auto-rollback on duplicate-credit detection.
Outcomes: bad deploy rate 12% → 0.8%; rollback time 4 h → 90 s; shadow cost capped at 8% of total inference spend via sampling and off-peak replay of stored traces.
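Harbor's audits are described as SQL against the refund ledger; the same duplicate-credit check can be sketched in Python over ledger rows. This is not Harbor's query: the row shape, `find_duplicate_writes`, and `duplicate_rate` are assumptions for illustration.

```python
from collections import Counter


def find_duplicate_writes(ledger_rows: list) -> list:
    # Assumed row shape: {"idempotency_key": str, "amount": float}.
    # Any key that appears on more than one write is a duplicate side effect.
    counts = Counter(row["idempotency_key"] for row in ledger_rows)
    return sorted(key for key, n in counts.items() if n > 1)


def duplicate_rate(ledger_rows: list) -> float:
    # Share of distinct idempotency keys that were written more than once.
    keys = {row["idempotency_key"] for row in ledger_rows}
    if not keys:
        return 0.0
    return len(find_duplicate_writes(ledger_rows)) / len(keys)
```

A non-empty result from `find_duplicate_writes` on canary rows would correspond to the hard rollback trigger for duplicate side effects.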
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Big-bang deploy | Early prototypes, zero side-effect agents | Any write tool or regulated workflow |
| Shadow only | High-risk tools; pre-release model swaps | Cannot validate real latency or user satisfaction |
| Canary only (no shadow) | Low-QPS agents with cheap rollback | First sight of divergence is user-facing |
| Shadow then canary | Production agents with financial or PII tools | Higher infra cost and pipeline complexity |
| Tenant allowlist canary | B2B with design-partner tenants | Sample may not represent full fleet |
| Blue-green (100% swap) | Stateless chat with no tools | Long-running workflows and checkpoint mismatch |
Common pitfalls
- Shadow tools that lie — stubs returning instant success hide timeout-driven replans; inject realistic latency from stable traces.
- Non-sticky canary routing — same session hitting v2 then v3 corrupts memory; hash on session ID.
- Promoting on volume, not quality — 10,000 runs with 71% accuracy is worse than 500 runs at 94%; set minimum quality floors.
- Ignoring checkpoint compatibility — rollback bricks in-flight runs; version state envelopes from day one.
- Canary during prompt injection spike — abnormal traffic poisons comparisons; gate on traffic anomaly detection.
- Shadow without tenant isolation — cross-tenant trace leakage in shared shadow pools; namespace queues per tenant tier.
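For the first pitfall, a stub can replay latencies recorded from stable traces instead of returning instantly. The sketch below is one possible shape, with hypothetical names (`ReplayLatencyStub`) and an injectable `sleep` so it can be tested without waiting; it samples uniformly from the recorded values, which is a simplification.

```python
import random
import time


class ReplayLatencyStub:
    """Shadow tool stub that mimics recorded latency and timeouts."""

    def __init__(self, recorded_latencies_s, timeout_s, rng=None, sleep=time.sleep):
        if not recorded_latencies_s:
            raise ValueError("need at least one recorded latency")
        self._latencies = list(recorded_latencies_s)
        self._timeout_s = timeout_s
        self._rng = rng or random.Random()
        self._sleep = sleep

    def __call__(self, args: dict) -> dict:
        # Draw a latency observed on the stable version.
        latency = self._rng.choice(self._latencies)
        if latency > self._timeout_s:
            # Simulate the timeout the agent would have seen in production.
            self._sleep(self._timeout_s)
            return {"status": "timeout", "synthetic": True}
        self._sleep(latency)
        return {"status": "ok", "synthetic": True}
```

Because slow samples surface as timeouts, the candidate's replanning behavior under tool delays shows up in shadow diffs instead of first appearing in a live canary.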
Production checklist
- Define immutable release fingerprint (model + prompt + tools + runtime).
- Block promote if offline eval regression suite fails in CI.
- Implement shadow queue with read-only or stubbed write tools.
- Sample shadow traffic; shed shadow before stable under load.
- Configure session-sticky canary routing with tenant overrides.
- Write hard rollback triggers (error rate, duplicate writes, guardrail bypass).
- Write soft hold triggers (escalation rate, success rate delta).
- Ensure rollback is routing flip, not redeploy, with warm previous version.
- Version checkpoint envelopes for forward-compatible resume.
- Dashboard canary vs stable on task success, tool errors, cost, latency.
- Document ramp schedule and freeze windows; require human sign-off above 25%.
- Run game-day drill: inject bad canary, verify auto-rollback under 2 minutes.
Key takeaways
- Shadow traffic tests candidates on real inputs without user risk — essential before live canary.
- Canary splits bound blast radius — ramp 1% → 5% → 25% → 100% with automated gates.
- Agent rollouts need agent metrics — task success, tool errors, escalations, not just HTTP 5xx.
- Rollback must be routing-fast and checkpoint-safe — Harbor cut rollback from 4 hours to 90 seconds.
- Pair with eval, tracing, and feature flags — release plumbing is not optional for tool-using agents.