
LLM canary and shadow deployment explained

Harbor Support swapped its ticket-triage model on a Friday evening after a strong offline benchmark. By Monday, first-contact resolution had dropped 14%, average handle time had risen 22%, and three policy-violating auto-replies had reached customers before anyone noticed. The new model was not “worse” in aggregate; it failed on long, multi-intent threads that the eval set barely sampled. Rollback took 40 minutes because prompts, retrieval indexes, and routing weights were bundled in one opaque release artifact. The rebuild introduced staged rollouts: seven days of shadow inference on 100% of traffic, then a 2% canary with automated promotion gates tied to resolution rate and safety flags. The next model upgrade shipped with zero customer-visible incidents.

Canary deployment routes a small slice of live traffic to a candidate stack while the majority stays on production. Shadow deployment runs the candidate in parallel without serving its output to users, comparing quality, latency, and cost offline. Together with blue-green cutovers and versioned prompts, these patterns limit blast radius when models, prompts, tools, or retrieval indexes change. This guide covers traffic splitting, promotion gates, observability hooks, ties to offline evaluation and uncertainty routing, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why LLM rollouts are not ordinary software deploys

Traditional services fail loudly: exceptions, 500s, crashed pods. LLM stacks often fail quietly: plausible wrong answers, subtle tone drift, higher token burn, or retrieval regressions that benchmarks miss. A release bundle typically includes more than weights:

Model checkpoint — provider version, quantization, context length, tool-calling behavior.
System and task prompts — instruction changes move metrics as much as model swaps.
Retrieval and tools — embedding model, chunking, reranker, and MCP servers are part of the user-visible stack.
Decoding and routing — temperature, max tokens, model routing tiers, and guardrail pipelines.

Treat each axis as a versioned artifact with independent rollback. Bundling everything into “v2.3” makes incidents harder to diagnose and reversions slower than they need to be.
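One way to picture independent versioning is a release manifest with one pinned field per axis. This is a minimal sketch, not a prescribed schema: the class name, field names, and version strings below are all hypothetical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ReleaseManifest:
    """One pinned version per axis. All names here are illustrative."""
    model: str
    prompt_id: str
    retrieval_snapshot: str
    guardrail_config: str


def changed_axes(old: ReleaseManifest, new: ReleaseManifest) -> list[str]:
    """Return the names of the axes that differ between two manifests."""
    return [
        f.name for f in fields(old)
        if getattr(old, f.name) != getattr(new, f.name)
    ]


production = ReleaseManifest(
    model="triage-model@a",
    prompt_id="support-triage@2",
    retrieval_snapshot="kb-index@41",
    guardrail_config="guardrails@7",
)

# The candidate changes only the prompt, so a regression points at one axis.
candidate = replace(production, prompt_id="support-triage@3")

# Rolling back the prompt alone leaves the other three axes untouched.
rolled_back = replace(candidate, prompt_id=production.prompt_id)
```

Here `changed_axes(production, candidate)` returns `["prompt_id"]`, which is the kind of answer an opaque “v2.3” bundle cannot give during an incident.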

Shadow mode: observe before you expose

In shadow inference, production serves users from stack A while stack B processes duplicate requests asynchronously. Log both outputs, latency, token counts, retrieval hits, tool traces, and safety classifier scores. Users never see B until you promote it.
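The flow above can be sketched with `asyncio`: serve from A, fire B in the background, and log both. The `call_stack` function is a stand-in for a real inference client, and the log is a plain list; both are assumptions made for illustration.

```python
import asyncio
import time


async def call_stack(name: str, request: str) -> dict:
    """Stand-in for a real inference call; returns output plus timing."""
    start = time.perf_counter()
    await asyncio.sleep(0)  # placeholder for network and model latency
    return {
        "stack": name,
        "output": f"{name}:{request}",
        "latency_s": time.perf_counter() - start,
    }


async def shadow_call(request: str, log: list) -> None:
    """Run the candidate and log it; its failures never reach the user."""
    try:
        log.append(await call_stack("B", request))
    except Exception as exc:  # shadow errors are data, not user-facing faults
        log.append({"stack": "B", "error": repr(exc)})


async def handle(request: str, log: list, pending: set) -> str:
    """Serve from stack A and run stack B in the background."""
    task = asyncio.create_task(shadow_call(request, log))
    pending.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(pending.discard)
    primary = await call_stack("A", request)
    log.append(primary)
    return primary["output"]  # users only ever see A


async def main() -> tuple[str, list]:
    log, pending = [], set()
    reply = await handle("ticket-123", log, pending)
    await asyncio.gather(*pending)  # drain shadow work before exit
    return reply, log
```

Running `asyncio.run(main())` returns A's reply together with a log that holds one entry per stack. A production version would also record token counts, retrieval hits, and tool traces in the same log entry.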

What to compare in shadow

Task quality — automated judges, rubric scores, structured field agreement, human spot audits on stratified samples.
Safety and policy — guardrail triggers, PII leaks, refusal rates on red-team probes replayed through shadow.
Operational cost — p50/p95 latency, tokens per request, GPU utilization, cache hit rate.
Downstream actions — for agents, compare tool-call sequences and whether shadow would have taken irreversible steps.

Shadow is ideal when mistakes are high stakes (finance, healthcare, public replies) or when you need a full week of live traffic shape before any canary. Shadowing doubles inference cost for every shadowed request, so sample by hash if budget is tight, but keep enough volume on edge cases (long threads, non-English, empty retrieval).
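Hash-based sampling that still keeps edge cases might look like the sketch below. The request fields (`turns`, `language`, `retrieved_chunks`) and the thresholds are hypothetical; substitute whatever your gateway actually records.

```python
import hashlib


def stable_bucket(key: str) -> int:
    """Map a key to a stable bucket in [0, 10000) using SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_edge_case(request: dict) -> bool:
    """Hypothetical edge-case flags; thresholds are illustrative."""
    return (
        request.get("turns", 0) >= 8                 # long thread
        or request.get("language", "en") != "en"     # non-English
        or request.get("retrieved_chunks", 1) == 0   # empty retrieval
    )


def should_shadow(request: dict, base_rate: float = 0.10) -> bool:
    """Shadow every edge case, plus a stable base_rate slice of the rest."""
    if is_edge_case(request):
        return True
    return stable_bucket(request["id"]) < base_rate * 10_000
```

Because edge cases are oversampled here, aggregate metrics computed from the shadow log need to be reweighted by stratum before they are compared with production-wide numbers.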

Canary deployment: controlled live exposure

After shadow looks acceptable, route canary traffic — often 1–5% initially — to stack B. Real users receive B's output; the rest stay on A. Implement splitting at the gateway with sticky session keys so a single conversation does not flip models mid-thread unless you intentionally reset context.
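Sticky splitting is commonly done by hashing the conversation key into a fixed number of buckets. The sketch below assumes a `conversation_id` string is available at the gateway; the salt value and function names are illustrative.

```python
import hashlib


def bucket(conversation_id: str, salt: str = "rollout-1") -> int:
    """Stable bucket in [0, 10000) for a conversation."""
    digest = hashlib.sha256(f"{salt}:{conversation_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def route(conversation_id: str, canary_percent: float) -> str:
    """Return 'B' for the canary slice and 'A' for everyone else."""
    return "B" if bucket(conversation_id) < canary_percent * 100 else "A"
```

Since the bucket depends only on the salt and the conversation key, every turn of a thread lands on the same stack. Raising `canary_percent` only adds buckets, so conversations already on B stay on B as the ramp widens. Changing the salt reshuffles the assignment for a fresh rollout.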

Promotion gates (automated hold / promote)

Define numeric gates before you start the canary, not after metrics move:

Guardrail metrics — auto-hold if safety violation rate, abstention spike, or human-override rate exceeds baseline by a fixed delta.
Quality proxies — resolution rate, thumbs-down rate, judge score, or calibrated uncertainty from confidence routing.
SLO metrics — p95 latency and error rate within budget; token cost per successful task not more than X% above control.
Minimum sample size — do not promote on 200 requests; pre-register N and maximum canary duration.
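The four gate types above can be expressed as one pre-registered config plus a pure decision function. This is a sketch under stated assumptions: the metric names and every threshold value are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    requests: int
    safety_violation_rate: float
    resolution_rate: float
    p95_latency_ms: float
    cost_per_success: float


@dataclass(frozen=True)
class Gates:
    """Pre-registered thresholds. Values are placeholders."""
    min_requests: int = 5_000
    max_safety_delta: float = 0.001      # absolute rise over control
    max_resolution_drop: float = 0.03    # relative to control
    max_p95_latency_ms: float = 2_500.0
    max_cost_increase: float = 0.10      # relative to control


def evaluate(control: Metrics, canary: Metrics, gates: Gates) -> tuple[str, list[str]]:
    """Return ('promote' | 'hold' | 'wait', reasons)."""
    breaches = []
    if canary.safety_violation_rate - control.safety_violation_rate > gates.max_safety_delta:
        breaches.append("safety")
    if canary.resolution_rate < control.resolution_rate * (1 - gates.max_resolution_drop):
        breaches.append("resolution")
    if canary.p95_latency_ms > gates.max_p95_latency_ms:
        breaches.append("latency")
    if canary.cost_per_success > control.cost_per_success * (1 + gates.max_cost_increase):
        breaches.append("cost")
    if breaches:
        return "hold", breaches          # a breach holds even on a small sample
    if canary.requests < gates.min_requests:
        return "wait", ["sample_size"]   # clean so far, but not enough data
    return "promote", []
```

Breaches are checked before sample size on purpose: a safety spike should hold the canary immediately, while a clean but small sample should only delay promotion.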

Ramp 2% → 10% → 50% → 100% only when each stage clears gates. Instant rollback means repointing traffic to A without redeploying — keep A's pods warm during canary.
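The ramp itself can be a small state function over the stage list. The decision strings below (`promote`, `rollback`, and anything else meaning "stay") are an assumed convention for this sketch.

```python
STAGES = [2, 10, 50, 100]  # percent of traffic on the candidate


def next_percent(current: int, decision: str) -> int:
    """Advance one stage on 'promote', drop to 0 on 'rollback', else stay."""
    if decision == "rollback":
        return 0  # repoint all traffic to stack A, no redeploy
    if decision == "promote":
        later = [stage for stage in STAGES if stage > current]
        return later[0] if later else current
    return current  # 'hold' or 'wait': freeze the split and keep observing
```

Whether a gate breach should freeze the split or trigger an immediate rollback is a policy choice; this sketch leaves that mapping to the caller.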

Blue-green and prompt versioning

Blue-green maintains two full environments. Green is idle until validated; a router switch moves 100% of traffic. For LLMs, blue-green pairs well with pinned provider-side model versions (e.g. gpt-4o-2024-08-06 vs. a later dated snapshot) and with pinned container images for self-hosted inference. The switch should be one config change, not a rebuild.
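A "one config change" switch can be modeled as a single pointer between two warm environments. The class below is an illustrative sketch; the environment names and URLs are hypothetical.

```python
from typing import Optional


class BlueGreenRouter:
    """Holds two warm environments and a single pointer to the live one."""

    def __init__(self, environments: dict[str, str], active: str) -> None:
        if active not in environments:
            raise ValueError(f"unknown environment: {active}")
        self.environments = environments
        self.active = active
        self.previous: Optional[str] = None

    def switch(self, target: str) -> None:
        """Cut 100% of traffic over by changing one pointer."""
        if target not in self.environments:
            raise ValueError(f"unknown environment: {target}")
        self.previous, self.active = self.active, target

    def rollback(self) -> None:
        """Return to the previously active environment, if there was one."""
        if self.previous is not None:
            self.switch(self.previous)

    def endpoint(self) -> str:
        """URL that the gateway should forward requests to right now."""
        return self.environments[self.active]
```

For example, `BlueGreenRouter({"blue": "http://blue.internal", "green": "http://green.internal"}, "blue")` serves from blue until `switch("green")` is called, and `rollback()` returns to blue without touching either environment.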

Prompt versioning stores system prompts in a registry with immutable IDs. Canary references prompt_id=support-triage@3 while production stays on @2. Same for retrieval index snapshots. That lets you roll back a bad prompt without touching weights.
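A registry with immutable IDs can be as simple as an append-only list per prompt name. This in-memory sketch is only meant to show the `name@N` contract; a real registry would persist versions, and the class and method names are assumptions.

```python
class PromptRegistry:
    """In-memory registry; versions are append-only and never overwritten."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}

    def publish(self, name: str, text: str) -> str:
        """Store a new version and return its immutable ID, e.g. 'name@3'."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        return f"{name}@{len(versions)}"

    def get(self, prompt_id: str) -> str:
        """Resolve 'name@N' to the exact text published as version N."""
        name, _, version = prompt_id.rpartition("@")
        index = int(version) - 1
        versions = self._versions.get(name, [])
        if not 0 <= index < len(versions):
            raise KeyError(prompt_id)
        return versions[index]
```

With this contract, production can keep resolving `support-triage@2` while the canary resolves `support-triage@3`, and rolling back a prompt means changing one ID in the release config.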

Harbor Support refactor: staged triage rollout

Harbor rebuilt releases around four artifacts: model alias, prompt ID, retrieval snapshot, and guardrail config. The rollout pipeline:

Offline regression — 1,800 labeled tickets plus safety suite; block if any hard fail.
Shadow (7 days) — 100% duplicate inference; nightly diff report on category labels, suggested macros, and escalation decisions vs production.
Canary 2% / 48h — sticky by ticket_id; hold if resolution rate drops >3% relative or policy flag rate rises.
Ramp — 10%, 25%, 100% with the same gates; rollback is router repoint under 60 seconds.
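The nightly diff report in the shadow step could be computed roughly as follows. This is a sketch, not Harbor's actual job: the field names (`category`, `macro`, `escalate`) and the paired-record shape are assumptions.

```python
from collections import Counter

FIELDS = ("category", "macro", "escalate")


def diff_report(pairs: list[tuple[dict, dict]]) -> dict[str, float]:
    """Fraction of tickets where shadow disagrees with production, per field."""
    disagreements = Counter()
    for production, shadow in pairs:
        for name in FIELDS:
            if production.get(name) != shadow.get(name):
                disagreements[name] += 1
    total = len(pairs)
    return {name: (disagreements[name] / total if total else 0.0) for name in FIELDS}
```

A disagreement is not automatically a regression, since the candidate may be the one that is right. The report mainly tells reviewers which tickets to pull for the human spot audit.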

Harbor went from a “weekend surprise” to zero customer-visible incidents across four subsequent model bumps. The expensive lesson: offline eval is necessary but not sufficient; live traffic shape shows up only in shadow and canary, and human override behavior shows up only under canary.

Technique decision table

Technique	Best when	Weak when
Shadow inference	High stakes, need full traffic replay, compare before any user exposure	Latency-sensitive realtime voice; tight GPU budget without sampling
Canary with gates	Production quality metrics exist; gradual risk tolerance	Very low traffic (insufficient sample); no rollback path
Blue-green switch	Self-hosted stacks; instant 100% cutover after validation	Cannot keep dual stacks warm; provider only offers rolling deploy
Classic A/B test	Product experiments with long horizons and causal readouts	Need fast rollback on safety regressions; confounded prompt changes
Direct cutover	Internal tools, low stakes, tiny user base	Customer-facing agents, compliance, or irreversible tool actions
Offline-only eval	Pre-filtering obvious regressions before any live phase	Used as a substitute for a live canary on distribution shift and edge cases

Common pitfalls

Non-sticky canaries — users see inconsistent answers mid-conversation; metrics become noise.
Promoting on vanity metrics — shorter replies can look “faster” while resolution rate falls.
Shadow without logging retrieval — you compare outputs but miss that B retrieves worse chunks.
No automatic hold — humans must notice regressions; use alerts on gate breaches.
Canary on employees only — dogfood traffic skews toward easy cases; include real customer traffic shape.
Irreversible canary actions — agents that send email or charge cards need shadow or dry-run tool modes first.
Version soup — cannot tell whether prompt or model caused the delta; version each artifact separately.
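For the irreversible-actions pitfall above, a dry-run tool mode can wrap the agent's tool calls so that side-effecting tools are recorded instead of executed. The class, tool names, and return shape below are hypothetical.

```python
from typing import Callable


class ToolRunner:
    """Executes tools, or only records intended calls when dry_run is set."""

    def __init__(self, tools: dict[str, Callable[..., object]],
                 irreversible: set[str], dry_run: bool) -> None:
        self.tools = tools
        self.irreversible = irreversible
        self.dry_run = dry_run
        self.intended: list[tuple[str, dict]] = []

    def call(self, name: str, **kwargs) -> object:
        """Run the tool, unless it is irreversible and dry_run is on."""
        if self.dry_run and name in self.irreversible:
            self.intended.append((name, kwargs))  # log the intent, do not execute
            return {"status": "dry_run", "tool": name}
        return self.tools[name](**kwargs)
```

Read-only tools still run normally, so the candidate sees realistic context, while `intended` captures what it would have sent or charged. Comparing that list against production's real actions covers the "downstream actions" check without side effects.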

Production checklist

Version model, prompt, retrieval index, tools, and guardrails independently.
Run offline eval and safety suites; block release on hard failures.
Enable shadow inference with full trace logging before any user-facing canary.
Define promotion gates (quality, safety, latency, cost) with pre-registered thresholds.
Use sticky routing keys per session or conversation.
Keep production stack warm for sub-minute rollback.
Alert on gate breaches; auto-hold canary when safety metrics spike.
Ramp traffic in stages (2% → 10% → 50% → 100%).
Audit a stratified sample of canary outputs daily during rollout.
Document each promotion with artifact IDs and metric readouts in the change log.

Key takeaways

LLM releases fail quietly — offline benchmarks alone do not catch production distribution shift.
Shadow mode compares full stacks on live traffic before any user sees the candidate.
Canary promotion needs pre-defined gates, sticky sessions, and instant rollback to the incumbent stack.
Version prompts, retrieval, and models separately so you know what to revert.
Harbor Support eliminated weekend incidents by shadowing seven days, then ramping a gated 2% canary.