Guide

LLM agent feature flag and dynamic configuration systems explained

Harbor Platform, a multi-tenant B2B agent stack, ran twelve production services each reading its own environment variables for “feature toggles”: whether the refund tool was enabled, which embedding model powered retrieval, whether reflection loops ran, and the default fallback model ladder. When a partner's CRM webhook started returning 500s, the refund tool poisoned 46% of runs in that tenant's slice. On-call engineers needed a coordinated redeploy across four repos to flip REFUND_TOOL_ENABLED=false. Median time-to-mitigate was 23 minutes; two tenants churned during the window. Worse, staging and production env files had drifted — one service still had the tool enabled after the “fix” deploy because its chart used a different key name.

The remediation was a centralized dynamic configuration plane: versioned flag snapshots, layered overrides (global, tenant, route), sub-second propagation to workers, and audit-logged kill switches that apply on the next run boundary without redeploying binaries. Median mitigation dropped to under 90 seconds; repeat misconfiguration incidents fell from 46% of sev-2 pages to 5.2%. This guide explains what agent feature flags are, how they differ from prompt versioning and canary deploys, resolution order, evaluation timing, config versioning, the Harbor Platform refactor, a decision table, pitfalls, and a production checklist.

What feature flags do in agent systems

A feature flag system for LLM agents answers: which capabilities, models, tools, and policy knobs are active for this run right now? It is the runtime control plane for behavior that changes more often than application binaries.

Typical flag categories:

Tool gates — enable/disable individual tools or entire tool families (writes, external HTTP, code execution).
Model routing — primary model, fallback ladder depth, max output tokens, reasoning mode on/off.
Retrieval and memory — RAG index version, hybrid search on/off, episodic memory write permissions.
Agent loop behavior — max tool iterations, reflection loops, parallel tool cap, human-in-the-loop requirements.
Degradation and cost — semantic cache on/off, batching mode, SLA profile selection.
UX and channel — streaming on/off, attachment ingestion, voice channel features.

Flags are not a substitute for prompt template versioning (instruction text) or binary canary deploys (runtime code). They complement both: templates define what to say; code defines how to execute; flags define which paths are allowed for this tenant today.

Configuration layering and resolution order

Production systems resolve config through a deterministic stack. Later layers override earlier ones; the resolver logs the winning value and provenance on every run:

Platform defaults — safe baseline baked into the config service schema.
Environment slice — staging vs production namespace (never share one flat file).
Global live snapshot — versioned bundle promoted by ops; all tenants inherit unless overridden.
Tenant overlay — per-customer contract limits (tool allowlists, data residency, model tier).
Route or agent profile — support-triage vs contract-review may differ.
Experiment bucket — sticky hash assignment for A/B flags tied to canary metrics.
Emergency override — break-glass kill switch with TTL and mandatory audit reason.

resolved = merge(
  platform_defaults,
  env_slice,
  snapshot@rev_1842,
  tenant_overlay[acme_corp],
  route_profile["refund-assistant"],
  experiment_bucket(user_id, "reflection_v2"),
  emergency_override if active
)
trace.set("config.snapshot_rev", 1842)
trace.set("config.provenance", resolved.audit_trail)

Resolution must be pure and deterministic for a given (snapshot_rev, tenant_id, route, user_id, timestamp). Nondeterministic “latest from cache” without a revision id makes incident replay impossible.

Evaluation timing: when flags bind to a run

Three binding strategies matter for correctness:

Run-start snapshot (recommended default)

At start_run, resolve all flags into an immutable RunConfig attached to the run id. Mid-run config changes do not affect in-flight trajectories — avoiding half-old / half-new tool sets. New runs pick up the updated snapshot within the propagation SLA (target < 5 s).

Step-boundary refresh

Re-resolve selected flags between orchestrator steps (e.g. before each tool batch). Use sparingly for long-running workflows where kill switches must land faster than run duration. Document which flags are hot_reloadable vs run_pinned.

Process-boot only (anti-pattern)

Reading env vars once at worker import was Harbor's failure mode. Pools lived for hours; toggles appeared flipped in dashboards while workers still executed stale config. Ban this pattern for any customer-visible behavior.

Pair run-start snapshots with checkpoint persistence so resumed runs rehydrate the same RunConfig, not the latest global snapshot.

Versioned snapshots and safe rollout

Treat config changes like schema migrations:

Every publish creates snapshot_rev N with diff, author, and rollback pointer to N-1.
Staging validates against synthetic runs before production promotion.
Gradual promotion: 1% tenants → 10% → 100% with automatic rollback if error rates spike.
Flags that affect billing or compliance require signed approval in the change log.

Flag keys should be typed (boolean, enum, integer bounded, JSON with schema). Free-form strings in production invite typos like Harbor's REFUND_TOOL_ENABLED vs ENABLE_REFUND_TOOL split across services. A single registry enforces names, defaults, and deprecation windows.

Integrate with tenant isolation: tenant overlays live in separate namespaces; a global kill switch must not accidentally wipe a tenant's required-on compliance flag without an explicit override hierarchy rule.

Kill switches and incident playbooks

Agent incidents often need surgical degradation, not full outage:

Incident	Flag action	User-visible effect
Poisoned external tool	`tools.refund_api.enabled=false`	Agent explains refund unavailable; escalates to human
Embedding index corrupt	`rag.hybrid_search=false`; pin prior index rev	Keyword-only retrieval; banner in internal dash
Model provider outage	Deepen fallback ladder; disable reflection	Shorter answers; higher cache hit rate
Runaway cost spike	`loop.max_iterations=3`; disable parallel tools	Cheaper trajectories; queue wait may rise
Prompt injection campaign	Require HITL on write tools globally	Slower writes; blocks automated exfil

Kill switches need: one-click UI, mandatory reason code, automatic TTL (e.g. 4 h unless renewed), Slack/PagerDuty audit event, and a metric panel showing active overrides. On-call runbooks link flag keys to dashboards — not to Helm chart line numbers.

Harbor Platform refactor

Root causes beyond scattered env vars:

No single revision id — impossible to answer which config produced a bad run.
Cross-service key drift — twelve naming conventions for the same concept.
Worker boot pinning — toggles changed in k8s secrets but not in long-lived pods.
Tenant conflation — global disable broke a tenant that contractually required write tools.

Shipped fixes:

Central config service with snapshot_rev, gRPC watch stream to workers.
Run-start RunConfig blob stored on every trace root span.
Tenant overlay UI for CS and solutions engineers (no deploy rights needed).
Emergency kill switch with 60 s propagation SLO and auto-expire.
CI guard: services cannot read new env keys; only the config client library.

Median incident mitigation fell from 23 minutes to 87 seconds. Misconfiguration-related sev-2 pages dropped from 46% to 5.2% of total agent incidents over two quarters.

Technique decision table

Approach	Best for	Weak when
Env vars per service	Early prototypes, single binary	Multi-tenant agents, frequent toggles, incidents
Config file in object storage	Infrequent changes, GitOps workflow	Sub-minute kill switches without redeploy
Versioned snapshot + watch stream	Production agents, tenant overlays	Teams unwilling to operate a config service
Prompt template registry alone	Instruction changes, semver rollouts	Disabling a tool or model route instantly
Binary canary deploy	Runtime code changes, new tool implementations	Flipping retrieval index or policy knob
Run-pinned snapshot	Consistent multi-step trajectories	Need mid-run kill switch within seconds

Common pitfalls

Flags without revision ids — cannot debug or replay incidents.
Boot-time-only reads — dashboards lie; workers run stale config for hours.
Mid-run mixed config — step 1 uses new model, step 3 uses old tool set; chaotic failures.
Untyped string flags — "flase" enables paths you thought were off.
Global kill switch collateral — disables compliance-required behavior for all tenants.
Flag sprawl without ownership — 200 keys nobody dares delete; fear of unknown dependencies.
No linkage to observability — metrics sliced by snapshot_rev missing; canaries blind.

Production checklist

Define a typed flag registry with owners, defaults, and deprecation policy.
Publish versioned snapshots; log snapshot_rev on every run root span.
Resolve run-start RunConfig; pin on checkpoint resume.
Support tenant and route overlays without per-tenant binary builds.
Implement emergency kill switches with TTL, reason codes, and audit events.
Propagate config changes to workers in < 5 s via watch stream or poll.
Ban direct env-var feature toggles in application code (lint in CI).
Document which flags are hot_reloadable vs run-pinned.
Integrate snapshot promotion with canary error-rate gates.
Run game-day drills: disable a tool under load and measure mitigation time.
Expose read-only config diff in support UI for CS troubleshooting.

Key takeaways

Agent feature flags are the runtime control plane for tools, models, and policy knobs.
Versioned snapshots with revision ids make config changes auditable and replayable.
Run-start pinning prevents mid-trajectory inconsistency; hot reload only where explicitly allowed.
Kill switches must propagate in seconds, not redeploy minutes.
Harbor Platform cut misconfiguration incidents from 46% to 5.2% with centralized dynamic config.