
LLM prompt versioning and registry explained

Harbor Legal's contract-review copilot shipped a “tiny” system-prompt tweak on a Friday: one sentence telling the model to “be concise.” By Monday, indemnity-clause extraction accuracy dropped from 91% to 74% on the nightly golden set — but nobody noticed until a paralegal escalated three missed carve-outs. The change lived in a pull request buried inside a refactor; production had no prompt ID in traces, no pinned version in config, and no eval gate between merge and deploy. Rollback took four hours because three services each cached a different string from environment variables. The fix was not more prompt engineering talent; it was a prompt registry with immutable versions, environment pins, and promotion only after offline evaluation cleared.

Prompt versioning treats instructions, templates, few-shot packs, and tool schemas as first-class release artifacts rather than copy-paste strings in application code. A registry stores each version with metadata (author, model family, eval scores, changelog) and lets the runtime resolve contract-review@2.4.1 instead of whatever was last edited. This guide covers artifact taxonomy, registry architectures, semver and environment pins, eval gates tied to canary rollouts, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist.

What belongs in a prompt registry

Teams that version only the system prompt still get surprised when other instruction surfaces drift. A complete registry tracks every text or schema the model sees:

| Artifact type | Examples | Version independently? |
| --- | --- | --- |
| System / developer instructions | Role, tone, safety policy, output format rules | Yes; highest blast radius |
| Task templates | Jinja-style user prompts with {{variables}} | Yes; per workflow or locale |
| Few-shot exemplars | Input/output pairs, chain-of-thought demos | Yes; often changes more than system text |
| Tool and function schemas | JSON Schema for function calling | Yes; coupled to parser expectations |
| Retrieval snippets | Static RAG chunks injected every turn | Yes, or tie to index version |
| Decoding defaults | temperature, max_tokens, stop sequences | Often bundled as “prompt profile” |

Bundle related artifacts into a prompt profile (e.g. harbor-legal-contract-review) while keeping each component addressable. That lets you bump few-shots without touching system instructions, and compare traces that log profile=contract-review version=2.4.1 component=system.
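One way to picture a profile whose components stay individually addressable is a small immutable data model. The sketch below is illustrative only: the class names, the component names, and the component version numbers are hypothetical, and a real registry would load content from storage rather than hold it inline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One independently versioned artifact inside a profile."""
    name: str      # e.g. "system", "few_shots", "tool_schema" (hypothetical names)
    version: str   # semver of this component alone
    content: str


@dataclass(frozen=True)
class PromptProfile:
    """A named bundle that pins exactly one version of each component."""
    name: str
    version: str
    components: tuple[Component, ...]

    def component(self, name: str) -> Component:
        for c in self.components:
            if c.name == name:
                return c
        raise KeyError(f"{self.name} has no component {name!r}")

    def trace_labels(self, name: str) -> dict[str, str]:
        """Labels to attach to a trace for one component of this profile."""
        c = self.component(name)
        return {
            "profile": self.name,
            "version": self.version,
            "component": c.name,
            "component_version": c.version,
        }

    def with_component(self, new: Component, profile_version: str) -> "PromptProfile":
        """Return a new profile version with one component swapped in."""
        self.component(new.name)  # raises KeyError if the component is unknown
        swapped = tuple(new if c.name == new.name else c for c in self.components)
        return PromptProfile(self.name, profile_version, swapped)
```

Because both classes are frozen, bumping the few-shots produces a new profile object and leaves the old one untouched, which mirrors the idea of changing one component without editing the others.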

Registry architectures: git, database, and vendor platforms

Three patterns dominate production. Most mature teams combine git for review with a runtime store for low-latency fetch and audit:

Git-as-source-of-truth

Prompts live in YAML or Markdown under prompts/contract-review/v2.4.1/. Pull requests carry diffs humans can read; CI runs eval suites before merge. Pros: familiar workflow, free history, easy rollback via revert. Cons: the runtime must pull or sync prompts on deploy, and non-engineers cannot edit copy without a PR unless you add a UI that commits back to git.

Database or object-store registry

Each publish writes an immutable row: id, semver, content_hash, blob_url, eval_run_id, promoted_at. Services resolve version at startup or via a sidecar that watches for updates. Pros: dynamic rollback without redeploy, fine-grained ACLs for legal/compliance teams. Cons: you must build review UI and prevent unversioned overwrites.

Vendor prompt platforms (LangSmith, Humanloop, etc.)

Hosted registries with playground, A/B labels, and trace linkage. Pros: fast to adopt, integrated with observability. Cons: vendor lock-in, export discipline for disaster recovery, and cost at scale.

Regardless of backend, enforce immutable versions: publishing 2.4.2 never mutates 2.4.1's blob. Fixes ship as new semver, not silent edits.
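The immutability rule can be enforced at publish time with a content hash. The following is a minimal in-memory sketch, not a production store: the class and method names are hypothetical, and the field names loosely follow the row described earlier (semver, content_hash, eval_run_id).

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    prompt_id: str
    semver: str
    content: str
    content_hash: str
    eval_run_id: Optional[str] = None


class ImmutableRegistry:
    """In-memory stand-in for a database or object-store registry."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], RegistryEntry] = {}

    def publish(self, prompt_id: str, semver: str, content: str,
                eval_run_id: Optional[str] = None) -> RegistryEntry:
        key = (prompt_id, semver)
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        existing = self._entries.get(key)
        if existing is not None:
            if existing.content_hash == digest:
                return existing  # republishing identical bytes is a no-op
            raise ValueError(
                f"{prompt_id}@{semver} already exists with different content; "
                "publish a new semver instead"
            )
        entry = RegistryEntry(prompt_id, semver, content, digest, eval_run_id)
        self._entries[key] = entry
        return entry

    def resolve(self, prompt_id: str, semver: str) -> RegistryEntry:
        return self._entries[(prompt_id, semver)]
```

Treating an identical republish as a no-op keeps deploy scripts idempotent, while any attempt to change the bytes behind an existing version fails loudly.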

Version semantics and environment pins

Use semantic versioning with team-defined meaning:

  • Major — breaking output schema, new required tools, or model family change (re-run full eval + shadow).
  • Minor — instruction changes that may shift quality distribution (re-run task evals, canary before full promote).
  • Patch — typo fixes, whitespace, comments with no behavioral intent (still log; optional smoke eval).
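A promotion script can derive the required gates from the bump type. This sketch assumes plain MAJOR.MINOR.PATCH cores and ignores pre-release ordering (so 2.5.0-rc.1 to 2.5.0-rc.2 is out of scope); the gate names are hypothetical labels for the checks listed above.

```python
def parse_core(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', ignoring any pre-release or build suffix."""
    core = version.split("+", 1)[0].split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify the change from old to new as 'major', 'minor', or 'patch'."""
    o, n = parse_core(old), parse_core(new)
    if n <= o:
        raise ValueError(f"{new} is not newer than {old}")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


# Hypothetical gate names mapped from the team-defined meanings above.
REQUIRED_GATES = {
    "major": ["full_eval", "shadow"],
    "minor": ["task_eval", "canary"],
    "patch": ["log_change"],  # a smoke eval stays optional for patches
}


def gates_for(old: str, new: str) -> list[str]:
    return REQUIRED_GATES[bump_kind(old, new)]
```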

Environment config pins explicit versions — never latest in production:

# production.env
PROMPT_CONTRACT_REVIEW=harbor-legal/contract-review@2.4.1
PROMPT_SUPPORT_TRIAGE=harbor-support/triage@1.8.0

# staging.env — may trail prod or test candidates
PROMPT_CONTRACT_REVIEW=harbor-legal/contract-review@2.5.0-rc.2

Staging runs candidate versions against production traffic mirrors or shadow paths. Promotion updates the pin only after gates pass — same pattern as canary deployment, but scoped to prompt artifacts without swapping model weights.
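Pin strings in the namespace/name@version form shown above can be validated when config loads, so that latest never reaches production. The parser below is a sketch: the regular expression, the function name, and the rule that production rejects pre-release suffixes such as -rc.2 are assumptions, not something the pin format itself mandates.

```python
import re
from typing import NamedTuple

# Accepts e.g. "harbor-legal/contract-review@2.4.1" or "...@2.5.0-rc.2".
PIN_RE = re.compile(
    r"^(?P<namespace>[a-z0-9-]+)/(?P<name>[a-z0-9-]+)"
    r"@(?P<version>\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?)$"
)


class Pin(NamedTuple):
    namespace: str
    name: str
    version: str


def parse_pin(value: str, environment: str) -> Pin:
    """Parse a pin and reject floating tags such as 'latest'."""
    match = PIN_RE.match(value)
    if match is None:
        raise ValueError(f"not an explicit version pin: {value!r}")
    pin = Pin(**match.groupdict())
    # Assumption: release candidates are allowed in staging only.
    if environment == "production" and "-" in pin.version:
        raise ValueError(f"pre-release {pin.version} is not allowed in production")
    return pin
```

Failing at config load time means a bad pin stops the deploy instead of surfacing later as an unexplained quality change.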

Evaluation gates before promotion

A registry without eval gates becomes a fancy graveyard of untested strings. Minimum pipeline for any minor or major prompt bump:

  1. Golden set regression — fixed inputs with expected structured fields or rubric scores; fail CI if accuracy drops more than agreed delta.
  2. LLM-as-judge on stratified sample — compare vN vs vN+1 on dimensions defined in prompt specs (completeness, citation fidelity, tone).
  3. Safety replay — red-team probes and PII leakage cases must not regress; guardrail trigger rates within baseline.
  4. Cost and latency smoke — longer instructions or few-shots change token counts; check p95 latency on representative contexts.
  5. Human sign-off for regulated flows — legal, medical, or financial prompts may require named approver on the registry record.
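The first gate, golden set regression, reduces to comparing candidate accuracy against the baseline with an agreed tolerance. The sketch below assumes exact-match scoring on structured fields and a default tolerance of 0.015; both are placeholders, and a rubric or F1 metric would slot in the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    baseline_accuracy: float
    candidate_accuracy: float


def accuracy(predictions: list[dict], expected: list[dict]) -> float:
    """Fraction of golden cases whose structured output matches exactly."""
    if not expected or len(predictions) != len(expected):
        raise ValueError("predictions and expected must be non-empty and aligned")
    correct = sum(1 for p, e in zip(predictions, expected) if p == e)
    return correct / len(expected)


def golden_set_gate(baseline: list[dict], candidate: list[dict],
                    expected: list[dict], max_drop: float = 0.015) -> GateResult:
    """Pass only if the candidate loses no more than max_drop accuracy."""
    base = accuracy(baseline, expected)
    cand = accuracy(candidate, expected)
    return GateResult(cand >= base - max_drop, base, cand)
```

In CI, a failing GateResult would translate into a non-zero exit code so the merge or promotion is blocked.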

Store eval_run_id on the registry entry so incident response can answer “what did we know before we shipped this?” Traces in production should include prompt_version alongside model and request_id.

Harbor Legal refactor: from env vars to pinned profiles

After the Friday incident, Harbor Legal moved contract review to a git-backed registry synced to S3 on deploy:

  1. Split artifacts — system policy, clause-extraction template, indemnity few-shots, and JSON schema each got independent semver under prompts/harbor-legal/.
  2. CI eval on every PR — 240 golden contracts plus 40 adversarial edge cases; merge blocked if extraction F1 dropped >1.5 points.
  3. Runtime resolver — a 20-line library loads pinned versions, renders templates, and logs content_hash to OpenTelemetry spans.
  4. Rollback drill — on-call can repoint prod pin to previous semver in under two minutes without rebuilding containers.
  5. Shadow new minors — candidate prompts run in shadow for 72 hours on 10% of traffic before pin update.
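The runtime resolver from step 3 can be sketched in a few lines. This is not Harbor Legal's library: the function names are hypothetical, a plain dict stands in for the synced S3 store, and the returned attribute dict is what a caller would copy onto an OpenTelemetry span.

```python
import hashlib
import re

# Matches Jinja-style placeholders such as {{ contract_text }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict[str, str]) -> str:
    """Fill placeholders; fail loudly if a variable is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing template variable: {key}")
        return variables[key]
    return PLACEHOLDER.sub(substitute, template)


def resolve_prompt(store: dict[str, str], pin: str,
                   variables: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Load the pinned template, render it, and return span attributes."""
    if pin not in store:
        raise LookupError(f"pinned prompt not found: {pin}")
    template = store[pin]
    profile, _, version = pin.partition("@")
    attributes = {
        "prompt_profile": profile,
        "prompt_version": version,
        # Hash the stored template, not the rendered text, so the value
        # identifies the release artifact rather than one request.
        "content_hash": hashlib.sha256(template.encode("utf-8")).hexdigest(),
    }
    return render(template, variables), attributes
```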

The “be concise” regression would have failed the golden set: shorter answers omitted required carve_out_list fields in structured output validation. Eval gates turned a silent quality cliff into a blocked merge.
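The structured output check that would have caught the regression might look like the following. Only the carve_out_list field name comes from the incident; treating it as a JSON list and the function name are assumptions made for illustration.

```python
import json

# Required fields and their expected JSON types (type is an assumption).
REQUIRED_FIELDS = {"carve_out_list": list}


def validate_extraction(raw_output: str) -> list[str]:
    """Return a list of validation errors; empty means the output is valid."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing required field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field} must be {expected_type.__name__}")
    return errors
```

Running this over every golden-set response turns an omitted field into a counted failure rather than a shorter answer that merely looks tidy.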

Technique decision table

| Approach | Use when | Avoid when |
| --- | --- | --- |
| Hard-coded strings in app code | Throwaway prototypes, single developer, no compliance needs | Any multi-env production stack with more than one editor |
| Git-only registry + CI eval | Engineering-owned prompts, strong PR culture, infra already in git | Non-technical stakeholders must edit copy daily without PRs |
| DB registry with admin UI | Ops/legal need self-serve edits, dynamic rollback, audit trails | Team lacks discipline on immutability and eval automation |
| Vendor prompt platform | Fast MVP, integrated tracing, small team without registry build capacity | Strict data residency or long-term vendor independence requirements |
| Prompt profile bundles | Multiple artifacts must move together for a given workflow | You need to A/B test single sentence changes in isolation |
| Feature flags for prompt version | Gradual canary by user cohort alongside registry pins | Only flag exists with no version history or eval record |
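For the feature-flag row, cohort assignment is usually made deterministic so a user sees the same prompt version on every request. The sketch below hashes a user ID into one of 100 buckets; the function names and the salt value are hypothetical, and both versions are assumed to be published registry entries.

```python
import hashlib


def cohort_bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, stable: str, candidate: str,
                   canary_percent: int,
                   salt: str = "contract-review-canary") -> str:
    """Serve the candidate to roughly canary_percent of users, else stable."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    if cohort_bucket(user_id, salt) < canary_percent:
        return candidate
    return stable
```

Changing the salt reshuffles cohorts, which is useful when a new experiment should not reuse the same early-exposure users.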

Common pitfalls

  • Mutating “the same” version — destroys reproducibility; always publish new semver.
  • latest in production — deploy order races make incidents non-deterministic.
  • Versioning prompts but not few-shots — exemplar edits move metrics as much as system text.
  • No prompt ID in traces — impossible to correlate user complaints with a release.
  • Eval only on merge, never on promote — staging pin can drift from tested artifact if sync fails.
  • Cross-model prompt reuse without re-eval — GPT-4 and Claude respond differently to identical instructions.
  • Secrets in prompt blobs — real API keys or credentials pasted into few-shot examples end up in logs and in data sent to the model vendor, regardless of training opt-out settings.
  • Registry bypass via hotfix env var — emergency patches skip audit; use break-glass pin with ticket ID instead.
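A break-glass pin can be modeled as an audited record rather than a raw environment variable. In this sketch the ticket format (for example INC-1234), the record fields, and the rule that the override must point at an already published version are all assumptions.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical ticket format: uppercase project key, dash, number.
TICKET_RE = re.compile(r"^[A-Z]+-\d+$")


@dataclass(frozen=True)
class BreakGlassPin:
    pin: str
    ticket_id: str
    requested_by: str
    created_at: datetime


def break_glass(pin: str, ticket_id: str, requested_by: str,
                published: set[str]) -> BreakGlassPin:
    """Create an audited emergency override for a production pin."""
    if pin not in published:
        raise ValueError("break-glass may only point at a published version")
    if not TICKET_RE.match(ticket_id):
        raise ValueError(f"a valid ticket ID is required, got {ticket_id!r}")
    return BreakGlassPin(pin, ticket_id, requested_by, datetime.now(timezone.utc))
```

The override still resolves through the registry, so traces keep a real version and the ticket explains why the normal promotion path was skipped.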

Production checklist

  • Inventory all prompt artifacts per workflow (system, templates, few-shots, tools).
  • Enforce immutable semver on every publish; block in-place overwrites.
  • Pin explicit versions per environment; ban latest in prod config.
  • Run golden-set regression in CI on any prompt-affecting PR.
  • Attach eval_run_id and approver to registry metadata.
  • Log prompt_profile, prompt_version, and content_hash on every LLM span.
  • Document rollback: repoint the pin or redeploy the previous git tag within SLO.
  • Shadow or canary minor+ bumps before full traffic promotion.
  • Re-eval when changing model family, even if prompt text is unchanged.
  • Export registry snapshots for disaster recovery if using hosted platform.
  • Separate staging candidates from production pins with clear promotion workflow.

Key takeaways

  • Prompts are release artifacts — version them like code, not like copy docs.
  • Immutable semver plus environment pins make rollback minutes, not hours.
  • Eval gates on registry promotion catch regressions that code review misses.
  • Traces must include prompt version IDs or incidents stay unexplainable.
  • Bundle system, few-shots, and schemas — but version each component independently.
