Guide
LLM prompt versioning and registry explained
Harbor Legal's contract-review copilot shipped a “tiny” system-prompt tweak on a Friday: one sentence telling the model to “be concise.” By Monday, indemnity-clause extraction accuracy dropped from 91% to 74% on the nightly golden set — but nobody noticed until a paralegal escalated three missed carve-outs. The change lived in a pull request buried inside a refactor; production had no prompt ID in traces, no pinned version in config, and no eval gate between merge and deploy. Rollback took four hours because three services each cached a different string from environment variables. The fix was not more prompt engineering talent; it was a prompt registry with immutable versions, environment pins, and promotion only after offline evaluation cleared.
Prompt versioning treats instructions, templates, few-shot packs, and
tool schemas as first-class release artifacts — not copy-paste strings in application
code. A registry stores each version with metadata (author, model family, eval scores,
changelog) and lets runtime resolve contract-review@2.4.1 instead of
whatever was last edited. This guide covers artifact taxonomy, registry architectures,
semver and environment pins, eval gates tied to
canary rollouts,
the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist.
What belongs in a prompt registry
Teams that version only the system prompt still get surprised when other instruction surfaces drift. A complete registry tracks every text or schema the model sees:
| Artifact type | Examples | Version independently? |
|---|---|---|
| System / developer instructions | Role, tone, safety policy, output format rules | Yes — highest blast radius |
| Task templates | Jinja-style user prompts with {{variables}} |
Yes — per workflow or locale |
| Few-shot exemplars | Input/output pairs, chain-of-thought demos | Yes — often changes more than system text |
| Tool and function schemas | JSON Schema for function calling | Yes — coupled to parser expectations |
| Retrieval snippets | Static RAG chunks injected every turn | Yes — or tie to index version |
| Decoding defaults | temperature, max_tokens, stop sequences | Often bundled as “prompt profile” |
Bundle related artifacts into a prompt profile (e.g.
harbor-legal-contract-review) while keeping each component addressable.
That lets you bump few-shots without touching system instructions, and compare traces
that log profile=contract-review version=2.4.1 component=system.
Registry architectures: git, database, and vendor platforms
Three patterns dominate production. Most mature teams combine git for review with a runtime store for low-latency fetch and audit:
Git-as-source-of-truth
Prompts live in YAML or Markdown under prompts/contract-review/v2.4.1/.
Pull requests carry diffs humans can read; CI runs eval suites before merge. Pros:
familiar workflow, free history, easy rollback via revert. Cons: runtime must pull or
sync on deploy — not ideal for non-engineers editing copy without a PR, unless
you add a UI that commits back to git.
Database or object-store registry
Each publish writes an immutable row: id, semver, content_hash, blob_url,
eval_run_id, promoted_at. Services resolve version at startup or via a sidecar
that watches for updates. Pros: dynamic rollback without redeploy, fine-grained ACLs
for legal/compliance teams. Cons: you must build review UI and prevent unversioned
overwrites.
Vendor prompt platforms (LangSmith, Humanloop, etc.)
Hosted registries with playground, A/B labels, and trace linkage. Pros: fast to adopt, integrated with observability. Cons: vendor lock-in, export discipline for disaster recovery, and cost at scale.
Regardless of backend, enforce immutable versions: publishing
2.4.2 never mutates 2.4.1's blob. Fixes ship as new
semver, not silent edits.
Version semantics and environment pins
Use semantic versioning with team-defined meaning:
- Major — breaking output schema, new required tools, or model family change (re-run full eval + shadow).
- Minor — instruction changes that may shift quality distribution (re-run task evals, canary before full promote).
- Patch — typo fixes, whitespace, comments with no behavioral intent (still log; optional smoke eval).
Environment config pins explicit versions — never latest in production:
# production.env
PROMPT_CONTRACT_REVIEW=harbor-legal/contract-review@2.4.1
PROMPT_SUPPORT_TRIAGE=harbor-support/triage@1.8.0
# staging.env — may trail prod or test candidates
PROMPT_CONTRACT_REVIEW=harbor-legal/contract-review@2.5.0-rc.2
Staging runs candidate versions against production traffic mirrors or shadow paths. Promotion updates the pin only after gates pass — same pattern as canary deployment, but scoped to prompt artifacts without swapping model weights.
Evaluation gates before promotion
A registry without eval gates becomes a fancy graveyard of untested strings. Minimum pipeline for any minor or major prompt bump:
- Golden set regression — fixed inputs with expected structured fields or rubric scores; fail CI if accuracy drops more than agreed delta.
- LLM-as-judge on stratified sample — compare vN vs vN+1 on dimensions defined in prompt specs (completeness, citation fidelity, tone).
- Safety replay — red-team probes and PII leakage cases must not regress; guardrail trigger rates within baseline.
- Cost and latency smoke — longer instructions or few-shots change token counts; check p95 latency on representative contexts.
- Human sign-off for regulated flows — legal, medical, or financial prompts may require named approver on the registry record.
Store eval_run_id on the registry entry so incident response can answer
“what did we know before we shipped this?” Traces in production should
include prompt_version alongside model and request_id.
Harbor Legal refactor: from env vars to pinned profiles
After the Friday incident, Harbor Legal moved contract review to a git-backed registry synced to S3 on deploy:
- Split artifacts — system policy, clause-extraction template,
indemnity few-shots, and JSON schema each got independent semver under
prompts/harbor-legal/. - CI eval on every PR — 240 golden contracts plus 40 adversarial edge cases; merge blocked if extraction F1 dropped >1.5 points.
- Runtime resolver — a 20-line library loads pinned versions,
renders templates, and logs
content_hashto OpenTelemetry spans. - Rollback drill — on-call can repoint prod pin to previous semver in under two minutes without rebuilding containers.
- Shadow new minors — candidate prompts run in shadow for 72 hours on 10% of traffic before pin update.
The “be concise” regression would have failed the golden set: shorter answers
omitted required carve_out_list fields in
structured output
validation. Eval gates turned a silent quality cliff into a blocked merge.
Technique decision table
| Approach | Use when | Avoid when |
|---|---|---|
| Hard-coded strings in app code | Throwaway prototypes, single developer, no compliance needs | Any multi-env production stack with more than one editor |
| Git-only registry + CI eval | Engineering-owned prompts, strong PR culture, infra already in git | Non-technical stakeholders must edit copy daily without PRs |
| DB registry with admin UI | Ops/legal need self-serve edits, dynamic rollback, audit trails | Team lacks discipline on immutability and eval automation |
| Vendor prompt platform | Fast MVP, integrated tracing, small team without registry build capacity | Strict data residency or long-term vendor independence requirements |
| Prompt profile bundles | Multiple artifacts must move together for a given workflow | You need to A/B test single sentence changes in isolation |
| Feature flags for prompt version | Gradual canary by user cohort alongside registry pins | Only flag exists with no version history or eval record |
Common pitfalls
- Mutating “the same” version — destroys reproducibility; always publish new semver.
latestin production — deploy order races make incidents non-deterministic.- Versioning prompts but not few-shots — exemplar edits move metrics as much as system text.
- No prompt ID in traces — impossible to correlate user complaints with a release.
- Eval only on merge, never on promote — staging pin can drift from tested artifact if sync fails.
- Cross-model prompt reuse without re-eval — GPT-4 and Claude respond differently to identical instructions.
- Secrets in prompt blobs — API examples in few-shots leak into logs and vendor training opt-outs.
- Registry bypass via hotfix env var — emergency patches skip audit; use break-glass pin with ticket ID instead.
Production checklist
- Inventory all prompt artifacts per workflow (system, templates, few-shots, tools).
- Enforce immutable semver on every publish; block in-place overwrites.
- Pin explicit versions per environment; ban
latestin prod config. - Run golden-set regression in CI on any prompt-affecting PR.
- Attach
eval_run_idand approver to registry metadata. - Log
prompt_profile,prompt_version, andcontent_hashon every LLM span. - Document rollback: repoint pin or revert git tag within SLO.
- Shadow or canary minor+ bumps before full traffic promotion.
- Re-eval when changing model family, even if prompt text is unchanged.
- Export registry snapshots for disaster recovery if using hosted platform.
- Separate staging candidates from production pins with clear promotion workflow.
Key takeaways
- Prompts are release artifacts — version them like code, not like copy docs.
- Immutable semver plus environment pins make rollback minutes, not hours.
- Eval gates on registry promotion catch regressions that code review misses.
- Traces must include prompt version IDs or incidents stay unexplainable.
- Bundle system, few-shots, and schemas — but version each component independently.
Related reading
- Prompt engineering explained — instruction design before you version it
- LLM canary and shadow deployment explained — safe promotion after registry eval passes
- LLM evaluation and benchmarking explained — golden sets and regression gates
- LLM observability explained — tracing prompt version in production spans