LLM agent run audit trail and compliance logging explained
Harbor Finance deployed a month-end close agent that reconciled sub-ledgers and
posted adjusting journal entries through NetSuite. During a SOX walkthrough, auditors
asked for evidence that a $4.2M debit to accrued liabilities was
authorized. OpenTelemetry traces showed the agent called
post_journal_entry in 1.8 seconds — but spans do not record
who approved a mutating action, which policy version gated it, or
what the ledger looked like before and after. Internal controls flagged a
material weakness; remediation took eleven days of manual log archaeology across
three systems.
Agent audit trails are append-only, tamper-evident event streams
designed for regulators, security teams, and legal — not for debugging latency.
They differ from distributed traces (ephemeral, sampled, engineer-oriented)
and from chat transcripts (mutable, incomplete,
often missing tool side effects). Harbor split concerns: traces for SRE, a dedicated
agent_audit store for compliance. Each mutating tool call now emits a
signed event with actor, policy hash, approval attestation, argument digest, and
before/after state references. SOX remediation on the next finding dropped from
11 days to 6 hours because auditors could replay
the run timeline without engineering access. This guide covers audit event schemas,
attestation chains, redaction, retention, the Harbor Finance refactor, a technique
decision table, pitfalls, and a production checklist.
Audit logs vs observability traces
Teams often point auditors at the same Jaeger or Datadog deployment engineers use. That fails compliance reviews for predictable reasons.
What traces optimize for
- Latency and error rates: parent/child span trees, percentile dashboards
- Sampling: 1–10% of traffic is normal; auditors need 100% of mutating actions
- Short retention: 7–30 days is typical; SOX-style record keeping often requires years, while GDPR limits how long personal data may be kept, so retention has to be set deliberately rather than inherited from a trace backend
- Mutable backends: re-indexing and TTL deletes are features, not bugs
What audit trails must prove
- Non-repudiation — the event happened; the log was not silently edited afterward
- Actor identity — human user, service account, or delegated subagent with scope
- Policy context — which permission manifest and approval gate version applied
- Causality — ordered chain from user intent through model turn to side effect
- Reconstructability — enough payload to explain the decision without raw secrets
Keep both systems. Export a correlation ID from audit events into traces so engineers can jump from a compliance ticket to a flame graph — but never treat traces as the system of record for regulated mutations.
Immutable event schema
Design one canonical JSON (or Protobuf) envelope every agent runtime emits. Version the schema; never break readers on old runs.
Core fields
- event_id: UUIDv7 or ULID for time-sortable uniqueness
- run_id: ties to durable execution checkpoints
- tenant_id / environment: prod vs staging isolation
- event_type: run.started, tool.invoked, tool.completed, approval.granted, run.terminal
- actor: { type: human|agent|system, id, session_id }
- timestamp_utc: server clock; include client_timestamp only as a hint
- policy_hash: SHA-256 of the active capability manifest at invoke time
- prev_event_hash: hash chain for tamper detection within a run
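The envelope and its hash chain can be sketched as follows. This is a minimal illustration, not Harbor's implementation: the class name, the `schema_version` field, the `append_event` helper, and the sample manifest are all assumptions, and `uuid.uuid4()` stands in for a UUIDv7 or ULID generator (it is unique but not time-sortable).

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def canonical_json(obj: dict) -> bytes:
    """Serialize deterministically so identical content always hashes the same."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


@dataclass(frozen=True)
class AuditEvent:
    run_id: str
    tenant_id: str
    environment: str
    event_type: str
    actor: dict
    policy_hash: str
    prev_event_hash: Optional[str]  # None only for the first event of a run
    schema_version: str = "1"  # hypothetical version label
    # Stand-in for UUIDv7/ULID: unique, but not time-sortable.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def digest(self) -> str:
        """SHA-256 over every field, including the link to the previous event."""
        return hashlib.sha256(canonical_json(asdict(self))).hexdigest()


def append_event(chain: List[AuditEvent], **fields) -> AuditEvent:
    """Link the new event to the digest of the last event in the run."""
    prev = chain[-1].digest() if chain else None
    event = AuditEvent(prev_event_hash=prev, **fields)
    chain.append(event)
    return event


# Hypothetical capability manifest; policy_hash pins the version in force.
manifest = {"tools": {"post_journal_entry": {"gate_tier": 2}}}
policy_hash = hashlib.sha256(canonical_json(manifest)).hexdigest()

common = dict(
    run_id="run-001",
    tenant_id="harbor",
    environment="prod",
    actor={"type": "agent", "id": "close-agent", "session_id": "s-1"},
    policy_hash=policy_hash,
)
chain: List[AuditEvent] = []
first = append_event(chain, event_type="run.started", **common)
second = append_event(chain, event_type="tool.invoked", **common)
```

Because each digest covers `prev_event_hash`, editing an earlier event changes its digest and breaks the link stored in its successor.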
Tool invocation payload
Log digests, not raw secrets. Store args_sha256 plus a
redacted preview (account_id: ****4821). For financial or healthcare
tools, attach state_before_ref and state_after_ref pointers
to immutable object storage (WORM bucket or ledger table) rather than inline blobs.
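One way to build such a payload is sketched below. The key set, the `mask` helper, the `args_preview` field name, and the reference strings are hypothetical; only `args_sha256`, `state_before_ref`, and `state_after_ref` come from the text above.

```python
import hashlib
import json

# Hypothetical list of argument names that must never appear in clear text.
SENSITIVE_KEYS = {"account_id", "card_number"}


def mask(value, visible: int = 4) -> str:
    """Keep only the last few characters; fully mask values too short to truncate."""
    text = str(value)
    if len(text) <= visible:
        return "****"
    return "****" + text[-visible:]


def tool_payload(tool_name: str, args: dict,
                 state_before_ref: str, state_after_ref: str) -> dict:
    """Build the audit payload: a digest of the full args plus a redacted preview."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":")).encode("utf-8")
    preview = {k: mask(v) if k in SENSITIVE_KEYS else v for k, v in args.items()}
    return {
        "tool_name": tool_name,
        "args_sha256": hashlib.sha256(canonical).hexdigest(),
        "args_preview": preview,
        # Pointers into immutable storage, not inline copies of ledger state.
        "state_before_ref": state_before_ref,
        "state_after_ref": state_after_ref,
    }


payload = tool_payload(
    "post_journal_entry",
    {"account_id": "9900124821", "amount": 4200000},
    state_before_ref="ledger-snapshot/before-001",  # hypothetical reference format
    state_after_ref="ledger-snapshot/after-001",
)
```

The digest is computed over the unredacted arguments, so an auditor holding the original call can confirm it matches without the audit store ever holding the raw values.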
Terminal events
Every run ends with run.succeeded, run.failed, run.cancelled, or run.timed_out,
mirroring the FSM in cancellation lifecycle guides. Include side_effect_count,
mutating_tool_count, and total_cost_usd for chargeback and anomaly detection.
Approval attestations and policy gates
Mutating tools should not execute on model enthusiasm alone. Harbor wires tiered approval gates into the audit stream so every write carries proof of authorization.
Attestation record
- gate_tier: 0 (auto) through 3 (dual human)
- approver_id: null for tier 0; SSO subject for human tiers
- approval_method: inline UI click, Slack reaction, ticket ID, break-glass code
- approval_latency_ms: time from proposal to execute
- proposal_digest: hash of the exact tool call the human saw
If the model revises arguments after approval, treat it as a new proposal —
never execute on a stale attestation. Harbor learned this when a journal-entry agent
changed the amount by $200K between approval screen and execution; the mismatch
now hard-fails with attestation_mismatch and emits a security event.
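The digest binding might look like the sketch below. The function names, the exception class, and the attestation dictionary shape are assumptions for illustration; the amounts mirror the $200K drift described above.

```python
import hashlib
import hmac
import json


class AttestationMismatch(Exception):
    """Raised when the call about to execute is not the call that was approved."""


def proposal_digest(tool_name: str, args: dict) -> str:
    """Hash the exact tool call, with sorted keys so ordering cannot change it."""
    body = json.dumps({"tool": tool_name, "args": args},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def execute_if_attested(tool_name: str, args: dict, attestation: dict, execute):
    """Run the tool only if its digest equals the one the approver signed off on."""
    actual = proposal_digest(tool_name, args)
    if not hmac.compare_digest(actual, attestation["proposal_digest"]):
        # A real system would also emit a security event here.
        raise AttestationMismatch("attestation_mismatch")
    return execute(tool_name, args)


approved_args = {"account": "accrued_liabilities", "amount": 4_200_000}
attestation = {
    "gate_tier": 2,
    "approver_id": "sso|controller-01",  # hypothetical SSO subject
    "proposal_digest": proposal_digest("post_journal_entry", approved_args),
}

# The model revises the amount after approval; this must not execute.
revised_args = {"account": "accrued_liabilities", "amount": 4_400_000}
```

Calling `execute_if_attested("post_journal_entry", revised_args, attestation, ...)` raises `AttestationMismatch`, while the same call with `approved_args` proceeds.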
Human-in-the-loop continuity
When runs escalate to operators via human-in-the-loop queues, log queue entry,
assignment, override reason, and release. Auditors care as much about denied
actions as approved ones, so emit approval.denied with rationale text (redacted
if needed).
Redaction, encryption and access control
Audit logs that leak PII create a second compliance problem. Apply the same rules as PII redaction pipelines at write time, not at query time.
- Field-level classification: tag schema fields public, internal, restricted, pci, phi
- Tokenize identifiers: reversible tokens only in a separate vault with its own access log
- Encrypt at rest: per-tenant KMS keys; auditors get read-only role scoped to their entity
- Separate duties: engineers who deploy agents cannot delete audit rows; break-glass deletes emit meta-audit events
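Write-time redaction driven by field classification can be sketched like this. The field names, the classification map, and the in-memory `TokenVault` are hypothetical stand-ins; a real vault would be a separate service with its own storage and access controls.

```python
import secrets

# Hypothetical classification map; unknown fields default to "restricted".
FIELD_CLASSIFICATION = {
    "tool_name": "public",
    "run_id": "internal",
    "customer_email": "restricted",
    "card_number": "pci",
    "diagnosis_code": "phi",
}


class TokenVault:
    """In-memory stand-in for a separate vault that logs every reverse lookup."""

    def __init__(self):
        self._by_token = {}
        self.access_log = []

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)
        self._by_token[token] = value
        return token

    def detokenize(self, token: str, requester: str) -> str:
        self.access_log.append((requester, token))
        return self._by_token[token]


def redact_for_write(record: dict, vault: TokenVault) -> dict:
    """Apply redaction before the record reaches the audit store."""
    out = {}
    for key, value in record.items():
        level = FIELD_CLASSIFICATION.get(key, "restricted")
        if level in ("public", "internal"):
            out[key] = value
        elif level in ("pci", "phi"):
            out[key] = vault.tokenize(str(value))  # reversible only via the vault
        else:
            out[key] = "[REDACTED]"
    return out


vault = TokenVault()
stored = redact_for_write(
    {
        "tool_name": "charge_card",
        "run_id": "run-001",
        "customer_email": "a@example.com",
        "card_number": "4111111111111111",
        "free_text_note": "unclassified field",
    },
    vault,
)
```

Defaulting unclassified fields to the restricted tier means a newly added field is hidden until someone classifies it, rather than leaking by omission.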
Model prompts and completions may contain customer data. Harbor stores prompt
content_hash in the audit stream and keeps full text in a 90-day
restricted vault — enough for dispute resolution without keeping every token
for seven years.
Retention, legal hold and replay
Retention tiers
| Event class | Typical retention | Notes |
|---|---|---|
| Read-only tool calls | 90–180 days | May downsample after 30 days to summary stats |
| Mutating tool calls | 7 years (SOX) or contract-defined | Never sample away; WORM storage |
| Approval attestations | Match mutating retention | Independent copy from workflow tool |
| Debug traces | 14–30 days | Not a substitute for audit tier |
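The tiers in the table could be expressed as a lookup that a TTL job consults. The class keys and the `expires_at` helper are hypothetical, the durations take the upper bound of each range in the table, and seven years is approximated as 7 × 365 days.

```python
from datetime import datetime, timedelta, timezone

# Upper bounds from the table; contract-defined periods would override these.
RETENTION = {
    "read_only_tool_call": timedelta(days=180),
    "mutating_tool_call": timedelta(days=7 * 365),  # approximation of 7 years
    "approval_attestation": timedelta(days=7 * 365),  # matches mutating retention
    "debug_trace": timedelta(days=30),
}


def expires_at(event_class: str, written_at: datetime) -> datetime:
    """Return the earliest time a TTL sweeper may delete the event."""
    # Unknown classes get the longest tier so nothing is deleted early by mistake.
    period = RETENTION.get(event_class, max(RETENTION.values()))
    return written_at + period


written = datetime(2024, 1, 1, tzinfo=timezone.utc)
trace_expiry = expires_at("debug_trace", written)
mutation_expiry = expires_at("mutating_tool_call", written)
```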
Regulatory replay
Auditors should answer “show me this run” from a read-only console that
renders the hash-chained timeline — not from raw S3 grep. Harbor's replay
UI maps each tool.invoked to policy text, approver name, and a diff view
of ledger state. Replay is read-only; re-execution belongs in
staging with synthetic data.
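A read-only replay would plausibly verify the hash chain before rendering anything. The sketch below assumes events are plain dictionaries carrying `prev_event_hash`, `timestamp_utc`, `event_type`, and `actor`; the function names and the output format are illustrative only.

```python
import hashlib
import json
from typing import List, Optional


def event_hash(event: dict) -> str:
    """SHA-256 over the canonical JSON form of one event."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def verify_chain(events: List[dict]) -> Optional[int]:
    """Return the index of the first event whose back-link is wrong, else None."""
    prev = None
    for index, event in enumerate(events):
        if event.get("prev_event_hash") != prev:
            return index
        prev = event_hash(event)
    return None


def render_timeline(events: List[dict]) -> List[str]:
    """Produce display lines only; nothing here calls a tool or writes state."""
    broken = verify_chain(events)
    if broken is not None:
        raise ValueError(f"hash chain broken at event index {broken}")
    return [
        f'{e["timestamp_utc"]} {e["event_type"]} actor={e["actor"]["id"]}'
        for e in events
    ]
```

A change to an earlier event surfaces at its successor, whose stored `prev_event_hash` no longer matches. The last event in a run has no successor, so its digest needs to be recorded somewhere outside the chain for the check to cover it.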
Legal hold
When litigation starts, freeze retention jobs for affected tenant_id +
date range. Tag held events so TTL sweepers skip them; auto-expire hold when legal
clears the matter.
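A TTL sweeper that honors holds might be structured as below. The `LegalHold` record, its `matter_id` field, and the event dictionary shape are assumptions; releasing a hold is modeled simply as removing it from the list passed in.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class LegalHold:
    tenant_id: str
    start: date
    end: date
    matter_id: str  # hypothetical reference to the legal matter


def is_held(event: dict, holds: List[LegalHold]) -> bool:
    """True if any active hold covers this event's tenant and date."""
    day = event["timestamp_utc"].date()
    return any(
        h.tenant_id == event["tenant_id"] and h.start <= day <= h.end
        for h in holds
    )


def sweep(events: List[dict], holds: List[LegalHold],
          now: datetime, retention: timedelta) -> Tuple[List[dict], List[dict]]:
    """Split events into (kept, expired); held events are never expired."""
    kept, expired = [], []
    for event in events:
        past_retention = now - event["timestamp_utc"] > retention
        if past_retention and not is_held(event, holds):
            expired.append(event)
        else:
            kept.append(event)
    return kept, expired
```

Checking the hold inside the sweeper, rather than pausing the whole job, lets retention continue for unaffected tenants while the matter is open.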
Harbor Finance refactor
Before the dedicated audit store, Harbor had: NetSuite native logs (no agent
context), OpenTelemetry (sampled, 14-day retention), and Slack approval threads
(editable, no hash chain). None linked run_id across layers.
Changes shipped
- Middleware on the agent runtime emits audit events synchronously before mutating HTTP returns — if audit write fails, tool call fails closed
- Tier-2+ approvals require proposal_digest match; Slack bot signs attestations with a service key
- Nightly job verifies hash chains per run; broken chains page security
- Auditor role in the replay UI — no production DB credentials
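The fail-closed behavior of the middleware can be illustrated with a small wrapper. The names `audited_call`, `AuditWriteError`, and the `audit_sink` callable are hypothetical; the sink is assumed to raise if the write is not durably stored.

```python
class AuditWriteError(Exception):
    """Raised when the audit store cannot durably record an event."""


def audited_call(tool_name: str, args: dict, audit_sink, tool):
    """Record intent before the mutation; refuse to mutate if that write fails."""
    try:
        audit_sink({"event_type": "tool.invoked", "tool_name": tool_name})
    except Exception as exc:
        # Fail closed: the tool is never called without an audit record.
        raise AuditWriteError(f"audit write failed before {tool_name}") from exc

    result = tool(**args)

    try:
        audit_sink({"event_type": "tool.completed", "tool_name": tool_name})
    except Exception as exc:
        # The effect already happened; surface the failure instead of acking.
        raise AuditWriteError(f"audit write failed after {tool_name}") from exc
    return result


recorded = []
posted = []


def post_journal_entry(amount: int) -> str:
    """Stand-in for the mutating tool."""
    posted.append(amount)
    return "posted"


def broken_sink(event: dict) -> None:
    """Stand-in for an audit store that is down."""
    raise ConnectionError("audit store unavailable")


status = audited_call("post_journal_entry", {"amount": 100},
                      recorded.append, post_journal_entry)
```

With `broken_sink` in place of `recorded.append`, the call raises `AuditWriteError` and `post_journal_entry` is never invoked.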
Outcomes
- SOX finding remediation: 11 days → 6 hours on the next review
- Unauthorized mutation attempts dropped 2.1% → 0.04% (mostly caught at attestation mismatch, not post-hoc)
- Storage cost +$1,400/month on 40k mutating runs — accepted vs audit risk
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| OpenTelemetry only | Internal dev agents, read-only tools | SOX, HIPAA, PCI; any sampled or short-TTL store |
| Chat transcript export | Low-stakes support bots | Proving tool side effects; non-repudiation; retention |
| Append-only audit event store | Regulated mutating agents, finance, healthcare | Cost at extreme volume without tiered retention |
| Blockchain / public ledger anchoring | Multi-party distrust, external attestations | Latency, cost, privacy for most enterprise agents |
| Workflow tool logs only (Jira, ServiceNow) | Human-only approvals | Agent autonomous steps between human clicks |
Common pitfalls
- Async audit write after tool success: a crash between the effect and the log creates a permanent gap; write before ack.
- Logging raw API keys or card numbers: turns the audit store into a breach target; digest and tokenize.
- No hash chain or WORM: DBAs can edit rows; auditors will ask.
- Stale approval execution: model changes args post-approval; enforce proposal_digest match.
- Subagent mutations unattributed: parent run_id must cascade; child actor = delegated scope.
- Cancel without audit terminal: an orphaned run.started with no terminal event is indistinguishable from a run whose later events were removed.
- Same retention for reads and writes: storage bill explodes; tier mutating events separately.
- Replay that re-executes production tools: accidental double-post; replay UI is read-only.
Production checklist
- Define versioned audit event schema with run_id correlation.
- Emit events synchronously before mutating tool HTTP returns (fail closed).
- Hash-chain events per run; nightly integrity verification job.
- Store policy_hash and proposal_digest on every write.
- Wire approval attestations from permission gates into audit stream.
- Redact PII/PCI at write time; tokenize reversible identifiers in vault.
- Separate retention tiers: mutating (years) vs read-only (months).
- Read-only replay UI for auditors without production credentials.
- Legal hold flag bypassing TTL on affected tenants and date ranges.
- Export audit_event_id into traces for engineering cross-link.
- Alert on hash-chain breaks and on mutating tools without attestation.
- Document what audit proves vs what still requires external system logs.
Key takeaways
- Traces debug; audit trails prove — different retention, schema, and consumers.
- Harbor cut SOX remediation 11 days → 6 hours with hash-chained attestations.
- Fail closed if audit write fails before a mutating tool acks.
- Approvals bind to digests, not to conversational intent.
- Replay is read-only — re-execution belongs in staging.
Related reading
- Agent observability and tracing — latency spans that complement but do not replace audit logs
- Permission scoping and approval gates — tiered gates that produce attestations
- Human-in-the-loop — escalation queues whose decisions must be logged
- Durable agent execution — run_id and checkpoint alignment with audit events