LLM agent run audit trail and compliance logging explained

Harbor Finance deployed a month-end close agent that reconciled sub-ledgers and posted adjusting journal entries through NetSuite. During a SOX walkthrough, auditors asked for evidence that a $4.2M debit to accrued liabilities was authorized. OpenTelemetry traces showed the agent called post_journal_entry in 1.8 seconds — but spans do not record who approved a mutating action, which policy version gated it, or what the ledger looked like before and after. Internal controls flagged a material weakness; remediation took eleven days of manual log archaeology across three systems.

Agent audit trails are append-only, tamper-evident event streams designed for regulators, security teams, and legal — not for debugging latency. They differ from distributed traces (ephemeral, sampled, engineer-oriented) and from chat transcripts (mutable, incomplete, often missing tool side effects). Harbor split concerns: traces for SRE, a dedicated agent_audit store for compliance. Each mutating tool call now emits a signed event with actor, policy hash, approval attestation, argument digest, and before/after state references. SOX remediation on the next finding dropped from 11 days to 6 hours because auditors could replay the run timeline without engineering access. This guide covers audit event schemas, attestation chains, redaction, retention, the Harbor Finance refactor, a technique decision table, pitfalls, and a production checklist.

Audit logs vs observability traces

Teams often point auditors at the same Jaeger or Datadog deployment engineers use. That fails compliance reviews for predictable reasons.

What traces optimize for

Latency and error rates — parent/child span trees, percentile dashboards
Sampling — 1–10% of traffic is normal; auditors need 100% of mutating actions
Short retention: 7–30 days is typical; SOX evidence is commonly kept for years, and GDPR adds its own limits on how long personal data may be held
Mutable backends — re-indexing and TTL deletes are features, not bugs

What audit trails must prove

Non-repudiation — the event happened; the log was not silently edited afterward
Actor identity — human user, service account, or delegated subagent with scope
Policy context — which permission manifest and approval gate version applied
Causality — ordered chain from user intent through model turn to side effect
Reconstructability — enough payload to explain the decision without raw secrets

Keep both systems. Export a correlation ID from audit events into traces so engineers can jump from a compliance ticket to a flame graph — but never treat traces as the system of record for regulated mutations.

Immutable event schema

Design one canonical JSON (or Protobuf) envelope every agent runtime emits. Version the schema; never break readers on old runs.

Core fields

event_id — UUIDv7 or ULID for time-sortable uniqueness
run_id — ties to durable execution checkpoints
tenant_id / environment — prod vs staging isolation
event_type — run.started, tool.invoked, tool.completed, approval.granted, run.terminal
actor — { type: human|agent|system, id, session_id }
timestamp_utc — server clock; include client_timestamp only as hint
policy_hash — SHA-256 of the active capability manifest at invoke time
prev_event_hash — hash chain for tamper detection within a run
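
As an illustration of the core fields, here is a minimal Python sketch of an envelope with a per-run hash chain. The names AuditEvent, event_hash, append_event, verify_chain, and schema_version are hypothetical, not Harbor's actual code. It assumes canonical JSON (sorted keys, fixed separators) as the hashing input and uses uuid4 as a stand-in where the text recommends UUIDv7 or ULID.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    event_id: str
    run_id: str
    tenant_id: str
    environment: str
    event_type: str
    actor: dict
    timestamp_utc: str
    policy_hash: str
    prev_event_hash: Optional[str]  # None for the first event in a run
    schema_version: int = 1  # assumed field; lets readers handle old runs


def event_hash(event: AuditEvent) -> str:
    """SHA-256 over a canonical JSON form so the hash is reproducible."""
    canonical = json.dumps(asdict(event), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_event(chain: List[AuditEvent], **fields) -> AuditEvent:
    """Build the next event and link it to the hash of the previous one."""
    prev = event_hash(chain[-1]) if chain else None
    event = AuditEvent(
        event_id=str(uuid.uuid4()),  # stand-in; swap for a UUIDv7 or ULID generator
        timestamp_utc=datetime.now(timezone.utc).isoformat(),  # server clock
        prev_event_hash=prev,
        **fields,
    )
    chain.append(event)
    return event


def verify_chain(chain: List[AuditEvent]) -> bool:
    """Return True only if every prev_event_hash matches the recomputed hash."""
    expected = None
    for event in chain:
        if event.prev_event_hash != expected:
            return False
        expected = event_hash(event)
    return True
```

A chain like this only detects edits to events that a later event points at. The final event of a run needs its hash anchored somewhere else (for example in WORM storage) if edits to it must also be detectable.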

Tool invocation payload

Log digests, not raw secrets. Store args_sha256 plus a redacted preview (account_id: ****4821). For financial or healthcare tools, attach state_before_ref and state_after_ref pointers to immutable object storage (WORM bucket or ledger table) rather than inline blobs.
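
The payload rule above can be sketched in Python as follows. This is an assumption-laden example: SENSITIVE_KEYS, mask, redacted_preview, tool_invocation_payload, and the worm:// reference strings are invented for illustration, and amounts are passed as strings so the canonical JSON does not depend on float formatting.

```python
import hashlib
import json
from typing import Optional

# Hypothetical list of argument names that must never be stored in clear text.
SENSITIVE_KEYS = {"account_id", "card_number"}


def args_sha256(args: dict) -> str:
    """Digest of the arguments in canonical JSON form (key order does not matter)."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def mask(value, keep: int = 4) -> str:
    """Keep only the last few characters, e.g. 100034821 becomes ****4821."""
    text = str(value)
    return "****" + text[-keep:] if len(text) > keep else "****"


def redacted_preview(args: dict) -> dict:
    """Human-readable preview with sensitive values masked."""
    return {k: mask(v) if k in SENSITIVE_KEYS else v for k, v in args.items()}


def tool_invocation_payload(
    tool_name: str,
    args: dict,
    state_before_ref: str,
    state_after_ref: Optional[str] = None,
) -> dict:
    """Payload for a tool.invoked event: digest, preview, and state pointers only."""
    return {
        "tool_name": tool_name,
        "args_sha256": args_sha256(args),
        "args_preview": redacted_preview(args),
        "state_before_ref": state_before_ref,  # pointer into immutable storage
        "state_after_ref": state_after_ref,    # filled in once the call completes
    }


payload = tool_invocation_payload(
    "post_journal_entry",
    {"account_id": "100034821", "amount": "4200000.00"},
    state_before_ref="worm://ledger-snapshots/example-before",  # made-up reference
)
```

One caveat on the design: a plain SHA-256 of short, guessable arguments can be reversed by enumeration, so a keyed digest may be worth considering for low-entropy values.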

Terminal events

Every run ends with exactly one terminal event (the run.terminal class listed under Core fields): run.succeeded, run.failed, run.cancelled, or run.timed_out. These mirror the finite-state machine (FSM) described in cancellation lifecycle guides. Include side_effect_count, mutating_tool_count, and total_cost_usd for chargeback and anomaly detection.

Approval attestations and policy gates

Mutating tools should not execute on model enthusiasm alone. Harbor wires tiered approval gates into the audit stream so every write carries proof of authorization.

Attestation record

gate_tier — 0 (auto) through 3 (dual human)
approver_id — null for tier 0; SSO subject for human tiers
approval_method — inline UI click, Slack reaction, ticket ID, break-glass code
approval_latency_ms — time from proposal to execute
proposal_digest — hash of the exact tool call the human saw

If the model revises arguments after approval, treat the revision as a new proposal and never execute on a stale attestation. Harbor learned this when a journal-entry agent changed the amount by $200K between the approval screen and execution; the mismatch now hard-fails with attestation_mismatch and emits a security event.
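
A minimal sketch of the digest check, assuming the digest covers the tool name plus canonical JSON arguments. The function and exception names are hypothetical, and the article does not say how Harbor computes proposal_digest.

```python
import hashlib
import hmac
import json


class AttestationMismatch(Exception):
    """The call about to execute is not the call the approver saw."""


def proposal_digest(tool_name: str, args: dict) -> str:
    """Hash of the exact tool call shown on the approval screen."""
    canonical = json.dumps(
        {"tool": tool_name, "args": args}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_attestation(attestation: dict, tool_name: str, args: dict) -> None:
    """Raise unless the arguments being executed match the approved digest."""
    actual = proposal_digest(tool_name, args)
    # compare_digest avoids leaking how many leading characters matched
    if not hmac.compare_digest(attestation["proposal_digest"], actual):
        raise AttestationMismatch("attestation_mismatch: arguments changed after approval")


approved_args = {"account": "accrued_liabilities", "amount": "4200000.00"}
attestation = {
    "gate_tier": 2,
    "approver_id": "sso|example-approver",  # illustrative value
    "proposal_digest": proposal_digest("post_journal_entry", approved_args),
}

check_attestation(attestation, "post_journal_entry", approved_args)  # passes silently
```

Calling check_attestation with a revised amount raises AttestationMismatch. The caller is expected to log a security event and send the revised call back through the gate as a fresh proposal.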

Human-in-the-loop continuity

When runs escalate to operators via human-in-the-loop queues, log queue entry, assignment, override reason, and release. Auditors care as much about denied actions as approved ones — emit approval.denied with rationale text (redacted if needed).

Redaction, encryption and access control

Audit logs that leak PII create a second compliance problem. Apply the same rules as PII redaction pipelines at write time, not at query time.

Field-level classification — tag schema fields public, internal, restricted, pci, phi
Tokenize identifiers — reversible tokens only in a separate vault with its own access log
Encrypt at rest — per-tenant KMS keys; auditors get read-only role scoped to their entity
Separate duties — engineers who deploy agents cannot delete audit rows; break-glass deletes emit meta-audit events
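
The classification and tokenization rules above might look like this in Python. Everything here is an assumption for illustration: the FIELD_CLASSES map, the in-memory TokenVault, and the choice to tokenize restricted fields while dropping pci and phi values entirely. A real vault would be a separate service with durable storage.

```python
import secrets

# Hypothetical schema tags; unknown fields are treated as restricted.
FIELD_CLASSES = {
    "tool_name": "public",
    "run_id": "internal",
    "customer_email": "restricted",
    "card_number": "pci",
}


class TokenVault:
    """In-memory stand-in for a separate vault that logs every reversal."""

    def __init__(self):
        self._by_token = {}
        self._by_value = {}
        self.access_log = []

    def tokenize(self, value: str) -> str:
        if value not in self._by_value:
            token = "tok_" + secrets.token_hex(8)
            self._by_value[value] = token
            self._by_token[token] = value
        return self._by_value[value]

    def detokenize(self, token: str, requester: str) -> str:
        self.access_log.append((requester, token))  # the vault's own access log
        return self._by_token[token]


def redact_at_write(record: dict, vault: TokenVault) -> dict:
    """Return the form of the record that is allowed to reach the audit store."""
    stored = {}
    for field, value in record.items():
        cls = FIELD_CLASSES.get(field, "restricted")
        if cls in ("public", "internal"):
            stored[field] = value
        elif cls == "restricted":
            stored[field] = vault.tokenize(str(value))
        else:  # pci, phi: never stored, not even as a token
            stored[field] = "[REDACTED:" + cls + "]"
    return stored
```

Because redaction runs before the write, a later bug in query-time filtering cannot expose values that were never stored.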

Model prompts and completions may contain customer data. Harbor stores only a content_hash of each prompt in the audit stream and keeps the full text in a 90-day restricted vault, which is enough for dispute resolution without keeping every token for seven years.

Retention, legal hold and replay

Retention tiers

Event class	Typical retention	Notes
Read-only tool calls	90–180 days	May downsample after 30 days to summary stats
Mutating tool calls	7 years (SOX) or contract-defined	Never sample away; WORM storage
Approval attestations	Match mutating retention	Independent copy from workflow tool
Debug traces	14–30 days	Not a substitute for audit tier

Regulatory replay

Auditors should answer “show me this run” from a read-only console that renders the hash-chained timeline — not from raw S3 grep. Harbor's replay UI maps each tool.invoked to policy text, approver name, and a diff view of ledger state. Replay is read-only; re-execution belongs in staging with synthetic data.

Legal hold

When litigation starts, freeze retention jobs for the affected tenant_id and date range. Tag held events so TTL sweepers skip them, and release the hold only once legal clears the matter.
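
One way a TTL sweeper could combine the retention tiers with legal holds is sketched below. The RETENTION values pick one point from the ranges in the table (180 days for reads, 7 years for writes and attestations), and LegalHold, under_hold, and may_delete are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# Assumed retention per event class, taken from the table's ranges.
RETENTION = {
    "read_only_tool_call": timedelta(days=180),
    "mutating_tool_call": timedelta(days=7 * 365),
    "approval_attestation": timedelta(days=7 * 365),
}


@dataclass(frozen=True)
class LegalHold:
    tenant_id: str
    start: datetime
    end: datetime
    released: bool = False  # set to True only when legal clears the matter


def under_hold(tenant_id: str, ts: datetime, holds: List[LegalHold]) -> bool:
    """True if any active hold covers this tenant and timestamp."""
    return any(
        not h.released and h.tenant_id == tenant_id and h.start <= ts <= h.end
        for h in holds
    )


def may_delete(
    event_class: str,
    tenant_id: str,
    ts: datetime,
    now: datetime,
    holds: List[LegalHold],
) -> bool:
    """The sweeper may delete only expired events that are not under hold."""
    if under_hold(tenant_id, ts, holds):
        return False
    return now - ts > RETENTION[event_class]
```

Checking the hold before the age test means a misconfigured retention value can never override a hold.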

Harbor Finance refactor

Before the dedicated audit store, Harbor had: NetSuite native logs (no agent context), OpenTelemetry (sampled, 14-day retention), and Slack approval threads (editable, no hash chain). None linked run_id across layers.

Changes shipped

Middleware on the agent runtime emits audit events synchronously before the mutating HTTP call returns; if the audit write fails, the tool call fails closed
Tier-2+ approvals require proposal_digest match; Slack bot signs attestations with a service key
Nightly job verifies hash chains per run; broken chains page security
Auditor role in the replay UI — no production DB credentials
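
The fail-closed middleware can be sketched as a wrapper that records tool.invoked before the tool runs, so an audit outage blocks the mutation instead of leaving it unlogged. This is a simplified illustration with hypothetical names (InMemoryAuditStore, audited_call), not Harbor's middleware, and the store is an in-memory stand-in.

```python
class AuditWriteError(Exception):
    """The audit store rejected or could not persist an event."""


class InMemoryAuditStore:
    """Stand-in for the append-only store; set fail=True to simulate an outage."""

    def __init__(self, fail: bool = False):
        self.events = []
        self.fail = fail

    def append(self, event: dict) -> None:
        if self.fail:
            raise AuditWriteError("audit store unavailable")
        self.events.append(event)


def audited_call(store, run_id: str, tool_name: str, tool_fn, args: dict):
    """Run a mutating tool only after its invocation has been recorded."""
    base = {"run_id": run_id, "tool": tool_name}
    # If this write raises, the exception propagates and tool_fn never runs.
    store.append({**base, "event_type": "tool.invoked"})
    try:
        result = tool_fn(**args)
    except Exception as exc:
        # Record the failure, then let the original error reach the caller.
        store.append({**base, "event_type": "tool.completed",
                      "status": "error", "error": type(exc).__name__})
        raise
    # If this write raises, the caller gets no ack, but tool.invoked remains as evidence.
    store.append({**base, "event_type": "tool.completed", "status": "ok"})
    return result
```

In this sketch a crash after the effect still leaves a tool.invoked event without a matching tool.completed. A reconciliation job can flag that state for review.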

Outcomes

SOX finding remediation: 11 days → 6 hours on the next review
Unauthorized mutation attempts dropped 2.1% → 0.04% (mostly caught at attestation mismatch, not post-hoc)
Storage cost rose by $1,400/month on 40k mutating runs, which Harbor accepted as the price of lower audit risk

Technique decision table

Approach	Best for	Weak when
OpenTelemetry only	Internal dev agents, read-only tools	SOX, HIPAA, PCI; any sampled or short-TTL store
Chat transcript export	Low-stakes support bots	Proving tool side effects; non-repudiation; retention
Append-only audit event store	Regulated mutating agents, finance, healthcare	Cost at extreme volume without tiered retention
Blockchain / public ledger anchoring	Multi-party distrust, external attestations	Latency, cost, privacy for most enterprise agents
Workflow tool logs only (Jira, ServiceNow)	Human-only approvals	Agent autonomous steps between human clicks

Common pitfalls

Async audit write after tool success — crash between effect and log creates permanent gap; write before ack.
Logging raw API keys or card numbers — turns audit store into a breach target; digest and tokenize.
No hash chain or WORM — DBAs can edit rows; auditors will ask.
Stale approval execution — model changes args post-approval; enforce proposal_digest match.
Subagent mutations unattributed — parent run_id must cascade; child actor = delegated scope.
Cancel without audit terminal: an orphaned run.started with no terminal event is indistinguishable from a hidden or interrupted action.
Same retention for reads and writes — storage bill explodes; tier mutating events separately.
Replay that re-executes production tools — accidental double-post; replay UI is read-only.

Production checklist

Define versioned audit event schema with run_id correlation.
Emit events synchronously before the mutating tool's HTTP call returns (fail closed).
Hash-chain events per run; nightly integrity verification job.
Store policy_hash and proposal_digest on every write.
Wire approval attestations from permission gates into audit stream.
Redact PII/PCI at write time; tokenize reversible identifiers in vault.
Separate retention tiers: mutating (years) vs read-only (months).
Read-only replay UI for auditors without production credentials.
Legal hold flag bypassing TTL on affected tenants and date ranges.
Export audit_event_id into traces for engineering cross-link.
Alert on hash-chain breaks and on mutating tools without attestation.
Document what audit proves vs what still requires external system logs.

Key takeaways

Traces debug; audit trails prove — different retention, schema, and consumers.
Harbor cut SOX remediation 11 days → 6 hours with hash-chained attestations.
Fail closed if audit write fails before a mutating tool acks.
Approvals bind to digests, not to conversational intent.
Replay is read-only — re-execution belongs in staging.