LLM agent cost attribution and token accounting explained

Harbor Support shipped a tier-1 ticket-resolution agent that looked cheap in staging: one GPT-4-class call per turn, a short system prompt, and a handful of tools. In production the mean run cost climbed to $12.40 per ticket. Finance assumed the model price list was wrong. It was not. Each ticket spawned an unbounded subagent research branch, three corrective RAG re-retrievals after low-confidence answers, and two full retry loops when the CRM tool timed out. Cloud invoices showed a single API key and one monthly total, so nobody could say which workflow, tenant, or tool step burned the budget.

Cost attribution tags every token and external API dollar to a run, step, tool call, and tenant so product and finance can reason about unit economics. Token accounting maintains a ledger inside the agent runtime: input tokens, output tokens, cached-prefix hits, embedding calls, and third-party surcharges decrement a per-run budget and emit metrics before the run finishes. Harbor added span-level cost tags, a run ledger, and hard caps tied to loop termination; mean cost per resolved ticket fell to $4.10 over eight weeks while first-contact resolution stayed within one point of baseline. This guide covers why agent spend is opaque, ledger design, attribution dimensions, budget enforcement, integration with observability and multi-tenant isolation, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why agent costs are harder than chatbot costs

A single-turn chat app bills predictably: one request, one response, one line item. Agents multiply spend along several axes that cloud dashboards rarely split:

Multi-turn loops — each ReAct iteration resends growing context; cost is superlinear in turns.
Tool side effects — retrieval embeds, rerankers, web search, and OCR pipelines charge outside the main completion API.
Subagent fan-out — a supervisor spawning three children can quadruple spend while the parent trace looks like “one run.”
Retries and fallbacks — retrying on a larger model after a failure pays for the failed attempt plus a pricier one, so cost more than doubles for the same user-visible outcome.
Shared infrastructure — one API key across teams hides which product line is uneconomic.

Without per-run ledgers, the only lever is a blunt rate limit that throttles good traffic along with runaway loops.
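To make the multi-turn point concrete, here is a small arithmetic sketch. It assumes a hypothetical 1,500-token base prompt and 400 tokens appended to the history per iteration (model output plus tool result); neither figure comes from Harbor's data, and real growth depends on tool output sizes and any history compression.

```python
def cumulative_prompt_tokens(turns: int, base_tokens: int, added_per_turn: int) -> int:
    """Total prompt tokens billed across `turns` ReAct iterations.

    Iteration i (0-based) resends the base prompt plus everything
    appended by the previous i iterations.
    """
    return sum(base_tokens + i * added_per_turn for i in range(turns))


# Hypothetical sizes, chosen only for illustration.
BASE_TOKENS, ADDED_PER_TURN = 1_500, 400

for turns in (1, 5, 10, 20):
    total = cumulative_prompt_tokens(turns, BASE_TOKENS, ADDED_PER_TURN)
    print(f"{turns:>2} turns -> {total:>7,} prompt tokens")
```

Under these assumptions, going from 10 to 20 turns raises billed prompt tokens from 33,000 to 106,000, more than triple for twice the turns. Where a provider offers prefix caching, the resent portion is billed at a lower rate, but the token count itself still grows the same way.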

What to measure

Tokens — prompt, completion, cached, reasoning (where billed separately)
External APIs — search, speech, vision, vector DB read units
Wall time — correlates with worker occupancy even when tokens are flat
Human time — escalations after budget exhaustion are a hidden cost

Run ledger and token accounting

A run ledger is an append-only record owned by the orchestrator, not the model. Every billable event decrements budget_remaining_usd (or token equivalents) and increments dimensional counters.

Ledger event shape

run_id, parent_run_id (for subagents), step_index
event_type — llm_completion | embedding | tool_surcharge
model, input_tokens, output_tokens, cached_tokens
unit_price_snapshot — price table version at event time
cost_usd — computed server-side; never trust client estimates
tenant_id, product, environment
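The event shape above could be sketched as a frozen dataclass with cost computed server-side from a versioned price table. The price table, its version string, the model name, and the price_completion helper are all hypothetical. The sketch also assumes cached_tokens is reported as a subset of input_tokens, which varies by provider.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Hypothetical price table: USD per 1M tokens, keyed by (version, model).
PRICE_TABLES = {
    ("2024-06-v1", "example-large"): {
        "input": Decimal("5.00"),
        "cached": Decimal("2.50"),
        "output": Decimal("15.00"),
    },
}


@dataclass(frozen=True)
class LedgerEvent:
    run_id: str
    parent_run_id: Optional[str]  # set for subagent runs
    step_index: int
    event_type: str  # "llm_completion" | "embedding" | "tool_surcharge"
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    unit_price_snapshot: str  # price table version at event time
    cost_usd: Decimal
    tenant_id: str
    product: str
    environment: str


def price_completion(model: str, version: str, input_tokens: int,
                     output_tokens: int, cached_tokens: int) -> Decimal:
    """Compute cost from provider-reported usage, never from client estimates.

    Assumption: cached_tokens is counted inside input_tokens.
    """
    prices = PRICE_TABLES[(version, model)]
    uncached = input_tokens - cached_tokens
    total = (uncached * prices["input"]
             + cached_tokens * prices["cached"]
             + output_tokens * prices["output"])
    return total / Decimal(1_000_000)


cost = price_completion("example-large", "2024-06-v1",
                        input_tokens=12_000, output_tokens=500, cached_tokens=8_000)
event = LedgerEvent(
    run_id="run-123", parent_run_id=None, step_index=0,
    event_type="llm_completion", model="example-large",
    input_tokens=12_000, output_tokens=500, cached_tokens=8_000,
    unit_price_snapshot="2024-06-v1", cost_usd=cost,
    tenant_id="tenant-a", product="support_tier1", environment="prod",
)
print(event.cost_usd)
```

Decimal avoids float rounding drift when thousands of small events are summed, and storing the price table version on each event lets a later report recompute or audit the charge after prices change.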

Budget scopes

Scope	Enforced when	Typical limit
Per run	Throughout each ticket, task, or session	$0.50–$8 depending on SLA
Per step	Before each LLM or expensive tool call	25–40% of run cap
Per tenant / day	At the API gateway, before a run starts	Contract tier
Per subagent tree	When the supervisor delegates work	Child inherits a capped slice of parent budget

When budget hits zero, the runtime should stop with a structured budget_exhausted outcome — not silently truncate context and hallucinate. Pair caps with context budget rules so compression triggers before hard failure.
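A minimal sketch of a per-run budget with a per-step cap follows. The RunBudget class, the 40% step fraction, and the step_cap_exceeded outcome name are illustrative assumptions; only budget_exhausted and budget_remaining_usd are taken from the text above.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Tuple


@dataclass
class RunBudget:
    run_id: str
    budget_usd: Decimal
    step_cap_fraction: Decimal = Decimal("0.40")  # hypothetical, within the 25-40% range
    spent_usd: Decimal = Decimal("0")
    events: List[Tuple[int, Decimal]] = field(default_factory=list)

    @property
    def budget_remaining_usd(self) -> Decimal:
        return self.budget_usd - self.spent_usd

    def check_step(self, estimated_cost_usd: Decimal) -> dict:
        """Decide whether the next billable call may start."""
        if estimated_cost_usd > self.budget_usd * self.step_cap_fraction:
            return self._stop("step_cap_exceeded")  # hypothetical outcome name
        if estimated_cost_usd > self.budget_remaining_usd:
            return self._stop("budget_exhausted")
        return {"allowed": True, "outcome": None}

    def record(self, step_index: int, cost_usd: Decimal) -> None:
        """Append the actual cost once the provider reports usage."""
        self.events.append((step_index, cost_usd))
        self.spent_usd += cost_usd

    def _stop(self, outcome: str) -> dict:
        # Structured result the orchestrator can route to escalation.
        return {
            "allowed": False,
            "outcome": outcome,
            "run_id": self.run_id,
            "spent_usd": str(self.spent_usd),
            "budget_remaining_usd": str(self.budget_remaining_usd),
        }


budget = RunBudget(run_id="run-123", budget_usd=Decimal("6.00"))
for step in range(3):
    if budget.check_step(Decimal("2.00"))["allowed"]:
        budget.record(step, Decimal("2.00"))

print(budget.check_step(Decimal("0.25")))
```

Checking before the call, rather than after, is what lets the orchestrator return a clean outcome. A check that runs only after the spend has happened can report the overrun but cannot prevent it.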

Attribution dimensions and observability

Cost tags must ride the same identifiers as traces so engineers can pivot from a spike in spend to the exact tool span.

Minimum tag set

run_id, agent_name, agent_version
tenant_id or customer_id for B2B attribution
feature — e.g. support_tier1, onboarding
tool_name on tool spans; model on LLM spans
outcome — resolved, escalated, failed, budget_exhausted
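One way to keep the tag set honest is to validate span attributes before export. The sketch below is an assumption-heavy illustration: the span kinds ("run", "llm", "tool") and the missing_tags helper are hypothetical names, and it treats outcome as required only on the run-level span because the outcome is not known until the run ends.

```python
from typing import Dict, List

# Tags every billable span should carry.
BASE_TAGS = {"run_id", "agent_name", "agent_version", "tenant_id", "feature"}

# Extra tags by span kind (kinds are hypothetical labels for this sketch).
KIND_TAGS = {
    "run": {"outcome"},
    "llm": {"model"},
    "tool": {"tool_name"},
}


def missing_tags(span_kind: str, attributes: Dict[str, object]) -> List[str]:
    """Return the required tags that are absent or empty on a span."""
    required = BASE_TAGS | KIND_TAGS.get(span_kind, set())
    present = {key for key, value in attributes.items() if value not in (None, "")}
    if "customer_id" in present:
        # Either identifier satisfies B2B attribution.
        present.add("tenant_id")
    return sorted(required - present)


tool_span = {
    "run_id": "run-123",
    "agent_name": "support_agent",
    "agent_version": "1.4.0",
    "customer_id": "cust-42",
    "feature": "support_tier1",
}
print(missing_tags("tool", tool_span))
```

Running a check like this in CI or at the exporter catches untagged spend before it reaches a dashboard as an "unknown" bucket.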

Rollups finance actually uses

Mean and p95 cost per successful outcome (not per attempt)
Cost per tenant per month vs contract revenue
Tool-level share: “42% of spend is embedding + rerank on search_kb”
Model mix: share of spend on premium vs routed small models

Export ledger summaries to your metrics stack nightly. Cloud provider invoices are reconciliation inputs, not the source of truth for product decisions.
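The first rollup can be read two ways, and the sketch below computes both: total spend divided by successful outcomes (which charges failures to the successes), and the mean and p95 cost of the successful runs themselves. The run records, the choice of "resolved" as the only success outcome, and the nearest-rank p95 are assumptions made for illustration.

```python
import math
from statistics import mean
from typing import Dict, List, Optional, Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def outcome_rollup(runs: List[Dict], success=("resolved",)) -> Optional[Dict[str, float]]:
    """Summarize run costs per successful outcome and per attempt."""
    successes = [r["cost_usd"] for r in runs if r["outcome"] in success]
    if not successes:
        return None  # no successful outcome to divide by
    total_spend = sum(r["cost_usd"] for r in runs)
    return {
        "spend_per_success": total_spend / len(successes),
        "mean_cost_successful_runs": mean(successes),
        "p95_cost_successful_runs": p95(successes),
        "mean_cost_per_attempt": total_spend / len(runs),
    }


# Hypothetical run summaries exported from the ledger.
runs = [
    {"run_id": "a", "outcome": "resolved", "cost_usd": 4.0},
    {"run_id": "b", "outcome": "resolved", "cost_usd": 2.0},
    {"run_id": "c", "outcome": "budget_exhausted", "cost_usd": 6.0},
    {"run_id": "d", "outcome": "failed", "cost_usd": 1.0},
]
print(outcome_rollup(runs))
```

In this toy data the mean cost per attempt is 3.25 while spend per resolved ticket is 6.50, which is the gap that per-attempt reporting hides.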

Tool metering and hidden surcharges

Tools are where agent economics go wrong. Wrap every tool executor with a metering hook that records estimated cost before execution and actual cost after.

Common surcharges

Vector retrieval — embed query + top-k reads + reranker LLM call
Web search — per-query vendor fee plus summarization pass
Browser automation — headless browser minutes and captcha solves
Code sandbox — CPU-seconds and egress
Human approval wait — worker lease while paused at a gate

Metering pattern

Declare estimated_cost_usd in the tool manifest for planner visibility
Reserve budget atomically before call; release unused portion on fast success
On timeout, charge reserved amount and surface partial result to the model
Block tools whose rolling p95 cost exceeds policy without explicit override

For parallel tool calls, reserve the sum of estimates up front; concurrent calls should not each assume full budget availability.
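The reserve-then-settle pattern, including the aggregate reservation for parallel calls, might look like the following. BudgetReservations and its method names are hypothetical; the sketch uses an in-process lock, so a multi-worker deployment would need the same logic in a shared store.

```python
import threading
from decimal import Decimal
from typing import Dict, Optional


class BudgetReservations:
    def __init__(self, budget_usd: Decimal) -> None:
        self._lock = threading.Lock()
        self._remaining = budget_usd
        self._reserved: Dict[int, Decimal] = {}
        self._next_ticket = 0

    @property
    def remaining_usd(self) -> Decimal:
        with self._lock:
            return self._remaining

    def reserve(self, estimates: Dict[str, Decimal]) -> Optional[Dict[str, int]]:
        """Reserve every estimate atomically, or nothing if the sum does not fit."""
        total = sum(estimates.values(), Decimal("0"))
        with self._lock:
            if total > self._remaining:
                return None
            self._remaining -= total
            tickets = {}
            for tool_name, amount in estimates.items():
                self._next_ticket += 1
                self._reserved[self._next_ticket] = amount
                tickets[tool_name] = self._next_ticket
            return tickets

    def settle(self, ticket: int, actual_usd: Optional[Decimal] = None) -> Decimal:
        """Charge the actual cost and release the rest.

        Pass actual_usd=None on timeout to charge the full reservation.
        An actual cost above the reservation is charged in full here;
        a real system would need an explicit overrun policy.
        """
        with self._lock:
            reserved = self._reserved.pop(ticket)
            charged = reserved if actual_usd is None else actual_usd
            self._remaining += reserved - charged
            return charged


pool = BudgetReservations(Decimal("1.00"))
tickets = pool.reserve({"search_kb": Decimal("0.30"), "web_search": Decimal("0.50")})
rejected = pool.reserve({"ocr": Decimal("0.40")})    # only 0.20 left, so None
pool.settle(tickets["search_kb"], Decimal("0.10"))   # fast success, 0.20 released
pool.settle(tickets["web_search"])                   # timeout, full 0.50 charged
print(rejected, pool.remaining_usd)
```

Reserving the sum under one lock is what stops two concurrent tool calls from each seeing the full remaining budget and jointly overspending it.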

Enforcement vs reporting

Attribution without enforcement documents waste after the fact. Production systems need both.

Enforcement modes

Hard cap — abort run at budget; escalate to human or cheaper path
Soft cap — switch to smaller model and disable expensive tools
Progressive degradation — reduce retrieval k, skip reranker, shorten history
Admission control — reject new runs when tenant daily budget exceeded

Reporting-only mode

Useful during rollout: emit ledger events and dashboards but do not block. Limit reporting-only to two weeks; teams ignore dashboards that never cut a run.

Tie enforcement to model routing: when spend velocity exceeds a threshold mid-run, route subsequent steps to a cheaper model with a system note that precision may drop.
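The enforcement modes can be combined into one policy function that the orchestrator consults before each step. In this sketch the 50% and 80% thresholds, the model names, and the retrieval settings are all hypothetical. It keys on the fraction of budget already spent rather than spend velocity, which would additionally need timestamps from the ledger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepPolicy:
    mode: str
    proceed: bool
    model: str
    retrieval_k: int
    use_reranker: bool


def choose_policy(spent_usd: float, budget_usd: float) -> StepPolicy:
    """Pick an enforcement mode from the share of the run budget already used."""
    used = spent_usd / budget_usd
    if used >= 1.0:
        # Hard cap: stop and hand off to a human or a cheaper path.
        return StepPolicy("hard_cap", False, "none", 0, False)
    if used >= 0.8:
        # Soft cap: smaller model, expensive tools disabled.
        return StepPolicy("soft_cap", True, "example-small", 3, False)
    if used >= 0.5:
        # Progressive degradation: keep the model, trim retrieval.
        return StepPolicy("progressive_degradation", True, "example-large", 5, False)
    return StepPolicy("normal", True, "example-large", 10, True)


for spent in (1.0, 3.0, 5.0, 6.0):
    policy = choose_policy(spent, budget_usd=6.0)
    print(f"spent ${spent:.2f}: {policy.mode}, model={policy.model}, k={policy.retrieval_k}")
```

Keeping the thresholds in one function makes the degradation path reviewable and testable, instead of scattering budget checks across individual tool wrappers.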

Harbor Support refactor

Before metering, Harbor could attribute LLM spend only to the API key used by the Zendesk integration. Support leads estimated cost per ticket by dividing the monthly invoice by ticket volume. Subagent research, embedding indexes, and retry storms were invisible.

Changes shipped

Run ledger in the orchestrator with per-ticket budget_usd: 6.00 default
OpenTelemetry spans carry cost.usd and cumulative run.cost.usd attributes
Subagents inherit at most 35% of the parent's remaining budget; the supervisor cannot spawn unbounded children
Corrective RAG limited to one automatic retry; a second retry is allowed only when confidence is below 0.4, and it must go through the human queue
Nightly FinOps report: cost per outcome, per tenant, per tool; Slack alert on p95 regression
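The subagent rule can be illustrated with a few lines of arithmetic. This is not Harbor's code: the delegate function is hypothetical, and it assumes the parent deducts the granted slice up front, which the text does not specify.

```python
from decimal import Decimal
from typing import Tuple

MAX_CHILD_FRACTION = Decimal("0.35")  # from the 35% rule above


def delegate(parent_remaining_usd: Decimal, requested_usd: Decimal) -> Tuple[Decimal, Decimal]:
    """Return (granted child budget, parent budget left after the grant)."""
    cap = parent_remaining_usd * MAX_CHILD_FRACTION
    granted = min(requested_usd, cap)
    # Assumption: the grant is reserved from the parent immediately.
    return granted, parent_remaining_usd - granted


remaining = Decimal("6.00")  # default per-ticket budget
for child in range(1, 4):
    granted, remaining = delegate(remaining, requested_usd=Decimal("5.00"))
    print(f"child {child}: granted ${granted}, parent left ${remaining}")
```

Because each slice is taken from what remains, successive grants shrink (2.10, then 1.365, then 0.88725 here), so no number of children can spend more than the parent's budget.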

Outcomes

Mean LLM cost per resolved ticket: $12.40 → $4.10
Runs hitting hard cap: 8.1% (mostly escalated to tier-2 as designed)
First-contact resolution: 61% → 60% (within noise)
Identified top cost driver: duplicate embedding on unchanged KB queries (fixed with query cache)
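The source does not describe how Harbor's query cache works. One plausible shape, offered only as a sketch, keys embeddings on the embedding model name plus a normalized query string. QueryEmbeddingCache, the normalization rule, and the stand-in embed function are all assumptions, and the sketch omits eviction and size limits.

```python
import hashlib
from typing import Callable, Dict, List


class QueryEmbeddingCache:
    def __init__(self, embed_fn: Callable[[str], List[float]], model: str) -> None:
        self._embed_fn = embed_fn
        self._model = model
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str) -> str:
        # Assumption: case and whitespace differences do not matter for retrieval.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(f"{self._model}\n{normalized}".encode("utf-8")).hexdigest()

    def embed(self, query: str) -> List[float]:
        key = self._key(query)
        if key in self._store:
            self.hits += 1  # no billable embedding call
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(query)  # billable call; should append a ledger event
        self._store[key] = vector
        return vector


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding API call."""
    return [float(len(text))]


cache = QueryEmbeddingCache(fake_embed, model="example-embed")
cache.embed("How do I reset my password?")
cache.embed("how do I reset  my password?")
print(cache.hits, cache.misses)
```

Including the model name in the key keeps vectors from different embedding models from being mixed after a model upgrade.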

Technique decision table

Approach	Best for	Weak when
Cloud invoice only	Early prototypes, single team, one workflow	Multi-tenant products, agent loops, FinOps accountability
Per-request logging without ledger	Simple chat APIs	Multi-step runs; cannot enforce cumulative caps
Run ledger + hard caps	Production agents with variable tool paths	Research jobs needing unbounded exploration (use separate keys)
Tenant daily quotas at gateway	SaaS with plan tiers	Diagnosing which agent step blew a single run
Chargeback to internal teams	Large orgs with shared platform	Startups before unit economics are understood

Common pitfalls

Trusting client-reported token counts — always compute from provider usage objects server-side.
Ignoring cached-token pricing — prefix cache hits change marginal cost; ledger must track them.
Subagents without parent budget slice — fan-out becomes a cost denial-of-service.
Retries that bypass the ledger — each retry must append events, not open a new untracked run.
Embedding spend in a different project — retrieval cost vanishes from agent dashboards.
Hard cap without graceful degradation — users get gibberish instead of a clean escalation.
Stale price tables — model price cuts or surcharges make FinOps reports lie until updated.
Attribution without outcome tags — optimizing mean cost per attempt rewards cheap failures.

Production checklist

Every run has a server-issued run_id and budget_usd before first LLM call.
Ledger appends events for completions, embeddings, and tool surcharges.
Subagents inherit a capped fraction of parent remaining budget.
Spans export cost.usd and cumulative run cost attributes.
Hard or soft caps trigger structured termination, not silent truncation.
Price table version snapshotted on each ledger event.
Tenant and feature dimensions present on all billable events.
Parallel tools reserve aggregate estimated cost before execution.
Nightly rollups: mean/p95 cost per successful outcome and per tenant.
Alerts on week-over-week p95 regression by agent version.
Cloud invoice reconciled to ledger totals monthly; drift investigated.
FinOps review before raising default per-run budgets.

Key takeaways

Agent spend is path-dependent — invoices alone cannot show which tool loop burned the budget.
Harbor cut mean ticket cost $12.40 → $4.10 with run ledgers, subagent caps, and RAG retry limits.
Attribute on the same IDs as traces — run, tenant, tool, outcome.
Enforce budgets in the orchestrator, not in post-hoc spreadsheets.
Measure cost per successful outcome, not per API request.