LLM agent cost attribution and token accounting explained

Harbor Support shipped a tier-1 ticket-resolution agent that looked cheap in staging: one GPT-4-class call per turn, a short system prompt, and a handful of tools. In production the mean run cost climbed to $12.40 per ticket. Finance assumed the model price list was wrong. It was not. Each ticket spawned an unbounded subagent research branch, three corrective RAG re-retrievals after low-confidence answers, and two full retry loops when the CRM tool timed out. Cloud invoices showed a single API key and one monthly total. Nobody could say which workflow, tenant, or tool step burned the budget.

Cost attribution tags every token and external API dollar to a run, step, tool call, and tenant so product and finance can reason about unit economics. Token accounting maintains a ledger inside the agent runtime: input tokens, output tokens, cached-prefix hits, embedding calls, and third-party surcharges decrement a per-run budget and emit metrics before the run finishes. Harbor added span-level cost tags, a run ledger, and hard caps tied to loop termination; mean cost per resolved ticket fell to $4.10 over eight weeks while first-contact resolution stayed within one point of baseline. This guide covers why agent spend is opaque, ledger design, attribution dimensions, budget enforcement, integration with observability and multi-tenant isolation, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why agent costs are harder than chatbot costs

A single-turn chat app bills predictably: one request, one response, one line item. Agents multiply spend along several axes that cloud dashboards rarely split:

  • Multi-turn loops — each ReAct iteration resends growing context; cost is superlinear in turns.
  • Tool side effects — retrieval embeds, rerankers, web search, and OCR pipelines charge outside the main completion API.
  • Subagent fan-out — a supervisor spawning three children can quadruple spend while the parent trace looks like “one run.”
  • Retries and fallbacks — upgrading to a larger model on failure doubles cost for the same user-visible outcome.
  • Shared infrastructure — one API key across teams hides which product line is uneconomic.

Without per-run ledgers, the only lever is blunt rate limits that throttle good traffic along with runaway loops.
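
To see why looped context is superlinear, a small sketch helps. It assumes every turn resends the full history and that each turn adds a fixed number of tokens; the function name and the token figures are illustrative, not Harbor's measurements.

```python
def cumulative_input_tokens(turns: int, base_prompt: int, added_per_turn: int) -> int:
    """Total input tokens billed when each turn resends the whole history."""
    total = 0
    context = base_prompt
    for _ in range(turns):
        total += context            # the full context is billed as input on this turn
        context += added_per_turn   # tool output and the model reply join the history
    return total


# Illustrative figures only: a 1,000-token prompt growing by 800 tokens per turn.
for n in (1, 5, 10, 20):
    print(n, cumulative_input_tokens(n, base_prompt=1_000, added_per_turn=800))
```

Under these assumptions 10 turns bill 46,000 input tokens and 20 turns bill 172,000, so doubling the turns multiplies input spend by roughly 3.7.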

What to measure

  • Tokens — prompt, completion, cached, reasoning (where billed separately)
  • External APIs — search, speech, vision, vector DB read units
  • Wall time — correlates with worker occupancy even when tokens are flat
  • Human time — escalations after budget exhaustion are a hidden cost

Run ledger and token accounting

A run ledger is an append-only record owned by the orchestrator, not the model. Every billable event decrements budget_remaining_usd (or token equivalents) and increments dimensional counters.

Ledger event shape

  • run_id, parent_run_id (for subagents), step_index
  • event_type: llm_completion | embedding | tool_surcharge
  • model, input_tokens, output_tokens, cached_tokens
  • unit_price_snapshot — price table version at event time
  • cost_usd — computed server-side; never trust client estimates
  • tenant_id, product, environment
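
As a minimal sketch, the event shape above could map onto a frozen dataclass with cost computed server-side from a versioned price table. The price values, the model name example-large, and the assumption that cached_tokens is a subset of input_tokens billed at a lower rate are all hypothetical; check how your provider reports cached tokens before reusing the formula.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical price table: USD per one million tokens. Placeholder values.
PRICE_TABLE_VERSION = "example-v1"
PRICES_PER_MTOK = {
    "example-large": {"input": 10.00, "cached": 2.50, "output": 30.00},
}


@dataclass(frozen=True)  # frozen: ledger events are never edited after they are appended
class LedgerEvent:
    run_id: str
    parent_run_id: Optional[str]  # set for subagent runs
    step_index: int
    event_type: str               # "llm_completion" | "embedding" | "tool_surcharge"
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    unit_price_snapshot: str      # price table version at event time
    cost_usd: float               # computed server-side
    tenant_id: str
    product: str
    environment: str


def completion_cost_usd(model: str, input_tokens: int, output_tokens: int,
                        cached_tokens: int) -> float:
    """Assumes cached_tokens is the part of input_tokens billed at the cached rate."""
    price = PRICES_PER_MTOK[model]
    uncached = input_tokens - cached_tokens
    micro_usd = (uncached * price["input"]
                 + cached_tokens * price["cached"]
                 + output_tokens * price["output"])
    return micro_usd / 1_000_000


event = LedgerEvent(
    run_id="run-123", parent_run_id=None, step_index=0,
    event_type="llm_completion", model="example-large",
    input_tokens=12_000, output_tokens=500, cached_tokens=8_000,
    unit_price_snapshot=PRICE_TABLE_VERSION,
    cost_usd=completion_cost_usd("example-large", 12_000, 500, 8_000),
    tenant_id="tenant-a", product="support_tier1", environment="prod",
)
print(event.cost_usd)
```

With these placeholder prices the example event costs $0.075. Storing the price table version on the event means a later price change does not rewrite history.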

Budget scopes

Scope             | Enforced when                          | Typical limit
------------------|----------------------------------------|---------------------------------------------
Per run           | Each ticket / task / session           | $0.50–$8 depending on SLA
Per step          | Before each LLM or expensive tool call | 25–40% of run cap
Per tenant / day  | API gateway before run starts          | Contract tier
Per subagent tree | Supervisor delegates work              | Child inherits capped slice of parent budget

When budget hits zero, the runtime should stop with a structured budget_exhausted outcome — not silently truncate context and hallucinate. Pair caps with context budget rules so compression triggers before hard failure.
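
One way to sketch per-run and per-step enforcement with a structured outcome is shown here. The class and function names, the 40% step cap default, and the shape of the returned dictionary are assumptions for illustration; the point is that the orchestrator checks the caps before the call and returns budget_exhausted instead of letting the run degrade silently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class BudgetExhausted(Exception):
    """Raised by the orchestrator, never by the model, when a cap would be breached."""


@dataclass
class RunLedger:
    run_id: str
    budget_usd: float
    step_cap_fraction: float = 0.40  # assumed; the scopes table suggests 25-40% of the run cap
    events: List[Tuple[str, float]] = field(default_factory=list)  # append-only

    @property
    def budget_remaining_usd(self) -> float:
        return self.budget_usd - sum(cost for _, cost in self.events)

    def authorize(self, estimated_cost_usd: float) -> None:
        """Check the per-step cap, then the per-run cap, before a billable call."""
        if estimated_cost_usd > self.budget_usd * self.step_cap_fraction:
            raise BudgetExhausted("step cap")
        if estimated_cost_usd > self.budget_remaining_usd:
            raise BudgetExhausted("run cap")

    def append(self, event_type: str, cost_usd: float) -> None:
        self.events.append((event_type, cost_usd))


def run_step(ledger: RunLedger, event_type: str,
             estimated_cost_usd: float, actual_cost_usd: float) -> Dict[str, object]:
    """Authorize, then record. actual_cost_usd stands in for the real call's billed cost."""
    try:
        ledger.authorize(estimated_cost_usd)
    except BudgetExhausted as exc:
        return {"outcome": "budget_exhausted", "run_id": ledger.run_id,
                "reason": str(exc)}
    ledger.append(event_type, actual_cost_usd)
    return {"outcome": "ok", "run_id": ledger.run_id,
            "budget_remaining_usd": ledger.budget_remaining_usd}


ledger = RunLedger(run_id="run-1", budget_usd=6.00)
for _ in range(4):
    print(run_step(ledger, "llm_completion", estimated_cost_usd=2.0, actual_cost_usd=2.0))
```

In the loop above the first three steps succeed and the fourth returns budget_exhausted with reason "run cap", which the caller can map to an escalation path.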

Attribution dimensions and observability

Cost tags must ride the same identifiers as traces so engineers can pivot from a spike in spend to the exact tool span.

Minimum tag set

  • run_id, agent_name, agent_version
  • tenant_id or customer_id for B2B attribution
  • feature — e.g. support_tier1, onboarding
  • tool_name on tool spans; model on LLM spans
  • outcome — resolved, escalated, failed, budget_exhausted
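
A small validator can catch spans that would be unattributable before they are exported. This sketch works on plain dictionaries, not a specific tracing SDK; the span_kind values and the choice to require outcome only on the root run span (stamped when the run closes) are assumptions.

```python
from typing import Dict, Set

BASE_TAGS = {"run_id", "agent_name", "agent_version", "feature"}


def missing_tags(attributes: Dict[str, str], span_kind: str) -> Set[str]:
    """Return required tag names that are absent or empty on a span."""
    required = set(BASE_TAGS)
    if span_kind == "tool":
        required.add("tool_name")
    elif span_kind == "llm":
        required.add("model")
    elif span_kind == "run":
        required.add("outcome")  # assumed to be set once, when the run closes
    missing = {name for name in required if not attributes.get(name)}
    # Either identifier satisfies the B2B attribution requirement.
    if not (attributes.get("tenant_id") or attributes.get("customer_id")):
        missing.add("tenant_id")
    return missing


tool_span = {
    "run_id": "run-123", "agent_name": "tier1", "agent_version": "1.4.0",
    "feature": "support_tier1", "customer_id": "c-9", "tool_name": "search_kb",
}
print(missing_tags(tool_span, "tool"))  # empty set: the span is fully attributable
```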

Rollups finance actually uses

  • Mean and p95 cost per successful outcome (not per attempt)
  • Cost per tenant per month vs contract revenue
  • Tool-level share: “42% of spend is embedding + rerank on search_kb”
  • Model mix: share of spend on premium vs routed small models

Export ledger summaries to your metrics stack nightly. Cloud provider invoices are reconciliation inputs, not the source of truth for product decisions.
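
The difference between cost per attempt and cost per successful outcome can be sketched as below. This uses one possible definition, total spend across all attempts divided by the number of resolved runs, plus a nearest-rank p95 over resolved runs; the function names and sample figures are made up for illustration.

```python
import math
from typing import Dict, List


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rollup(runs: List[Dict], success: str = "resolved") -> Dict[str, float]:
    total_spend = sum(r["cost_usd"] for r in runs)
    successes = [r["cost_usd"] for r in runs if r["outcome"] == success]
    if not successes:
        raise ValueError("no successful runs in this window")
    return {
        "mean_cost_per_attempt": total_spend / len(runs),
        # Failed and escalated runs still cost money, so their spend is loaded
        # onto the runs that actually resolved a ticket.
        "cost_per_successful_outcome": total_spend / len(successes),
        "p95_cost_successful_runs": nearest_rank_percentile(successes, 95),
    }


sample_runs = [
    {"run_id": "a", "outcome": "resolved", "cost_usd": 3.0},
    {"run_id": "b", "outcome": "resolved", "cost_usd": 4.0},
    {"run_id": "c", "outcome": "resolved", "cost_usd": 5.0},
    {"run_id": "d", "outcome": "failed", "cost_usd": 0.5},
    {"run_id": "e", "outcome": "escalated", "cost_usd": 6.0},
]
print(rollup(sample_runs))
```

In this sample the mean per attempt is $3.70 while the cost per successful outcome is about $6.17, which is the number that exposes cheap failures.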

Tool metering and hidden surcharges

Tools are where agent economics go wrong. Wrap every tool executor with a metering hook that records estimated cost before execution and actual cost after.

Common surcharges

  • Vector retrieval — embed query + top-k reads + reranker LLM call
  • Web search — per-query vendor fee plus summarization pass
  • Browser automation — headless browser minutes and captcha solves
  • Code sandbox — CPU-seconds and egress
  • Human approval wait — worker lease while paused at a gate

Metering pattern

  1. Declare estimated_cost_usd in the tool manifest for planner visibility
  2. Reserve budget atomically before call; release unused portion on fast success
  3. On timeout, charge reserved amount and surface partial result to the model
  4. Block tools whose rolling p95 cost exceeds policy without explicit override

For parallel tool calls, reserve the sum of estimates up front; concurrent calls should not each assume full budget availability.
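
The reserve-then-settle pattern, including the aggregate reservation for parallel calls, might look like the sketch below. The class name, the use of a thread lock for atomicity, and the convention of passing None on timeout are assumptions; a distributed runtime would need a transactional store instead of an in-process lock.

```python
import threading
from typing import Dict, Optional


class BudgetReservations:
    """In-process sketch of reserve-then-settle; not a distributed implementation."""

    def __init__(self, budget_usd: float) -> None:
        self.remaining_usd = budget_usd
        self._reserved: Dict[str, float] = {}
        self._lock = threading.Lock()

    def reserve_all(self, estimates: Dict[str, float]) -> bool:
        """Reserve the sum of estimates for a batch of parallel calls, or nothing."""
        with self._lock:
            total = sum(estimates.values())
            if total > self.remaining_usd:
                return False
            self.remaining_usd -= total
            self._reserved.update(estimates)
            return True

    def settle(self, call_id: str, actual_cost_usd: Optional[float] = None) -> float:
        """Charge the actual cost and release the rest; None means timeout, charge it all."""
        with self._lock:
            reserved = self._reserved.pop(call_id)
            charged = reserved if actual_cost_usd is None else actual_cost_usd
            self.remaining_usd += reserved - charged
            return charged


budget = BudgetReservations(budget_usd=1.00)
print(budget.reserve_all({"search_kb": 0.30, "web_search": 0.50}))  # batch fits
print(budget.reserve_all({"ocr": 0.40}))                            # would overdraw
budget.settle("search_kb", actual_cost_usd=0.10)  # fast success releases the unused part
budget.settle("web_search")                       # timeout charges the full reservation
print(round(budget.remaining_usd, 2))
```

Because the batch is reserved as one unit, two concurrent tools cannot both assume the full remaining budget is theirs.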

Enforcement vs reporting

Attribution without enforcement documents waste after the fact. Production systems need both.

Enforcement modes

  • Hard cap — abort run at budget; escalate to human or cheaper path
  • Soft cap — switch to smaller model and disable expensive tools
  • Progressive degradation — reduce retrieval k, skip reranker, shorten history
  • Admission control — reject new runs when tenant daily budget exceeded

Reporting-only mode

Useful during rollout: emit ledger events and dashboards but do not block. Limit reporting-only to two weeks; teams ignore dashboards that never cut a run.

Tie enforcement to model routing: when spend velocity exceeds a threshold mid-run, route subsequent steps to a cheaper model with a system note that precision may drop.
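
A rough sketch of velocity-based routing follows. The thresholds, the projection formula (average spend per completed step times the step limit), and the action names are assumptions chosen for illustration; real policies would be tuned per workflow.

```python
def choose_enforcement(spent_usd: float, budget_usd: float,
                       steps_done: int, max_steps: int) -> str:
    """Pick an enforcement action from current spend and spend velocity."""
    if spent_usd >= budget_usd:
        return "hard_cap"                       # abort and escalate
    velocity = spent_usd / max(steps_done, 1)   # average USD per completed step
    projected = velocity * max_steps            # spend if the run uses every step
    if projected > budget_usd:
        return "soft_cap"                       # route remaining steps to a cheaper model
    if spent_usd >= 0.5 * budget_usd:
        return "degrade"                        # e.g. lower retrieval k, skip reranker
    return "continue"


print(choose_enforcement(spent_usd=2.0, budget_usd=6.0, steps_done=2, max_steps=10))
```

Here a run that spent $2.00 in its first two steps projects to $10.00 over ten steps, so the sketch returns soft_cap well before the $6.00 hard cap is reached.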

Harbor Support refactor

Before metering, Harbor attributed LLM spend only at the level of the Zendesk integration's API key. Support leads estimated cost per ticket by dividing the monthly invoice by ticket volume. Subagent research, embedding indexes, and retry storms were invisible.

Changes shipped

  1. Run ledger in the orchestrator with per-ticket budget_usd: 6.00 default
  2. OpenTelemetry spans carry cost.usd and cumulative run.cost.usd attributes
  3. Subagents inherit max 35% of parent remaining budget; supervisor cannot spawn unbounded children
  4. Corrective RAG limited to one retry unless confidence < 0.4; second retry requires human queue
  5. Nightly FinOps report: cost per outcome, per tenant, per tool; Slack alert on p95 regression
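
The subagent rule in item 3 can be sketched as a slice of the parent's remaining budget. The 35% fraction comes from the list above; reserving each grant from the parent at spawn time, so siblings cannot spend the same dollars twice, is an assumption about how the slice is enforced, and the function names are hypothetical.

```python
from typing import List, Tuple


def child_budget_usd(parent_remaining_usd: float, requested_usd: float,
                     max_fraction: float = 0.35) -> float:
    """Grant a child the lesser of its request and a capped slice of the parent."""
    if parent_remaining_usd <= 0 or requested_usd <= 0:
        return 0.0
    return min(requested_usd, parent_remaining_usd * max_fraction)


def spawn_children(parent_remaining_usd: float,
                   requests: List[float]) -> Tuple[List[float], float]:
    grants = []
    for requested in requests:
        grant = child_budget_usd(parent_remaining_usd, requested)
        parent_remaining_usd -= grant  # reserve the slice so siblings cannot reuse it
        grants.append(grant)
    return grants, parent_remaining_usd


grants, remaining = spawn_children(6.00, [5.0, 5.0, 5.0])
print([round(g, 2) for g in grants], round(remaining, 2))
```

With a $6.00 parent and three children each asking for $5.00, the grants shrink from $2.10 to about $1.37 to about $0.89, so total child spend can never exceed what the parent had left.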

Outcomes

  • Mean LLM cost per resolved ticket: $12.40 → $4.10
  • Runs hitting hard cap: 8.1% (mostly escalated to tier-2 as designed)
  • First-contact resolution: 61% → 60% (within noise)
  • Identified top cost driver: duplicate embedding on unchanged KB queries (fixed with query cache)

Technique decision table

Approach                           | Best for                                    | Weak when
-----------------------------------|---------------------------------------------|----------------------------------------------------------------
Cloud invoice only                 | Early prototypes, single team, one workflow | Multi-tenant products, agent loops, FinOps accountability
Per-request logging without ledger | Simple chat APIs                            | Multi-step runs; cannot enforce cumulative caps
Run ledger + hard caps             | Production agents with variable tool paths  | Research jobs needing unbounded exploration (use separate keys)
Tenant daily quotas at gateway     | SaaS with plan tiers                        | Diagnosing which agent step blew a single run
Chargeback to internal teams       | Large orgs with shared platform             | Startups before unit economics are understood

Common pitfalls

  • Trusting client-reported token counts — always compute from provider usage objects server-side.
  • Ignoring cached-token pricing — prefix cache hits change marginal cost; ledger must track them.
  • Subagents without parent budget slice — fan-out becomes a cost denial-of-service.
  • Retries that bypass the ledger — each retry must append events, not open a new untracked run.
  • Embedding spend in a different project — retrieval cost vanishes from agent dashboards.
  • Hard cap without graceful degradation — users get gibberish instead of a clean escalation.
  • Stale price tables — model price cuts or surcharges make FinOps reports lie until updated.
  • Attribution without outcome tags — optimizing mean cost per attempt rewards cheap failures.

Production checklist

  • Every run has a server-issued run_id and budget_usd before first LLM call.
  • Ledger appends events for completions, embeddings, and tool surcharges.
  • Subagents inherit a capped fraction of parent remaining budget.
  • Spans export cost.usd and cumulative run cost attributes.
  • Hard or soft caps trigger structured termination, not silent truncation.
  • Price table version snapshotted on each ledger event.
  • Tenant and feature dimensions present on all billable events.
  • Parallel tools reserve aggregate estimated cost before execution.
  • Nightly rollups: mean/p95 cost per successful outcome and per tenant.
  • Alerts on week-over-week p95 regression by agent version.
  • Cloud invoice reconciled to ledger totals monthly; drift investigated.
  • FinOps review before raising default per-run budgets.

Key takeaways

  • Agent spend is path-dependent — invoices alone cannot show which tool loop burned the budget.
  • Harbor cut mean ticket cost $12.40 → $4.10 with run ledgers, subagent caps, and RAG retry limits.
  • Attribute on the same IDs as traces — run, tenant, tool, outcome.
  • Enforce budgets in the orchestrator, not in post-hoc spreadsheets.
  • Measure cost per successful outcome, not per API request.