Guide
LLM agent cost attribution and token accounting explained
Harbor Support shipped a tier-1 ticket-resolution agent that looked cheap in staging: one GPT-4-class call per turn, a short system prompt, and a handful of tools. In production the mean run cost climbed to $12.40 per ticket. Finance assumed the model price list was wrong. It was not. Each ticket spawned an unbounded subagent research branch, three corrective RAG re-retrievals after low-confidence answers, and two full retry loops when the CRM tool timed out. Cloud invoices showed a single API key and one monthly total. Nobody could say which workflow, tenant, or tool step burned the budget.
Cost attribution tags every token and external API dollar to a run, step, tool call, and tenant so product and finance can reason about unit economics. Token accounting maintains a ledger inside the agent runtime: input tokens, output tokens, cached-prefix hits, embedding calls, and third-party surcharges decrement a per-run budget and emit metrics before the run finishes. Harbor added span-level cost tags, a run ledger, and hard caps tied to loop termination; mean cost per resolved ticket fell to $4.10 over eight weeks while first-contact resolution stayed within one point of baseline. This guide covers why agent spend is opaque, ledger design, attribution dimensions, budget enforcement, integration with observability and multi-tenant isolation, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
Why agent costs are harder than chatbot costs
A single-turn chat app bills predictably: one request, one response, one line item. Agents multiply spend along several axes that cloud dashboards rarely split:
- Multi-turn loops — each ReAct iteration resends growing context; cost is superlinear in turns.
- Tool side effects — retrieval embeds, rerankers, web search, and OCR pipelines charge outside the main completion API.
- Subagent fan-out — a supervisor spawning three children can quadruple spend while the parent trace looks like “one run.”
- Retries and fallbacks — upgrading to a larger model on failure doubles cost for the same user-visible outcome.
- Shared infrastructure — one API key across teams hides which product line is uneconomic.
Without per-run ledgers, the only lever is blunt rate limits that throttle good traffic along with runaway loops.
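The superlinear growth of loop cost can be checked with a little arithmetic. The sketch below assumes the full context is resent on every turn and grows by a fixed number of tokens per turn; the function name and the token figures are illustrative assumptions, not Harbor measurements, and prefix caching would lower the billed amount.

```python
def cumulative_input_tokens(base_tokens: int, tokens_added_per_turn: int, turns: int) -> int:
    """Total prompt tokens billed across a loop that resends its full context each turn.

    Assumes the context starts at base_tokens and grows by a fixed
    tokens_added_per_turn after every turn (a deliberate simplification).
    """
    total = 0
    context = base_tokens
    for _ in range(turns):
        total += context                  # the whole context is billed as input again
        context += tokens_added_per_turn  # tool output and model reply extend the context
    return total


if __name__ == "__main__":
    for turns in (1, 5, 10, 20):
        print(turns, cumulative_input_tokens(1_000, 800, turns))
```

Under these assumptions, doubling the loop from 10 to 20 turns raises billed input tokens from 46,000 to 172,000, roughly 3.7 times rather than 2 times.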
What to measure
- Tokens — prompt, completion, cached, reasoning (where billed separately)
- External APIs — search, speech, vision, vector DB read units
- Wall time — correlates with worker occupancy even when tokens are flat
- Human time — escalations after budget exhaustion are a hidden cost
Run ledger and token accounting
A run ledger is an append-only record owned by the orchestrator, not the model. Every billable event decrements budget_remaining_usd (or token equivalents) and increments dimensional counters.
Ledger event shape
- run_id, parent_run_id (for subagents), step_index
- event_type: llm_completion | embedding | tool_surcharge
- model, input_tokens, output_tokens, cached_tokens
- unit_price_snapshot: price table version at event time
- cost_usd: computed server-side; never trust client estimates
- tenant_id, product, environment
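One way to represent this event shape is a frozen dataclass plus a server-side pricing function. This is a minimal sketch: the field names follow the list above, but PRICE_TABLE, its prices, the model name example-large, and the convention that input_tokens includes cached tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Hypothetical price table in USD per 1M tokens, keyed by (version, model).
# The figures are placeholders, not any provider's real rates.
PRICE_TABLE = {
    ("2024-06-v1", "example-large"): {
        "input": Decimal("10.00"),
        "cached_input": Decimal("5.00"),
        "output": Decimal("30.00"),
    },
}


@dataclass(frozen=True)
class LedgerEvent:
    run_id: str
    parent_run_id: Optional[str]   # set only for subagent runs
    step_index: int
    event_type: str                # "llm_completion" | "embedding" | "tool_surcharge"
    model: str
    input_tokens: int              # assumed to include cached_tokens
    output_tokens: int
    cached_tokens: int
    unit_price_snapshot: str       # price table version at event time
    cost_usd: Decimal              # computed server-side
    tenant_id: str
    product: str
    environment: str


def price_completion(model: str, input_tokens: int, output_tokens: int,
                     cached_tokens: int, price_version: str) -> Decimal:
    """Compute cost from provider-reported usage and a versioned price table."""
    prices = PRICE_TABLE[(price_version, model)]
    uncached = input_tokens - cached_tokens
    total = (uncached * prices["input"]
             + cached_tokens * prices["cached_input"]
             + output_tokens * prices["output"])
    return total / Decimal(1_000_000)


def make_completion_event(run_id: str, step_index: int, model: str, usage: dict,
                          price_version: str, tenant_id: str) -> LedgerEvent:
    """Build an event from a usage dict with the three token counts (assumed shape)."""
    cost = price_completion(model, usage["input_tokens"], usage["output_tokens"],
                            usage["cached_tokens"], price_version)
    return LedgerEvent(run_id, None, step_index, "llm_completion", model,
                       usage["input_tokens"], usage["output_tokens"],
                       usage["cached_tokens"], price_version, cost,
                       tenant_id, "support_tier1", "production")
```

Decimal is used rather than float so that many small charges sum without rounding drift. Storing the price table version on each event lets old events be re-audited after prices change.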
Budget scopes
| Scope | Enforced when | Typical limit |
|---|---|---|
| Per run | Each ticket / task / session | $0.50–$8 depending on SLA |
| Per step | Before each LLM or expensive tool call | 25–40% of run cap |
| Per tenant / day | API gateway before run starts | Contract tier |
| Per subagent tree | Supervisor delegates work | Child inherits capped slice of parent budget |
When budget hits zero, the runtime should stop with a structured budget_exhausted outcome, not silently truncate context and hallucinate. Pair caps with context budget rules so compression triggers before hard failure.
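A per-run cap with a structured stop might look like the sketch below. The class and function names (RunLedger, RunOutcome, run_steps) are hypothetical, and the steps are plain tuples standing in for real LLM and tool calls; the point is only that the orchestrator checks the estimate before each step and returns a budget_exhausted outcome instead of continuing.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Iterable, List, Tuple


class BudgetExhausted(Exception):
    """Raised when the next step's estimate does not fit in the remaining budget."""


@dataclass
class RunOutcome:
    status: str            # "resolved" or "budget_exhausted" in this sketch
    spent_usd: Decimal
    reason: str = ""


@dataclass
class RunLedger:
    run_id: str
    budget_usd: Decimal
    spent_usd: Decimal = Decimal("0")
    events: List[Tuple[str, Decimal]] = field(default_factory=list)

    @property
    def budget_remaining_usd(self) -> Decimal:
        return self.budget_usd - self.spent_usd

    def require(self, estimated_cost_usd: Decimal) -> None:
        """Check an estimate against the remaining budget before a step runs."""
        if estimated_cost_usd > self.budget_remaining_usd:
            raise BudgetExhausted(
                f"need {estimated_cost_usd}, have {self.budget_remaining_usd}"
            )

    def charge(self, event_type: str, cost_usd: Decimal) -> None:
        """Append-only: the actual cost is recorded even if it overran the estimate."""
        self.events.append((event_type, cost_usd))
        self.spent_usd += cost_usd


def run_steps(ledger: RunLedger,
              steps: Iterable[Tuple[str, Decimal, Decimal]]) -> RunOutcome:
    """Each step is (event_type, estimated_cost, actual_cost), a stand-in for a real call."""
    for event_type, estimated, actual in steps:
        try:
            ledger.require(estimated)
        except BudgetExhausted as exc:
            # Structured stop: the caller can escalate instead of guessing.
            return RunOutcome("budget_exhausted", ledger.spent_usd, str(exc))
        ledger.charge(event_type, actual)
    return RunOutcome("resolved", ledger.spent_usd)
```

With a $1.00 budget and three steps estimated at $0.40 each, the third step is refused once actual spend reaches $0.80, and the caller receives the outcome and the reason rather than a truncated answer.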
Attribution dimensions and observability
Cost tags must ride the same identifiers as traces so engineers can pivot from a spike in spend to the exact tool span.
Minimum tag set
- run_id, agent_name, agent_version
- tenant_id or customer_id for B2B attribution
- feature: for example support_tier1, onboarding
- tool_name on tool spans; model on LLM spans
- outcome: resolved, escalated, failed, budget_exhausted
Rollups finance actually uses
- Mean and p95 cost per successful outcome (not per attempt)
- Cost per tenant per month vs contract revenue
- Tool-level share: “42% of spend is embedding + rerank on search_kb”
- Model mix: share of spend on premium vs routed small models
Export ledger summaries to your metrics stack nightly. Cloud provider invoices are reconciliation inputs, not the source of truth for product decisions.
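The difference between cost per attempt and cost per successful outcome is easy to show on a handful of run summaries. The sketch below assumes each run is a dict with outcome and cost_usd keys, treats only resolved as success, and uses a nearest-rank p95; all of these are illustrative choices, and loaded_cost_per_success (total spend divided by successes) is one possible reading of "cost per successful outcome".

```python
import math
from statistics import mean
from typing import Dict, List, Optional, Sequence


def p95_nearest_rank(values: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method (no interpolation)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def rollup(runs: List[Dict], success_outcomes: Sequence[str] = ("resolved",)) -> Optional[Dict]:
    """Summarize run costs; returns None when no run succeeded."""
    success_costs = [r["cost_usd"] for r in runs if r["outcome"] in success_outcomes]
    if not success_costs:
        return None
    total_spend = sum(r["cost_usd"] for r in runs)
    return {
        # Rewards cheap failures: every attempt counts in the denominator.
        "mean_cost_per_attempt": total_spend / len(runs),
        # Cost of the runs that actually succeeded.
        "mean_cost_of_successes": mean(success_costs),
        "p95_cost_of_successes": p95_nearest_rank(success_costs),
        # Spend on failed runs is loaded onto the successes.
        "loaded_cost_per_success": total_spend / len(success_costs),
    }


if __name__ == "__main__":
    sample = [
        {"outcome": "resolved", "cost_usd": 2.0},
        {"outcome": "resolved", "cost_usd": 4.0},
        {"outcome": "failed", "cost_usd": 0.5},
        {"outcome": "resolved", "cost_usd": 6.0},
        {"outcome": "budget_exhausted", "cost_usd": 8.0},
    ]
    print(rollup(sample))
```

In the sample data the mean cost per attempt is $4.10, while the loaded cost per resolved run is about $6.83, because the failed and capped runs still consumed budget.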
Tool metering and hidden surcharges
Tools are where agent economics go wrong. Wrap every tool executor with a metering hook that records estimated cost before execution and actual cost after.
Common surcharges
- Vector retrieval — embed query + top-k reads + reranker LLM call
- Web search — per-query vendor fee plus summarization pass
- Browser automation — headless browser minutes and captcha solves
- Code sandbox — CPU-seconds and egress
- Human approval wait — worker lease while paused at a gate
Metering pattern
- Declare estimated_cost_usd in the tool manifest for planner visibility
- Reserve budget atomically before call; release unused portion on fast success
- On timeout, charge reserved amount and surface partial result to the model
- Block tools whose rolling p95 cost exceeds policy without explicit override
For parallel tool calls, reserve the sum of estimates up front; concurrent calls should not each assume full budget availability.
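The reserve-then-settle pattern for parallel tool calls can be sketched with a lock-guarded reservation book. BudgetReservations and its methods are hypothetical names; the sketch assumes an in-process orchestrator, so a distributed runtime would need an equivalent atomic operation in its shared store.

```python
import threading
from decimal import Decimal
from typing import Dict, List, Optional


class BudgetReservations:
    """Reserve estimated tool cost up front; settle with actual cost afterwards."""

    def __init__(self, budget_usd: Decimal) -> None:
        self._lock = threading.Lock()
        self._remaining = budget_usd
        self._reserved: Dict[int, Decimal] = {}
        self._next_id = 0

    @property
    def remaining_usd(self) -> Decimal:
        with self._lock:
            return self._remaining

    def reserve_parallel(self, estimates: List[Decimal]) -> Optional[List[int]]:
        """Reserve the sum of all estimates atomically, or nothing at all."""
        with self._lock:
            total = sum(estimates, Decimal("0"))
            if total > self._remaining:
                return None  # caller should drop or serialize some of the calls
            self._remaining -= total
            ids = []
            for estimate in estimates:
                self._next_id += 1
                self._reserved[self._next_id] = estimate
                ids.append(self._next_id)
            return ids

    def settle(self, reservation_id: int,
               actual_cost_usd: Optional[Decimal] = None) -> Decimal:
        """Settle one reservation; None means timeout, so the full reservation is charged."""
        with self._lock:
            reserved = self._reserved.pop(reservation_id)
            charged = reserved if actual_cost_usd is None else actual_cost_usd
            # Release the unused portion (negative if the call overran its estimate).
            self._remaining += reserved - charged
            return charged
```

Reserving the aggregate under one lock is what prevents three concurrent calls from each seeing the same remaining balance and jointly overspending it.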
Enforcement vs reporting
Attribution without enforcement documents waste after the fact. Production systems need both.
Enforcement modes
- Hard cap: abort run at budget; escalate to human or cheaper path
- Soft cap: switch to smaller model and disable expensive tools
- Progressive degradation: reduce retrieval k, skip reranker, shorten history
- Admission control: reject new runs when tenant daily budget exceeded
Reporting-only mode
Useful during rollout: emit ledger events and dashboards but do not block. Limit reporting-only to two weeks; teams ignore dashboards that never cut a run.
Tie enforcement to model routing: when spend velocity exceeds a threshold mid-run, route subsequent steps to a cheaper model with a system note that precision may drop.
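The in-run enforcement modes can be expressed as a small policy function evaluated before each step. The thresholds (60% and 80% of budget) and the names Action, Decision, and decide are assumptions chosen for illustration; admission control is left out because it runs at the gateway before a run starts.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    DEGRADE = "degrade"      # reduce retrieval k, skip reranker, shorten history
    SOFT_CAP = "soft_cap"    # smaller model, expensive tools disabled
    HARD_CAP = "hard_cap"    # abort and escalate


@dataclass(frozen=True)
class Decision:
    action: Action
    enforced: bool           # False in reporting-only mode or when nothing triggers


def decide(spent_usd: float, budget_usd: float, reporting_only: bool = False,
           degrade_at: float = 0.6, soft_cap_at: float = 0.8) -> Decision:
    """Map the fraction of budget spent to an enforcement action."""
    fraction = spent_usd / budget_usd
    if fraction >= 1.0:
        action = Action.HARD_CAP
    elif fraction >= soft_cap_at:
        action = Action.SOFT_CAP
    elif fraction >= degrade_at:
        action = Action.DEGRADE
    else:
        action = Action.CONTINUE
    # Reporting-only mode still computes the action so dashboards can show
    # what would have happened, but the orchestrator does not act on it.
    enforced = action is not Action.CONTINUE and not reporting_only
    return Decision(action, enforced)
```

Returning the would-be action even when reporting_only is set makes it possible to measure, during rollout, how many runs enforcement would have cut.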
Harbor Support refactor
Before metering, Harbor could attribute LLM spend only to the API key used by the Zendesk integration. Support leads guessed cost per ticket by dividing the monthly invoice by ticket volume. Subagent research, embedding indexes, and retry storms were invisible.
Changes shipped
- Run ledger in the orchestrator with per-ticket budget_usd: 6.00 default
- OpenTelemetry spans carry cost.usd and cumulative run.cost.usd attributes
- Subagents inherit max 35% of parent remaining budget; supervisor cannot spawn unbounded children
- Corrective RAG limited to one retry unless confidence < 0.4; second retry requires human queue
- Nightly FinOps report: cost per outcome, per tenant, per tool; Slack alert on p95 regression
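The 35% subagent rule can be sketched as a grant function. The sketch assumes the grant is deducted from the parent's remaining budget at delegation time, which the source does not specify; the function names are hypothetical.

```python
from decimal import Decimal
from typing import List

MAX_CHILD_FRACTION = Decimal("0.35")  # the subagent cap listed in the changes


def grant_child_budget(parent_remaining_usd: Decimal, requested_usd: Decimal,
                       max_fraction: Decimal = MAX_CHILD_FRACTION) -> Decimal:
    """A child receives what it asks for, up to a fraction of what the parent has left."""
    ceiling = parent_remaining_usd * max_fraction
    return min(requested_usd, ceiling)


def delegate_all(parent_remaining_usd: Decimal, requests: List[Decimal]) -> List[Decimal]:
    """Grant budgets to successive children, deducting each grant from the parent."""
    grants = []
    for requested in requests:
        grant = grant_child_budget(parent_remaining_usd, requested)
        parent_remaining_usd -= grant   # assumption: deducted up front
        grants.append(grant)
    return grants


if __name__ == "__main__":
    # Three children each ask for $5 from a parent holding the $6.00 default.
    print(delegate_all(Decimal("6.00"), [Decimal("5")] * 3))
```

Because each grant is a fraction of what remains, successive grants shrink ($2.10, then $1.365, then about $0.89 here) and the total can never exceed the parent's budget, however many children the supervisor spawns.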
Outcomes
- Mean LLM cost per resolved ticket: $12.40 → $4.10
- Runs hitting hard cap: 8.1% (mostly escalated to tier-2 as designed)
- First-contact resolution: 61% → 60% (within noise)
- Identified top cost driver: duplicate embedding on unchanged KB queries (fixed with query cache)
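A query cache of the kind mentioned in the last outcome could be as simple as the sketch below. The source does not describe Harbor's cache, so everything here is an assumption: the EmbeddingCache name, the whitespace and case normalization, the in-memory dict, and keying on a kb_version string so entries are not reused after the knowledge base changes. The embed_fn argument stands in for whatever embedding call the system uses.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Avoid paying twice to embed the same query against an unchanged KB."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn          # hypothetical billable embedding call
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, query: str, kb_version: str) -> List[float]:
        # Normalize so trivial differences in case or spacing still hit the cache.
        normalized = " ".join(query.lower().split())
        key = hashlib.sha256(f"{kb_version}:{normalized}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(normalized)   # only misses incur cost
        self._store[key] = vector
        return vector
```

Recording hits and misses alongside ledger events makes the saving visible in the same rollups as the spend it avoids.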
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Cloud invoice only | Early prototypes, single team, one workflow | Multi-tenant products, agent loops, FinOps accountability |
| Per-request logging without ledger | Simple chat APIs | Multi-step runs; cannot enforce cumulative caps |
| Run ledger + hard caps | Production agents with variable tool paths | Research jobs needing unbounded exploration (use separate keys) |
| Tenant daily quotas at gateway | SaaS with plan tiers | Diagnosing which agent step blew a single run |
| Chargeback to internal teams | Large orgs with shared platform | Startups before unit economics are understood |
Common pitfalls
- Trusting client-reported token counts — always compute from provider usage objects server-side.
- Ignoring cached-token pricing — prefix cache hits change marginal cost; ledger must track them.
- Subagents without parent budget slice — fan-out becomes a cost denial-of-service.
- Retries that bypass the ledger — each retry must append events, not open a new untracked run.
- Embedding spend in a different project — retrieval cost vanishes from agent dashboards.
- Hard cap without graceful degradation — users get gibberish instead of a clean escalation.
- Stale price tables — model price cuts or surcharges make FinOps reports lie until updated.
- Attribution without outcome tags — optimizing mean cost per attempt rewards cheap failures.
Production checklist
- Every run has a server-issued run_id and budget_usd before first LLM call.
- Ledger appends events for completions, embeddings, and tool surcharges.
- Subagents inherit a capped fraction of parent remaining budget.
- Spans export cost.usd and cumulative run cost attributes.
- Hard or soft caps trigger structured termination, not silent truncation.
- Price table version snapshotted on each ledger event.
- Tenant and feature dimensions present on all billable events.
- Parallel tools reserve aggregate estimated cost before execution.
- Nightly rollups: mean/p95 cost per successful outcome and per tenant.
- Alerts on week-over-week p95 regression by agent version.
- Cloud invoice reconciled to ledger totals monthly; drift investigated.
- FinOps review before raising default per-run budgets.
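The monthly reconciliation item can be reduced to a drift check. This is a minimal sketch: the 2% tolerance and the reconcile name are assumptions, and a real check would likely compare per provider and per project rather than one grand total.

```python
def reconcile(invoice_total_usd: float, ledger_total_usd: float,
              tolerance: float = 0.02) -> dict:
    """Compare the provider invoice with summed ledger cost for the same period."""
    if invoice_total_usd <= 0:
        raise ValueError("invoice total must be positive")
    # Positive drift means the invoice exceeds the ledger: some spend is untracked.
    drift = (invoice_total_usd - ledger_total_usd) / invoice_total_usd
    return {"drift_fraction": drift, "investigate": abs(drift) > tolerance}


if __name__ == "__main__":
    print(reconcile(10_000.0, 9_500.0))
```

Positive drift usually points at spend that bypassed the ledger, such as retries in an untracked run or embeddings billed to a different project; negative drift suggests a stale price table.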
Key takeaways
- Agent spend is path-dependent — invoices alone cannot show which tool loop burned the budget.
- Harbor cut mean ticket cost $12.40 → $4.10 with run ledgers, subagent caps, and RAG retry limits.
- Attribute on the same IDs as traces — run, tenant, tool, outcome.
- Enforce budgets in the orchestrator, not in post-hoc spreadsheets.
- Measure cost per successful outcome, not per API request.
Related reading
- Context budget and token management — compress context before spend caps bite
- Agent observability and tracing — spans that carry cost attributes
- Loop termination — stop criteria tied to budget exhaustion
- Rate limiting and quota management — gateway-level tenant throttles