Guide

LLM agent model routing and intent classification systems explained

Harbor Desk ships a customer-support agent with CRM read access, refund tools, and a knowledge base of 40,000 articles. At launch every ticket — from “Where is my order?” to “Reconcile three invoices against our enterprise contract” — hit the same flagship model. FinOps review showed 52% of monthly inference spend on runs that needed only a lookup and a templated reply. CSAT was flat, but unit economics were not sustainable at scale.

The refactor added an intent classification and model routing layer before the first generation step: lightweight classifiers score task complexity, required tool depth, and latency tier; a routing policy picks mini, standard, or reasoning model profiles. Runs that stall or fail validation escalate mid-flight to a higher tier instead of failing the user. Overspend on simple intents fell from 52% to 18%; CSAT held within 0.3 points. This guide explains intent taxonomy, classifier stacks, tier matrices, escalation cascades, Harbor Desk's refactor, a technique decision table, pitfalls, and a production checklist. It complements model fallback on outages, cost attribution, and general LLM routing cascades — each solves a different problem.

What model routing does in agent systems

Model routing selects which LLM profile serves a given agent run before and during execution, based on predicted task requirements. It is distinct from fallback routing, which reacts when a provider errors or degrades, and from tool routing, which decides which APIs the agent may call.

Core responsibilities

  • Classify intent and complexity — FAQ lookup vs multi-step workflow vs reasoning-heavy analysis; estimate tool-call depth and context size.
  • Match tier to SLO envelope — chat widgets need sub-2s first token; batch jobs tolerate minutes on a reasoning model.
  • Respect cost and quota budgets — per-tenant daily caps, per-run token ceilings, and premium-tier rate limits.
  • Escalate on failure signals — low confidence, repeated tool errors, guardrail blocks, or evaluator scores below threshold trigger tier promotion.
  • Pin model for run consistency — once a run starts, the chosen profile stays fixed unless an explicit escalation event fires; avoids mid-conversation voice drift.
  • Emit routing audit events — classifier scores, chosen tier, escalation reason, and realized token cost for FinOps and regression analysis.

Routing sits early in the agent pipeline — typically after ingress security and before context assembly — and pairs with SLA deadline enforcement so latency tiers are enforceable, not aspirational.

Intent classification pipeline

Production classifiers rarely rely on a single LLM call. Harbor Desk uses a three-stage cascade tuned for sub-40 ms p95 routing latency:

Stage 1: Rules and metadata (deterministic)

  • Channel tags (widget vs email vs API) set baseline tier ceilings.
  • Regex and keyword maps catch high-confidence intents: password reset, order-status template, billing portal link.
  • Attachment count, message length, and language locale adjust complexity priors.
  • Enterprise tenant flags may mandate minimum tier for compliance logging.

Stage 2: Lightweight ML classifier (fast)

  • Fine-tuned small encoder or embedding + logistic head over normalized user text.
  • Output: intent label (lookup, troubleshoot, write-action, analysis) plus complexity score 0–1.
  • Trained on labeled production traces; refreshed weekly from human-reviewed escalations.

Stage 3: LLM judge (borderline only)

  • Runs only when stage-2 confidence falls in the 0.35–0.65 band (~11% of traffic).
  • Mini model with structured JSON output: recommended tier, rationale code, estimated tool steps.
  • Hard timeout 800 ms; on timeout, default to standard tier (never mini for ambiguous write paths).

Classifier outputs feed a routing policy engine — not directly the model API. Policies encode business rules: refund tools never run on mini tier; read-only FAQ on mini; multi-document analysis requires reasoning tier unless SLA tier is EXPRESS (then standard with truncated context).

Model tier matrix and routing policies

Define tiers by capability bundle, not vendor SKU alone. Each tier maps to: model id, max context, tool allowlist, max loop steps, temperature profile, and cost multiplier.

TierTypical useTool depthLatency targetEscalate when
MiniFAQ, status lookup, templated reply0–1 read< 1.5 s TTFTWrite tool requested; confidence < 0.7 after gen
StandardTroubleshoot, single write, RAG Q&A1–4 mixed< 3 s TTFT3+ tool failures; guardrail repair loop > 2
ReasoningMulti-doc analysis, policy edge cases5+ or chain-of-thought< 8 s TTFTHuman escalation; budget exhausted

Policy dimensions beyond intent

  • SLA class — EXPRESS runs may skip mini even for lookup if queue depth threatens deadline; BATCH may downgrade with user consent.
  • Tenant plan — free tier capped at standard; enterprise may default reasoning for audit-sensitive channels.
  • Session history — prior escalation in thread pins minimum standard for remainder of session.
  • Feature flags — canary reasoning model on 5% of standard-tier traffic for eval; routed via experiment assignment.

Policies are versioned and deployed like prompt templates. A bad policy change can shift 80% of traffic to the wrong tier overnight — pair routing changes with shadow metrics before full promotion.

Mid-run escalation and cascade mechanics

Upfront classification is wrong often enough that mid-run escalation is mandatory. Harbor Desk promotes tier when any trigger fires:

  • Tool request mismatch — mini tier model emits a write-tool call; runtime blocks and re-queues on standard with condensed context.
  • Evaluator failure — output guardrail or grounding check fails twice; escalate with failure reason injected as system note.
  • Low self-reported confidence — structured output includes confidence field below 0.6 on factual claims.
  • User explicit upgrade — “Talk to a specialist” or “This needs a deeper review” bumps tier and may trigger HITL.
  • Token budget burn rate — 70% of run budget consumed in step 1 with no final answer; promote or hand off.

Escalation replays context rather than continuing the same completion: summarize prior steps into a structured handoff block (goal, tools tried, errors, partial answer) to avoid token waste and confusion. Cap escalations at one per run unless HITL approves a second — prevents infinite tier climbing on adversarial inputs.

Downgrade mid-run is rare and usually user-initiated (“Just give me the tracking link”). Automatic downgrade risks quality regressions; prefer early mini routing over late downgrade.

Observability and FinOps integration

Routing without measurement optimizes the wrong metric. Wire every decision into token accounting:

  • routing.intent_label, routing.tier_chosen, routing.classifier_confidence on run start span.
  • routing.escalated, routing.escalation_reason on promotion events.
  • Weekly dashboards: spend by tier, escalation rate by intent, CSAT by tier, false-mini rate (runs escalated within 2 steps).
  • Golden-set regression: 500 labeled tickets; alert if mini recall on true-mini drops below 92% or false-mini on write-intent exceeds 0.5%.

Compare realized cost to counterfactual “all flagship” monthly to justify routing infra. Harbor Desk's 41% unit-cost reduction funded two classifier retrain cycles and a dedicated routing on-call rotation.

Harbor Desk refactor (worked example)

Before routing, Harbor Desk characteristics:

  • Single flagship model for all channels; p95 cost $0.084 per run.
  • 52% of runs: one KB retrieval + short reply; median 1,800 input tokens.
  • Write-tool runs (refunds, address changes): 14% of volume but 31% of spend.
  • No escalation path except full human handoff on hard failure.

Refactor shipped in four slices:

  1. Intent labels v1 — 6 labels from 12k human-reviewed tickets; stage-2 encoder classifier.
  2. Tier matrix — mini (Haiku-class), standard (Sonnet-class), reasoning (Opus-class) with tool allowlists.
  3. Mid-run escalation — tool mismatch and double-guardrail-fail triggers; context handoff summarizer.
  4. FinOps dashboards — false-mini alerts; weekly policy review with support ops.

Results after 30 days: overspend on simple intents 52% → 18%; blended unit cost $0.084 → $0.049; CSAT 4.31 → 4.28 (within noise); false-mini rate 0.9% (target < 1%); p95 routing overhead 28 ms. Escalation rate stabilized at 7.2% of mini starts — mostly legitimate write-path detection.

Technique decision table

ApproachStrengthWeaknessBest fit
Intent routing + tier matrix (this guide)Optimizes cost-quality proactively; scales traffic mixClassifier maintenance; policy complexityMulti-intent agents at volume
Single flagship modelSimplest; highest ceiling per run52% overspend in Harbor DeskLow volume, uniform tasks
Fallback ladder onlySurvives outagesDoes not reduce steady-state costPair with intent routing
LLM judge on every requestFlexible routing rationaleLatency + cost eats savingsBorderline band only (~10%)
User-selected model tierTransparent UXUsers over-select premium; abuseDeveloper tools, not support bots
Random cascade (try cheap, retry expensive)No classifier infraDuplicate work; bad UX on retriesBatch offline eval, not live chat

Pitfalls

  • Routing write tools to mini tier — the highest-impact false economy; block at policy layer, not hope the model refuses.
  • Classifier trained on stale intents — new product flows drift labels; monthly retrain minimum.
  • No mid-run escalation — upfront wrong tier becomes a hard failure; users churn.
  • Escalation without context handoff — replaying full chat on premium model doubles tokens.
  • Optimizing cost alone — CSAT and task success must guardrail tier promotion; FinOps without quality metrics backfires.
  • Per-message tier switching — voice and reasoning consistency break; pin tier per run or session.
  • Shadow routing never promoted — teams run classifiers in log-only mode forever; set a promotion deadline with metric gates.
  • Ignoring locale and channel — email forwards look complex but may be simple; widget one-liners may hide multi-step needs.

Production checklist

  • Define intent taxonomy aligned to tool allowlists and SLA tiers.
  • Ship rules engine for high-confidence fast paths before ML classifier.
  • Train stage-2 classifier on human-reviewed production traces; hold out weekly eval set.
  • Reserve LLM judge for borderline confidence band with hard timeout.
  • Document tier matrix: model id, context cap, tools, steps, cost multiplier.
  • Encode routing policies as versioned config with shadow and canary promotion.
  • Implement mid-run escalation triggers and structured context handoff.
  • Cap escalations per run; log escalation reason on audit span.
  • Wire routing fields into cost attribution and CSAT-by-tier dashboards.
  • Alert on false-mini rate and classifier confidence drift.
  • Regression-test golden set on every classifier or policy deploy.
  • Document user-visible behavior when tier promotes (optional transparency copy).

Key takeaways

  • Model routing optimizes steady-state cost-quality; fallback routing optimizes outage resilience — you need both.
  • Three-stage classification (rules, ML, borderline judge) keeps routing latency under 40 ms p95.
  • Mid-run escalation recovers from wrong upfront tier without failing the user.
  • Write-tool paths must never default to mini tier — policy enforcement beats prompt pleading.
  • Harbor Desk cut overspend on simple intents from 52% to 18% and blended unit cost 41% with intent routing, tier matrix, and escalation — CSAT unchanged.

Related reading