
LLM intent classification and query routing explained

Harbor Support's first AI triage bot treated every message the same: embed the question, retrieve twelve policy chunks, generate an answer. “Where is my refund?” and “Explain your GDPR data-retention schedule for EU subprocessors” both paid for a full RAG pass. Answers to lookup queries hallucinated order IDs pulled from unrelated chunks. Policy deep-dives sometimes skipped retrieval entirely when the model guessed from pretraining. Median latency was 4.2 seconds; agents escalated 34% of bot threads anyway.

The refactor added an intent classification layer upstream: a fast classifier labels each message (lookup, policy research, action request, chitchat, abuse) and routes it to a purpose-built pipeline — direct function call for order status, RAG only for research intents, canned FAQ for pricing basics, human queue for disputes. Median latency dropped to 1.6 seconds; status-query hallucinations fell from 19% to under 1%; RAG token spend fell 48%. This guide covers intent taxonomies, classifier techniques, confidence gates and fallbacks, mapping intents to pipelines, the Harbor Support refactor, a technique decision table versus one-pipeline-for-everything, pitfalls, and a production checklist.

What intent classification does (and what it is not)

Intent classification assigns a structured label to a user message before the expensive work begins. It answers: what kind of task is this? That is different from model routing, which picks which model serves a task you already chose. Intent routing picks which pipeline handles the message: RAG, tool execution, template reply, retrieval-free generation, or human handoff.

Good intent layers are cheap (milliseconds, sub-cent), logged per request, and retrainable from production misroutes. They sit after input guardrails and before retrieval or tool selection. A single user turn can carry multiple intents (“Cancel order 8842 and tell me your refund policy”) — your schema must say whether you split, prioritize, or ask a clarifying question.

Intent taxonomy for production assistants

Taxonomies should be small, mutually exclusive at routing time, and aligned to pipelines you actually built. Ten labels that map to one handler are useless.

| Intent class | User goal | Typical pipeline |
|---|---|---|
| Lookup / status | Fetch a specific record (order, ticket, balance) | Auth check + function call or SQL; no RAG |
| Policy / research | Explain rules, compare options, cite sources | RAG with citations + optional reranker |
| Action / mutation | Cancel, refund, update, file a claim | Tool call with confirmation step + idempotency |
| FAQ / navigational | Hours, pricing tiers, how to reset password | Template or retrieval-free small model |
| Conversational | Greetings, thanks, small talk | Short direct generation; skip retrieval |
| Clarification needed | Ambiguous entity, missing ID, conflicting asks | Ask-back prompt; do not guess |
| Escalation / sensitive | Legal threat, self-harm, regulatory complaint | Human queue + policy hold; block automation |
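
One way to keep the taxonomy aligned with the pipelines is to encode both in a single place and check for labels that have no handler. The sketch below is illustrative only: the enum members mirror the table, while the pipeline names and the `orphan_labels` helper are hypothetical.

```python
from enum import Enum
from typing import List


class Intent(Enum):
    LOOKUP = "lookup"
    RESEARCH = "research"
    ACTION = "action"
    FAQ = "faq"
    CONVERSATIONAL = "conversational"
    CLARIFY = "clarify"
    ESCALATE = "escalate"


# Hypothetical pipeline names; each label maps to exactly one handler.
PIPELINE_FOR = {
    Intent.LOOKUP: "function_call",
    Intent.RESEARCH: "rag_with_citations",
    Intent.ACTION: "tool_call_with_confirm",
    Intent.FAQ: "template_reply",
    Intent.CONVERSATIONAL: "direct_generation",
    Intent.CLARIFY: "ask_back",
    Intent.ESCALATE: "human_queue",
}


def orphan_labels() -> List[Intent]:
    """Return intents with no pipeline; a healthy schema returns an empty list."""
    return [intent for intent in Intent if intent not in PIPELINE_FOR]
```

Running `orphan_labels()` in a unit test is a cheap guard against adding a label before its handler exists.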

Domain-specific intents (billing dispute vs shipping delay) matter when each has a different tool surface or compliance rule. Start with six to eight classes; split only when misroute cost justifies new training data.

Classifier techniques

Rule and keyword routers

Regular expressions, keyword lists, and metadata triggers (channel = SMS, user tier = enterprise) are fast and auditable. Use them for high-precision paths: if the message matches order\s*#?\d{5,}, route to order lookup without a model call. Rules fail on paraphrase (“what happened to the thing I bought Tuesday”) so pair them with a learned backstop.
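
A minimal sketch of such a rule layer, using the order-ID pattern above. The escalation terms, the label strings, and the choice to check escalation before the order pattern are assumptions made for illustration.

```python
import re
from typing import Optional

# Pattern from the text: "order", optional whitespace, optional "#", 5+ digits.
ORDER_ID = re.compile(r"order\s*#?\d{5,}", re.IGNORECASE)

# Illustrative escalation keywords; a real list would be reviewed by policy owners.
ESCALATION_TERMS = ("lawyer", "attorney general")


def route_by_rules(text: str) -> Optional[str]:
    """Return an intent label on a high-precision match, else None."""
    lowered = text.lower()
    # Assumption: sensitive terms win even when an order ID is also present.
    if any(term in lowered for term in ESCALATION_TERMS):
        return "escalate"
    if ORDER_ID.search(text):
        return "lookup"
    # No rule fired; hand the message to the learned backstop.
    return None
```

The paraphrase “what happened to the thing I bought Tuesday” returns `None` here, which is exactly the case the learned classifier has to cover.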

Embedding similarity classifiers

Embed the user message and compare it against labeled exemplar centroids or an embedding index of canonical queries per intent. This is cheap at scale and works well when intents are semantically separated. Tune distance thresholds per class: FAQ can be loose; action intents should be strict to avoid accidental refunds.
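
The centroid comparison with per-class thresholds can be sketched in plain Python. The embedding step itself is assumed to happen elsewhere, so the functions take vectors as input; the function names and any threshold values you pass in are illustrative, not tuned.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of the exemplar embeddings for one intent."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(
    vec: Sequence[float],
    centroids: Dict[str, Sequence[float]],
    min_similarity: Dict[str, float],
) -> Optional[str]:
    """Pick the nearest centroid, then apply that class's own threshold."""
    best = max(centroids, key=lambda label: cosine(vec, centroids[label]))
    if cosine(vec, centroids[best]) >= min_similarity[best]:
        return best
    return None  # abstain: the nearest class was not close enough
```

With a loose threshold for `faq` and a strict one for `action`, a message that sits between the two classes abstains instead of triggering an action.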

Small supervised models

Fine-tune a compact encoder (DeBERTa, ModernBERT, distilled BERT) on a few thousand labeled tickets. Inference stays under 50 ms on CPU. This is the workhorse for support bots with stable intent sets. Export confusion matrices weekly; merge classes that operators cannot distinguish.
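
For the weekly confusion review, a small helper can rank which label pairs are mistaken for each other most often. This is a sketch over plain label lists; the function names are hypothetical and no particular training framework is assumed.

```python
from collections import Counter
from typing import List, Tuple


def confusion_counts(gold: List[str], predicted: List[str]) -> Counter:
    """Count (gold_label, predicted_label) pairs."""
    return Counter(zip(gold, predicted))


def most_confused_pairs(
    gold: List[str], predicted: List[str], top_n: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Rank unordered label pairs by how often one was mistaken for the other."""
    pair_errors: Counter = Counter()
    for (g, p), count in confusion_counts(gold, predicted).items():
        if g != p:
            # Sort so (faq, research) and (research, faq) count as one pair.
            pair_errors[tuple(sorted((g, p)))] += count
    return pair_errors.most_common(top_n)
```

Pairs that stay at the top of this list week after week are candidates for merging.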

LLM-as-classifier

A structured-output call (“return JSON: {"intent":"...","confidence":0.0-1.0}”) handles edge cases and multi-intent decomposition. It is costlier and slower, so reserve it for the low-confidence band (for example 0.45–0.75) after a fast classifier, or for offline relabeling. Cache labels for repeated phrasing in a semantic cache.
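
Model output should be validated before it drives routing. The sketch below gates the call on the 0.45–0.75 band and rejects malformed or out-of-schema JSON. The allowed label set, the function names, and the choice to treat the band as inclusive at the bottom and exclusive at the top are assumptions; the model call itself is deliberately left out.

```python
import json
from typing import Optional

# Illustrative label set; use whatever your taxonomy defines.
ALLOWED_INTENTS = {"lookup", "research", "action", "faq", "conversational"}
BAND_LOW, BAND_HIGH = 0.45, 0.75


def needs_llm(fast_confidence: float) -> bool:
    """Only the ambiguous band pays for the LLM call."""
    return BAND_LOW <= fast_confidence < BAND_HIGH


def parse_llm_label(raw: str) -> Optional[dict]:
    """Return a validated {"intent", "confidence"} dict, or None if unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    intent = data.get("intent")
    confidence = data.get("confidence")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {"intent": intent, "confidence": float(confidence)}
```

A `None` result should be handled like an abstain rather than retried indefinitely.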

Mapping intents to pipelines

The classifier is only useful if downstream handlers exist and enforce constraints:

  • Lookup — require authenticated session; call read-only APIs; never let the generator invent IDs. Return structured cards, not prose essays.
  • Research — run hybrid search, rerank, then generate with mandatory citations.
  • Action — two-step confirm for irreversible ops; use structured tool errors and idempotency keys.
  • FAQ — serve approved snippets; log drift when users rephrase into research territory.
  • Low confidence — ask a disambiguation question or offer buttons; do not default to RAG (“here is everything we know”).
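
A dispatch table makes these constraints explicit. The following is a sketch only: the handler bodies return placeholder strings, the `session` dictionary shape is hypothetical, and the 0.55 default threshold is an illustrative value.

```python
from typing import Callable, Dict

ASK_BACK = "Is this about an existing order, a policy question, or a change to your account?"


def handle_lookup(text: str, session: dict) -> str:
    # Lookup requires an authenticated session before any read-only API call.
    if not session.get("authenticated"):
        return "Please sign in so I can look that up."
    return "[structured order card]"


def handle_research(text: str, session: dict) -> str:
    # Placeholder for hybrid search, rerank, and cited generation.
    return "[answer with citations]"


HANDLERS: Dict[str, Callable[[str, dict], str]] = {
    "lookup": handle_lookup,
    "research": handle_research,
}


def dispatch(intent: str, confidence: float, text: str, session: dict,
             threshold: float = 0.55) -> str:
    handler = HANDLERS.get(intent)
    # Unknown label or low confidence: ask, rather than defaulting to RAG.
    if handler is None or confidence < threshold:
        return ASK_BACK
    return handler(text, session)
```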

Log the tuple (raw_text, predicted_intent, confidence, chosen_pipeline, outcome) for every turn. Misroutes become training rows; that closed loop matters more than squeezing another 0.5% accuracy on a static dev set.
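
The logged tuple maps naturally onto a small record type written as one JSON line per turn. In this sketch the field names follow the tuple above; the `"misroute"` outcome value and the helper names are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class RoutingRecord:
    raw_text: str
    predicted_intent: str
    confidence: float
    chosen_pipeline: str
    outcome: str  # hypothetical values such as "resolved" or "misroute"


def to_log_line(record: RoutingRecord) -> str:
    """Serialize one turn as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def misroutes(records: List[RoutingRecord]) -> List[RoutingRecord]:
    """Select the turns that should go into the relabel queue."""
    return [r for r in records if r.outcome == "misroute"]
```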

Harbor Support triage refactor

Harbor's v2 router runs in three stages:

  1. Rules (0.1 ms) — order ID regex, tracking-number format, escalation keywords (“lawyer”, “attorney general”) to human queue.
  2. DeBERTa-v3-small (18 ms) — seven intents trained on 9,400 labeled tickets; abstain below 0.55 confidence.
  3. GPT-4o-mini classifier (band 0.55–0.75 only) — structured JSON intent + slot extraction for order IDs missed by regex.
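
The three stages can be sketched as a cascade in which each stage is passed in as a callable. This is not Harbor's code: the function signature is hypothetical, the thresholds are the ones quoted above, and mapping an abstain to a clarification intent is an assumption.

```python
from typing import Callable, Optional, Tuple

ABSTAIN_BELOW = 0.55   # the encoder abstains below this confidence
LLM_BAND_UPPER = 0.75  # encoder scores in [0.55, 0.75) go to the LLM stage

Prediction = Tuple[str, float]  # (intent, confidence)


def route(
    text: str,
    rules: Callable[[str], Optional[str]],
    encoder: Callable[[str], Prediction],
    llm: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (intent, name of the stage that decided)."""
    rule_hit = rules(text)
    if rule_hit is not None:
        return rule_hit, "rules"
    intent, confidence = encoder(text)
    if confidence < ABSTAIN_BELOW:
        # Assumption: an abstain becomes an ask-back instead of a guess.
        return "clarify", "abstain"
    if confidence < LLM_BAND_UPPER:
        return llm(text), "llm"
    return intent, "encoder"
```

Returning the deciding stage alongside the intent makes it easy to see in logs how much traffic each stage absorbs.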

Lookup intents hit the billing API via parallel tool calls when users bundle questions. Research intents use RAG with six chunks max. Action intents require explicit “Confirm cancel” UI. Chitchat bypasses retrieval entirely. After eight weeks, first-contact resolution rose 22%; average RAG cost per session fell from $0.09 to $0.047.

Technique decision table

| Scenario | Preferred approach | Avoid |
|---|---|---|
| High-volume support bot, stable intents | Rules + small supervised classifier + confidence abstain | Full RAG on every message |
| Rapidly evolving product surface | Embedding exemplar index + weekly relabel job | Hard-coded keyword-only router |
| Multi-intent compound queries | LLM decomposer → sequential pipeline per sub-intent | Pick single winner intent silently |
| Regulated actions (refunds, medical) | Strict action intent + human confirm + audit log | Soft prompt (“please be careful”) |
| Open-domain creative assistant | Light conversational vs tool-use split only | Over-fitting 20 micro-intents |
| Low traffic, prototyping | LLM-as-classifier with few-shot exemplars | Building seven pipelines before traffic proves need |
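
For the multi-intent row, the sequential pattern might look like the sketch below. The decomposer output is hand-written here as a stand-in for an LLM call, and the handler lambdas and reply strings are placeholders.

```python
from typing import Callable, Dict, List, Tuple

SubIntent = Tuple[str, str]  # (intent label, text span the label covers)


def run_sequential(
    sub_intents: List[SubIntent],
    handlers: Dict[str, Callable[[str], str]],
) -> List[str]:
    """Run every sub-intent in order; never silently drop one."""
    results = []
    for intent, span in sub_intents:
        handler = handlers.get(intent)
        if handler is None:
            results.append(f"Could you clarify: {span!r}?")
        else:
            results.append(handler(span))
    return results


# Hand-written stand-in for what an LLM decomposer might return.
decomposed = [
    ("action", "Cancel order 8842"),
    ("research", "tell me your refund policy"),
]
handlers = {
    "action": lambda span: f"[confirm step] {span}",
    "research": lambda span: f"[cited answer] {span}",
}
replies = run_sequential(decomposed, handlers)
```

Each sub-intent still passes through its own pipeline constraints, so the cancel request gets its confirmation step even when bundled with a policy question.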

Common pitfalls

  • One pipeline for everything. RAG on lookups invites hallucinated IDs; skipping RAG on research invites stale pretraining answers.
  • Too many intents. Operators cannot label consistently; merge classes until inter-annotator agreement stabilizes.
  • No abstain path. Forcing a label at 0.41 confidence routes garbage to the wrong API.
  • Classifier evaluated offline only. Pipeline-level metrics (task success, citation accuracy) beat intent F1 alone.
  • Ignoring conversation context. “Yes, cancel it” only makes sense with the prior turn's intent carried over from conversation history; classifying it in isolation fails.
  • Routing without auth gates. Lookup pipelines must verify the caller owns the record before tool execution.
  • Stale exemplar banks. Embedding routers drift when product vocabulary changes; refresh centroids after launches.
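
The conversation-context pitfall can be illustrated with a small carryover step that runs after the isolated classifier. This is a sketch under assumptions: the `pending` dictionary shape, the affirmation pattern, and the function name are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Illustrative affirmation opener; the word boundary keeps "Yesterday" from matching.
AFFIRMATION = re.compile(r"^\s*(yes|yep|yeah|ok|okay|confirm|do it)\b", re.IGNORECASE)


def resolve_turn(
    text: str, isolated_intent: str, pending: Optional[dict]
) -> Tuple[str, dict]:
    """Reuse the prior turn's unconfirmed action when the user affirms it.

    `pending` is a hypothetical record such as
    {"intent": "action", "slots": {"order_id": "..."}} or None.
    """
    if pending is not None and AFFIRMATION.match(text):
        return pending["intent"], pending["slots"]
    return isolated_intent, {}
```

The confirmation step for irreversible actions still applies after carryover; this only restores which action and which record the user is talking about.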

Production checklist

  • Intent schema maps 1:1 to implemented pipelines (no orphan labels).
  • Rule layer covers high-precision patterns (IDs, tracking numbers, escalation terms).
  • Fast classifier trained on production-like phrasing, not FAQ copy alone.
  • Per-intent confidence thresholds with explicit abstain / ask-back behavior.
  • LLM fallback defined for the ambiguous confidence band only (cost-bounded).
  • Multi-intent policy documented: split, sequential, or clarify.
  • Auth and ACL checks enforced per pipeline before tools or RAG.
  • Structured logging: intent, confidence, pipeline, latency, outcome.
  • Weekly misroute review queue feeding relabel and retrain jobs.
  • End-to-end eval sets per intent (not just classifier accuracy).
  • Dashboards: misroute rate, abstain rate, cost per intent, escalation rate.
  • Runbook for adding a new intent without redeploying the entire stack.
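
The dashboard figures in the checklist can be derived directly from the structured logs. A minimal sketch follows, assuming each log row is a dictionary with hypothetical `intent`, `pipeline`, `outcome`, and `cost_usd` keys, and that `ask_back` and `human_queue` are the pipeline names for abstains and escalations.

```python
from collections import defaultdict
from typing import Dict, List


def routing_metrics(rows: List[dict]) -> Dict[str, object]:
    """Compute misroute, abstain, and escalation rates plus total cost per intent."""
    total = len(rows)
    if total == 0:
        return {"misroute_rate": 0.0, "abstain_rate": 0.0,
                "escalation_rate": 0.0, "cost_per_intent": {}}
    cost_per_intent = defaultdict(float)
    for row in rows:
        cost_per_intent[row["intent"]] += row["cost_usd"]
    return {
        "misroute_rate": sum(r["outcome"] == "misroute" for r in rows) / total,
        "abstain_rate": sum(r["pipeline"] == "ask_back" for r in rows) / total,
        "escalation_rate": sum(r["pipeline"] == "human_queue" for r in rows) / total,
        "cost_per_intent": dict(cost_per_intent),
    }
```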

Key takeaways

  • Intent routing picks the pipeline; model routing picks the model. You need both in serious deployments.
  • Keep taxonomies small and pipeline-aligned. Labels nobody acts on are noise.
  • Layer cheap classifiers first. Rules and small encoders catch most traffic.
  • Abstain beats misroute. A clarifying question is cheaper than a wrong refund.
  • Measure task success, not just F1. The right label into a broken handler still fails.
