Guide

LLM agent run priority queue and admission control systems explained

Harbor Analytics ships a conversational BI agent: users ask natural-language questions, the runtime plans SQL, retrieves schema context, and streams an answer with chart suggestions. Nightly cron jobs also trigger bulk re-indexing of 40,000 table descriptions and embedding refresh across every tenant warehouse. On Tuesday evenings both workloads shared one unbounded worker pool. Interactive dashboard queries queued behind six-hour batch jobs; p95 time-to-first-token hit 38 seconds against a 4-second SLA. FinOps logged 47% of P0 runs as SLA misses that week — not because models were slow, but because nothing decided which run should start when capacity was scarce.

Run priority queues and admission control answer that decision at the front door: accept, defer, shed, or downgrade before tokens burn and tools fire. This is distinct from rate limiting (shape request velocity over time) and from circuit breakers (isolate failing dependencies). Admission control asks: “given current load, should this run enter the fleet now, later, or not at all?” Harbor Analytics split lanes, capped batch concurrency, and added weighted fair queuing per tenant. SLA misses on interactive traffic fell from 47% to 5.8% within one release cycle. This guide covers the scheduler stack, priority tiers, admission policies, queue discipline, the Harbor refactor, a decision table versus throttling alone, pitfalls, and a production checklist.

Admission control vs rate limiting vs batching

Three layers often confuse teams because all three reduce overload:

  • Rate limiting — token buckets and quotas cap requests per second per tenant, user, or API key. It prevents sustained floods but does not prioritize an urgent run over a background job that already passed the limit.
  • Admission control — evaluates each run at ingress against global concurrency, queue depth, budget headroom, and priority class. Can reject with 503 + Retry-After, enqueue with position, or route to a degraded tier.
  • Inference batching — groups compatible model calls after admission, inside the GPU scheduler. See inference batching for coalescing windows and continuous batching.

Mature fleets chain them: rate limit at the edge, admit into priority queues, batch what reaches the model gateway.

Priority lane model (P0–P3)

Assign every run a priority class at enqueue time. Classes should be few, explicit, and mapped to product SLAs:

P0 — interactive / revenue-critical

  • Live chat, in-app copilots, checkout assistants, on-call paging bots.
  • Target: sub-second queue wait; never blocked by P2/P3 without emergency override.
  • Preempts lower lanes only when configured; preemption must cancel tool side-effects safely via cooperative cancellation.

P1 — near-real-time

  • Webhook-triggered workflows, human-in-the-loop approvals with 30–120 s deadlines, streaming notifications.
  • May wait behind P0 bursts but must not sit behind batch re-index jobs.

P2 — standard async

  • Email triage, document summarization, non-urgent cron triggers with hour-scale SLAs.
  • Elastic concurrency; fair-share across tenants.

P3 — bulk / offline

  • Embedding rebuilds, eval sweeps, backfills, shadow traffic for canary deploys.
  • Runs only when P0–P2 queue depth and GPU headroom allow; first candidate for shedding under stress.

Encode lane in the run envelope (priority_class, sla_deadline_ms, tenant_id) so schedulers never infer priority from prompt content.
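As a minimal sketch of such an envelope, assuming a JSON-like request payload: the three field names come from the text above, while `run_id`, `PriorityClass`, and `parse_envelope` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class PriorityClass(IntEnum):
    """Lower value means higher priority, so P0 sorts first."""
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class RunEnvelope:
    run_id: str  # hypothetical field, not named in the guide
    tenant_id: str
    priority_class: PriorityClass
    sla_deadline_ms: int

    def __post_init__(self) -> None:
        # Refuse malformed lanes instead of guessing a default.
        if not isinstance(self.priority_class, PriorityClass):
            raise ValueError("priority_class must be an explicit PriorityClass")
        if self.sla_deadline_ms <= 0:
            raise ValueError("sla_deadline_ms must be positive")


def parse_envelope(payload: dict) -> RunEnvelope:
    """Build the envelope from request metadata only; prompt text is never read."""
    return RunEnvelope(
        run_id=str(payload["run_id"]),
        tenant_id=str(payload["tenant_id"]),
        # Raises KeyError for an unknown lane name such as "P9".
        priority_class=PriorityClass[payload["priority_class"]],
        sla_deadline_ms=int(payload["sla_deadline_ms"]),
    )
```

Because the lane is an explicit enum, a payload whose prompt says "urgent" but whose metadata says P3 is still scheduled as P3.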

Admission gate policies

The admission gate sits between API ingress and worker dispatch. Common policy dimensions:

Concurrency caps

  • Global in-flight cap — max simultaneous runs across the fleet; protects shared model gateways and tool pools.
  • Per-tenant cap — prevents one customer from occupying the entire pool; pairs with tenant isolation.
  • Per-lane cap — e.g. max 200 P3 jobs even if GPUs are idle, reserving headroom for P0 spikes.
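The three caps can be checked together before a slot is reserved. The sketch below is one possible in-memory version; `ConcurrencyCaps`, `try_acquire`, and `release` are illustrative names, and a real gate would need a lock or an atomic store around the counters.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConcurrencyCaps:
    global_cap: int
    per_tenant_cap: int
    per_lane_cap: dict  # lane name -> max in-flight runs; lanes not listed are uncapped
    in_flight_total: int = 0
    in_flight_by_tenant: Counter = field(default_factory=Counter)
    in_flight_by_lane: Counter = field(default_factory=Counter)

    def try_acquire(self, tenant_id: str, lane: str) -> bool:
        """Reserve a slot only if the global, tenant, and lane caps all have headroom."""
        if self.in_flight_total >= self.global_cap:
            return False
        if self.in_flight_by_tenant[tenant_id] >= self.per_tenant_cap:
            return False
        lane_cap = self.per_lane_cap.get(lane)
        if lane_cap is not None and self.in_flight_by_lane[lane] >= lane_cap:
            return False
        self.in_flight_total += 1
        self.in_flight_by_tenant[tenant_id] += 1
        self.in_flight_by_lane[lane] += 1
        return True

    def release(self, tenant_id: str, lane: str) -> None:
        """Return the slot when the run finishes or is cancelled."""
        self.in_flight_total -= 1
        self.in_flight_by_tenant[tenant_id] -= 1
        self.in_flight_by_lane[lane] -= 1
```

The lane cap is checked even when the global pool has room, which is what keeps headroom reserved for P0 spikes.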

Queue depth and wait budgets

  • Reject or shed when queue_depth > max_depth for a lane.
  • Compute estimated_wait_ms from historical service rate; if the wait exceeds sla_deadline_ms - p95_run_duration, the run cannot finish inside its SLA, so shed it with 503 + Retry-After or route it to an async callback.
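A rough way to compute the wait estimate, assuming the service rate is measured as completed runs per second for the lane (the guide does not specify how the rate is derived, so `service_rate_per_s` and both function names are assumptions):

```python
def estimated_wait_ms(queue_depth: int, service_rate_per_s: float) -> float:
    """Estimate queue wait as depth divided by observed completions per second."""
    if service_rate_per_s <= 0:
        # No observed throughput: treat the wait as unbounded.
        return float("inf")
    return queue_depth / service_rate_per_s * 1000.0


def fits_wait_budget(
    queue_depth: int,
    service_rate_per_s: float,
    sla_deadline_ms: float,
    p95_run_duration_ms: float,
) -> bool:
    """True when the estimated wait leaves room for a p95-length run inside the SLA."""
    wait_budget_ms = sla_deadline_ms - p95_run_duration_ms
    return estimated_wait_ms(queue_depth, service_rate_per_s) <= wait_budget_ms
```

For example, 10 queued runs at 20 completions per second give an estimate of 500 ms, which fits a 4,000 ms deadline with a 3,000 ms p95 run duration; 30 queued runs at the same rate do not.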

Token and cost budgets

  • Check per-run and per-tenant remaining budget from the cost ledger before dispatch.
  • Hard-stop P3 when monthly burn exceeds 90% of plan; never hard-stop P0 without explicit operator flag.
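The budget rules above might be expressed as a single pre-dispatch check. This is a sketch under assumptions: costs are in arbitrary integer units, the treatment of P1/P2 (blocked only when the run no longer fits the remaining budget) is not stated in the guide, and `budget_allows` and `operator_allows_p0_stop` are hypothetical names.

```python
def budget_allows(
    lane: str,
    monthly_burn: int,
    plan_limit: int,
    run_estimate: int,
    operator_allows_p0_stop: bool = False,
) -> bool:
    """Return True if the run may be dispatched under the tenant's budget."""
    # P3 is hard-stopped once monthly burn exceeds 90% of the plan.
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    if lane == "P3" and monthly_burn * 100 > plan_limit * 90:
        return False
    remaining = plan_limit - monthly_burn
    if run_estimate <= remaining:
        return True
    # Over budget: P0 continues unless an operator explicitly allows the stop.
    if lane == "P0":
        return not operator_allows_p0_stop
    return False  # assumption: P1/P2 stop when the run no longer fits
```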

Shed vs defer vs degrade

| Action | When | Client signal |
| --- | --- | --- |
| Defer | Capacity full but SLA allows wait | 202 + queue position + poll URL |
| Shed | Queue depth or wait exceeds SLA | 503 + Retry-After |
| Degrade | P0 pressure; P2/P3 still requested | Run on smaller model tier via fallback ladder |
| Reject | Invalid priority, budget exhausted, policy block | 402 or 403 with reason code |
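One way to turn these actions into a gate decision is a single ordered function. The evaluation order, the choice of 402 for budget exhaustion versus 403 for invalid priority and policy blocks, and all names (`Action`, `Decision`, `decide`) are assumptions made for this sketch; the guide only lists the actions and their signals.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ADMIT = "admit"
    DEFER = "defer"
    SHED = "shed"
    DEGRADE = "degrade"
    REJECT = "reject"


@dataclass(frozen=True)
class Decision:
    action: Action
    http_status: Optional[int]  # None where the guide names no status code
    reason: str


def decide(
    *,
    lane: str,
    policy_blocked: bool,
    budget_exhausted: bool,
    capacity_available: bool,
    p0_pressure: bool,
    estimated_wait_ms: float,
    wait_budget_ms: float,
) -> Decision:
    # Rejections come first: they never consume queue space.
    if lane not in {"P0", "P1", "P2", "P3"}:
        return Decision(Action.REJECT, 403, "invalid_priority")
    if policy_blocked:
        return Decision(Action.REJECT, 403, "policy_block")
    if budget_exhausted:
        return Decision(Action.REJECT, 402, "budget_exhausted")
    # Under P0 pressure, lower lanes run on a smaller model tier.
    if p0_pressure and lane in {"P2", "P3"}:
        return Decision(Action.DEGRADE, None, "p0_pressure_smaller_model_tier")
    if capacity_available:
        return Decision(Action.ADMIT, None, "capacity_available")
    # No free slot: shed if the wait would break the SLA, otherwise defer.
    if estimated_wait_ms > wait_budget_ms:
        return Decision(Action.SHED, 503, "wait_exceeds_sla")
    return Decision(Action.DEFER, 202, "queued_within_sla")
```

Returning a structured reason alongside the status code gives dashboards and clients the same explanation for every non-admit outcome.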

Weighted fair queuing across tenants

A single global FIFO queue lets a noisy neighbor stall everyone. Use weighted fair queuing (WFQ) or deficit round robin per tenant_id:

  • Each tenant receives a weight from plan tier (enterprise 8, growth 4, free 1).
  • Scheduler picks the tenant with the lowest virtual_finish_time, then dequeues one run from that tenant’s sub-queue.
  • Within a tenant, still respect P0 > P1 > P2 > P3 ordering.

WFQ does not replace per-tenant hard caps — a tenant at weight 8 can still monopolize if caps are absent. Combine weights with max concurrent runs and max queue depth per tenant.
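The selection rule can be sketched with a simplified virtual-time scheduler. This is not a full WFQ implementation: every run is assumed to cost one unit of service, ties are broken by tenant name, hard caps are omitted, and `WeightedFairQueue` with its methods is an illustrative name.

```python
import heapq
import itertools
from collections import defaultdict


class WeightedFairQueue:
    def __init__(self, weights: dict, default_weight: int = 1):
        self._weights = dict(weights)
        self._default_weight = default_weight
        self._queues = defaultdict(list)        # tenant -> heap of (priority, seq, run_id)
        self._last_finish = defaultdict(float)  # tenant -> virtual finish of its last served run
        self._virtual_time = 0.0
        self._seq = itertools.count()           # FIFO tie-breaker inside a priority class

    def enqueue(self, tenant_id: str, priority: int, run_id: str) -> None:
        if not self._queues[tenant_id]:
            # A tenant that was idle earns no credit for the time it was away.
            self._last_finish[tenant_id] = max(
                self._last_finish[tenant_id], self._virtual_time
            )
        heapq.heappush(self._queues[tenant_id], (priority, next(self._seq), run_id))

    def _virtual_finish(self, tenant_id: str) -> float:
        weight = self._weights.get(tenant_id, self._default_weight)
        return self._last_finish[tenant_id] + 1.0 / weight

    def dequeue(self):
        """Serve the backlogged tenant with the lowest virtual finish time."""
        backlogged = [t for t, q in self._queues.items() if q]
        if not backlogged:
            return None
        tenant = min(backlogged, key=lambda t: (self._virtual_finish(t), t))
        start = self._last_finish[tenant]
        self._last_finish[tenant] = self._virtual_finish(tenant)
        self._virtual_time = max(self._virtual_time, start)
        # Inside the tenant, the lowest priority number (P0) is served first.
        _, _, run_id = heapq.heappop(self._queues[tenant])
        return tenant, run_id
```

With weights of 8 and 1 and both tenants backlogged, the weight-8 tenant receives eight of every nine dispatches, which is also why the hard caps described above are still needed.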

Run lifecycle FSM at the queue boundary

Expose explicit states so clients, observability, and traces agree on what happened:

  1. RECEIVED — request validated, idempotency key recorded.
  2. ADMITTED — passed gate; assigned lane and queue position.
  3. QUEUED — waiting for worker slot; emit heartbeat with position and eta_ms.
  4. DISPATCHED — worker claimed run; tools and model calls begin.
  5. SHED / REJECTED — terminal without execution; include structured reason for dashboards.

Persist transitions in the run store so retries after 503 do not double-charge or duplicate side effects when paired with idempotency keys.
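A small in-memory sketch of the state machine and run store follows. The allowed transitions into SHED and REJECTED are an assumption (the guide lists the states but not every edge), DISPATCHED is treated as the last state only because the list above ends there, and `RunStore` is a hypothetical stand-in for a durable store. Re-admitting a shed run on a later retry is left out.

```python
from enum import Enum


class RunState(Enum):
    RECEIVED = "RECEIVED"
    ADMITTED = "ADMITTED"
    QUEUED = "QUEUED"
    DISPATCHED = "DISPATCHED"
    SHED = "SHED"
    REJECTED = "REJECTED"


# Assumed edges: a run can be shed at the gate or while queued, never after dispatch.
ALLOWED = {
    RunState.RECEIVED: {RunState.ADMITTED, RunState.SHED, RunState.REJECTED},
    RunState.ADMITTED: {RunState.QUEUED},
    RunState.QUEUED: {RunState.DISPATCHED, RunState.SHED},
    RunState.DISPATCHED: set(),
    RunState.SHED: set(),
    RunState.REJECTED: set(),
}


class RunStore:
    def __init__(self) -> None:
        self._history = {}  # idempotency key -> ordered list of states

    def receive(self, idempotency_key: str) -> RunState:
        """Register a run; a retry with the same key reuses the existing record."""
        history = self._history.setdefault(idempotency_key, [RunState.RECEIVED])
        return history[-1]

    def transition(self, idempotency_key: str, new_state: RunState) -> None:
        history = self._history[idempotency_key]
        current = history[-1]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current.value} -> {new_state.value}")
        history.append(new_state)

    def history(self, idempotency_key: str) -> list:
        return list(self._history[idempotency_key])
```

A client that resubmits with the same key sees the recorded state instead of creating a second run, which is what prevents duplicate charges and side effects.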

Harbor Analytics refactor

The incident above traced to three design gaps:

  1. Single FIFO queue for chat and embedding rebuilds.
  2. No P3 concurrency cap — cron fired 400 parallel re-index runs at 02:00 UTC.
  3. No admission signal to clients — dashboards hung until timeout instead of returning 202 with ETA.

The fix shipped in layers:

  • Four lane-specific queues backed by Redis streams with consumer groups.
  • Admission gate: P0 max wait 500 ms; shed P3 when P0 queue depth > 20.
  • P3 hard cap of 40 concurrent runs; remainder deferred to off-peak windows.
  • WFQ per tenant on P1/P2 with enterprise weight 6 vs free weight 1.
  • Streaming UI shows queue position; async webhook for P2 completions.
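The P3 rules in this list could be captured as a small policy object. The thresholds are the ones quoted above; the class, the function, and the string return values are illustrative and not Harbor's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarborGatePolicy:
    p0_max_wait_ms: int = 500          # P0 max queue wait
    p0_depth_shed_threshold: int = 20  # shed P3 when P0 queue depth exceeds this
    p3_max_concurrent: int = 40        # hard cap on concurrent P3 runs


def p3_admission(policy: HarborGatePolicy, p0_queue_depth: int, p3_in_flight: int) -> str:
    """Decide what happens to an incoming P3 run under the quoted thresholds."""
    if p0_queue_depth > policy.p0_depth_shed_threshold:
        return "shed"
    if p3_in_flight >= policy.p3_max_concurrent:
        return "defer"  # the remainder waits for an off-peak window
    return "admit"
```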

Interactive SLA misses dropped from 47% to 5.8%. P3 jobs completed 20% later on average — acceptable for offline indexing. GPU cost was flat because the fleet stopped thrashing on pointless queue churn.

Technique decision table

| Approach | Best for | Weak when |
| --- | --- | --- |
| Rate limiting alone | Steady QPS caps, provider quota compliance | Mixed-priority bursts inside the limit |
| Priority queues + admission | Interactive vs batch coexistence, SLA tiers | Single-tenant, uniform workload |
| Weighted fair queuing | Multi-tenant fairness, plan-tier differentiation | Two tenants only; overhead not justified |
| Circuit breakers | Failing downstream dependencies | Healthy deps but saturated workers |
| Inference batching | GPU utilization after admission | Queue wait dominates before GPU |
| Horizontal pod autoscaling | Stateless API tier growth | Model gateway or tool pool is the bottleneck |

Common pitfalls

  • Inferring priority from prompt text — attackers or noisy prompts jump the queue; require signed lane metadata.
  • One queue for all models — a slow 70B tier blocks fast 8B triage; partition by model pool.
  • Unbounded defer loops — P2 jobs retry forever; cap attempts and route to dead letter queue.
  • Missing queue-wait metrics — teams scale GPUs while p95 delay is 30 s of waiting, not inference.
  • Preemption without cancellation: preempted runs keep burning tokens and tool calls; propagate cancel tokens.
  • Fair queuing without hard caps — one tenant still floods; weights need ceilings.
  • Shedding without client contract — opaque timeouts cause duplicate submits; return structured Retry-After.
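For the unbounded defer loop pitfall, a bounded requeue helper is one possible guard. The attempt limit of 5 and the names `DeferredRun` and `requeue_or_dead_letter` are assumptions for illustration, and plain deques stand in for real queues.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DeferredRun:
    run_id: str
    attempts: int = 0


def requeue_or_dead_letter(
    run: DeferredRun,
    retry_queue: deque,
    dead_letter_queue: deque,
    max_attempts: int = 5,
) -> bool:
    """Requeue a deferred run, or dead-letter it once the attempt cap is reached.

    Returns True if the run was requeued and False if it was dead-lettered.
    """
    run.attempts += 1
    if run.attempts >= max_attempts:
        dead_letter_queue.append(run)
        return False
    retry_queue.append(run)
    return True
```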

Production checklist

  • Define P0–P3 classes with documented SLA and max queue wait per class.
  • Implement admission gate before worker dispatch, not inside the agent loop.
  • Enforce global, per-lane, and per-tenant concurrency caps.
  • Add WFQ or deficit round robin for multi-tenant P1/P2 queues.
  • Emit queue position, estimated wait, and lane in API responses.
  • Record ADMITTED / QUEUED / SHED transitions in the run audit trail.
  • Pair admission with idempotency keys on client retry.
  • Cap P3 concurrency; schedule bulk work in off-peak windows.
  • Alert when P0 queue wait p95 exceeds 10% of SLA budget.
  • Load-test mixed P0 burst + P3 batch, not homogeneous traffic.
  • Reconcile queue-wait time vs model latency before buying more GPUs.
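The P0 queue-wait alert from the checklist can be approximated with a nearest-rank percentile over recent wait samples. This is a sketch: the sampling window, the nearest-rank method, and the function names are assumptions.

```python
def p95_nearest_rank(samples_ms: list) -> float:
    """Return the 95th percentile using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p0_wait_alert(queue_waits_ms: list, sla_ms: float, budget_percent: int = 10) -> bool:
    """True when p95 queue wait exceeds the given share of the SLA budget."""
    threshold_ms = sla_ms * budget_percent / 100
    return p95_nearest_rank(queue_waits_ms) > threshold_ms
```

With a 4,000 ms SLA the threshold is 400 ms, so a window where one run in ten waits 900 ms trips the alert even if the median wait looks healthy.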

Key takeaways

  • Admission control decides whether a run starts now, later, or never.
  • Priority lanes protect interactive traffic from batch and cron floods.
  • Weighted fair queuing spreads capacity across tenants without pure FIFO starvation.
  • Rate limits shape velocity; queues arbitrate scarcity under mixed SLAs.
  • Harbor Analytics cut interactive SLA misses from 47% to 5.8% through lane separation, admission gating, and concurrency caps, with GPU cost flat.
