LLM agent SLA and deadline enforcement systems explained

Harbor Desk, a B2B customer-support agent platform, marketed an 8-second p95 reply SLA on its pricing page. Engineering measured SLA compliance on the first model call only, which had a 2.1-second median, and declared victory. Production told a different story: agents routinely chained three CRM tools (account lookup, ticket history, refund eligibility) before drafting a reply. Each tool carried its own 12-second HTTP timeout with no parent budget. End-to-end p95 latency hit 34 seconds while the chat widget showed a generic spinner with no partial text. Enterprise customers logged 51% of sessions as SLA misses in their own APM overlays, even though Harbor's internal dashboard reported 94% “model SLA” compliance. Churn risk spiked on the highest-paying tier.

The fix was not faster models. Harbor introduced end-to-end deadline tokens minted at ingress, tiered SLO profiles (interactive, standard, async), soft cutoffs that stream partial answers, and hard cutoffs that cancel tool trees cooperatively via the cancellation lifecycle. Misses on the customer-facing 8-second SLO fell from 51% to 6.4% within one release train. This guide explains what SLA deadline enforcement is, how it differs from queue admission and per-call timeouts, and how budgets propagate across tools and subagents. It then covers degradation ladders, the Harbor Desk refactor, a decision table, pitfalls, and a production checklist.

What SLA deadline enforcement does

An SLA deadline enforcement system guarantees that a run respects a customer-visible latency contract from HTTP ingress through final token or terminal error — not just the LLM forward pass. It answers: how much time remains for this entire agent trajectory, and what should happen when the budget is exhausted?

Three layers often get conflated:

Admission control — whether a run starts now or waits in a priority queue (queue wait is part of SLA unless excluded explicitly).
Per-step timeouts — max duration for one model call or tool HTTP request.
End-to-end deadline — wall-clock budget for the whole run, decremented by every step including queue wait, retrieval, tools, and subagent delegation.

Without layer three, per-step timeouts can add up to far more than the marketed SLA. Three sequential tools at 12 seconds each can consume up to 36 seconds even when each step “passed” its local timeout check.
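The arithmetic is worth making explicit. The following is a minimal sketch, with hypothetical function names, that compares the worst-case sum of sequential per-step timeouts against the marketed end-to-end budget:

```python
def worst_case_latency_s(step_timeouts_s: list[float]) -> float:
    """Worst case for a sequential chain: every step runs right up to its timeout."""
    return sum(step_timeouts_s)


def fits_sla(step_timeouts_s: list[float], sla_budget_s: float) -> bool:
    """True only if the chain cannot exceed the end-to-end budget."""
    return worst_case_latency_s(step_timeouts_s) <= sla_budget_s


crm_tool_timeouts_s = [12.0, 12.0, 12.0]  # three CRM tools at 12 s each
print(worst_case_latency_s(crm_tool_timeouts_s))        # 36.0
print(fits_sla(crm_tool_timeouts_s, sla_budget_s=8.0))  # False
```

A check like this could run in CI against the tool registry, so a new tool with a generous timeout is flagged before it ships.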

Tiered SLO profiles

Not every run deserves the same budget. Production systems define SLO profiles attached at ingress from API key, route, or signed client metadata:

Profile	Typical budget	User expectation
Interactive / P0	3–10 s	Chat widget, copilot inline assist
Standard / P1	15–45 s	Support triage, form-backed workflows
Async / P2	2–15 min	Webhook callback, email draft, batch enrich
Offline / P3	Hours	Indexing, eval sweeps, cron rebuilds

Each profile specifies the total deadline, whether queue wait counts, the minimum reserve for final synthesis, and which degradation ladder step applies at each consumption threshold (for example 50%, 80%, and 100% of budget). Profiles must align with tenant plan tiers so enterprise keys cannot inherit free-tier budgets by mistake.
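One way to represent such profiles is a small immutable record validated at load time. This is a sketch only: the class name, field names, budget values, and the plan-to-profile mapping are all illustrative assumptions, not a description of any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloProfile:
    name: str
    total_budget_ms: int
    queue_wait_counts: bool
    synthesis_reserve_ms: int
    ladder_thresholds: tuple = (0.5, 0.8, 1.0)  # budget fractions that trigger ladder steps

    def __post_init__(self) -> None:
        # A reserve that equals or exceeds the budget leaves no time for tools at all.
        if not 0 <= self.synthesis_reserve_ms < self.total_budget_ms:
            raise ValueError("synthesis reserve must fit inside the total budget")


# Illustrative values chosen from within the ranges in the table above.
PROFILES = {
    "P0": SloProfile("interactive", 8_000, True, 2_000),
    "P1": SloProfile("standard", 30_000, True, 7_500),
    "P2": SloProfile("async", 300_000, False, 60_000),
}

# Hypothetical mapping from tenant plan to profile.
PLAN_TO_PROFILE = {"enterprise": "P0", "team": "P1", "free": "P2"}


def profile_for_plan(plan: str) -> SloProfile:
    """Fail loudly on an unknown plan instead of silently falling back to a default."""
    return PROFILES[PLAN_TO_PROFILE[plan]]


print(profile_for_plan("enterprise").total_budget_ms)  # 8000
```

Raising on an unknown plan, rather than defaulting, is one way to keep an enterprise key from quietly inheriting the wrong budget.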

Deadline tokens and propagation

At admission, the gateway mints a deadline token: { run_id, deadline_unix_ms, profile, remaining_ms }. Every downstream hop — model router, RAG retriever, tool executor, subagent worker — receives the token and must:

Read remaining_ms before starting work.
Set local timeout to min(step_default, remaining_ms - reserve).
Return unused budget or charge actual elapsed time back to the parent.
Refuse new tool calls when remaining_ms < tool_floor (for example, 800 ms remaining is not enough for a CRM round-trip).
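The rules above can be sketched in a few lines. In this sketch remaining_ms is derived from deadline_unix_ms rather than stored in the token, which is an assumption made here to avoid stale values; it relies on reasonably synchronized clocks across hops. The helper names and the 1,500 ms tool floor are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeadlineToken:
    run_id: str
    deadline_unix_ms: int
    profile: str

    def remaining_ms(self, now_ms: Optional[int] = None) -> int:
        """Budget left on the wall clock, never negative."""
        now = int(time.time() * 1000) if now_ms is None else now_ms
        return max(0, self.deadline_unix_ms - now)


def step_timeout_ms(token: DeadlineToken, step_default_ms: int,
                    reserve_ms: int, now_ms: Optional[int] = None) -> int:
    """Local timeout = min(step_default, remaining_ms - reserve), floored at zero."""
    return max(0, min(step_default_ms, token.remaining_ms(now_ms) - reserve_ms))


def may_start_tool(token: DeadlineToken, tool_floor_ms: int,
                   now_ms: Optional[int] = None) -> bool:
    """Refuse new tool calls once remaining_ms drops below the floor."""
    return token.remaining_ms(now_ms) >= tool_floor_ms


# Worked example with a fixed clock so the numbers are reproducible.
t0 = 1_700_000_000_000
token = DeadlineToken(run_id="run-1", deadline_unix_ms=t0 + 8_000, profile="P0")
print(step_timeout_ms(token, 12_000, 2_000, now_ms=t0 + 5_000))  # 1000
print(may_start_tool(token, 1_500, now_ms=t0 + 7_200))           # False (800 ms left)
```

With 3 seconds left and a 2-second synthesis reserve, the tool's 12-second default shrinks to a 1-second timeout, which is the behavior a static per-tool timeout cannot provide.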

Subagents spawned via delegation receive a sub-budget carved from the parent, never the full parent clock. The parent typically reserves 20–35% of the total budget for synthesis so the orchestrator can still draft an answer after tools return or fail.
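A sub-budget calculation might look like the sketch below. The function name, the 25% default reserve, and the even split across sequential children are assumptions for illustration; a real orchestrator could weight children differently or let parallel children share the same window.

```python
def sub_budget_ms(total_budget_ms: int, remaining_ms: int,
                  synthesis_reserve_fraction: float = 0.25,
                  sequential_children: int = 1) -> int:
    """Budget each child may spend after the parent sets aside its synthesis reserve."""
    if not 0.0 <= synthesis_reserve_fraction < 1.0:
        raise ValueError("reserve fraction must be in [0, 1)")
    if sequential_children < 1:
        raise ValueError("need at least one child")
    reserve_ms = int(total_budget_ms * synthesis_reserve_fraction)
    delegable_ms = max(0, remaining_ms - reserve_ms)
    # Children that run one after another must split the delegable time.
    return delegable_ms // sequential_children


# 8 s total, 6 s left, 25% reserved for synthesis, two subagents run in sequence.
print(sub_budget_ms(8_000, 6_000, 0.25, sequential_children=2))  # 2000
```

Note that the reserve is computed from the total budget, not from what remains, so it does not shrink as the run progresses.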

Propagation must be cooperative: a hard kill mid-tool without cleanup can orphan CRM writes. Pair deadlines with cancel tokens from the cancellation FSM so tools abort in-flight HTTP requests and roll back partial side effects where possible.
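As a rough illustration of cooperative cancellation, the sketch below uses Python's asyncio task cancellation as a stand-in for the cancel token; the tool, its log entries, and the rollback step are hypothetical. The point is that the tool gets a chance to clean up before the caller reports the deadline.

```python
import asyncio


async def crm_refund_tool(log: list) -> str:
    """Hypothetical tool with a side effect that must be undone if cancelled."""
    log.append("write_started")
    try:
        await asyncio.sleep(5)  # stands in for a slow CRM HTTP call
        log.append("write_committed")
        return "refund_ok"
    except asyncio.CancelledError:
        log.append("rolled_back")  # cleanup runs before the task exits
        raise  # re-raise so the caller sees the cancellation


async def run_with_deadline(remaining_ms: int, log: list) -> str:
    try:
        # wait_for cancels the inner task on timeout and waits for it to finish unwinding.
        return await asyncio.wait_for(crm_refund_tool(log), timeout=remaining_ms / 1000)
    except asyncio.TimeoutError:
        return "DEADLINE_EXCEEDED"


log = []
result = asyncio.run(run_with_deadline(50, log))
print(result, log)  # DEADLINE_EXCEEDED ['write_started', 'rolled_back']
```

Rollback inside the cancellation handler only helps if the downstream write is reversible or idempotent, which is why the two concerns are usually designed together.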

Soft vs hard cutoffs

Soft cutoff (warning threshold, often 70–85% budget): stop scheduling new tools, skip optional retrieval, switch to a smaller model tier, or begin streaming a partial answer with a visible “still checking…” footer. Users see progress: the SLA clock is still running, but perceived latency improves.

Hard cutoff (100% budget): cancel pending tools, finalize with best-effort context, return structured DEADLINE_EXCEEDED with whatever tokens were already streamed. Never leave the client hanging until TCP idle timeout.

For streaming delivery, soft cutoff should flush buffered tokens immediately; hard cutoff sends a terminal SSE event with finish_reason: deadline and links to async continuation if the profile allows webhook completion.
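A terminal frame for the hard cutoff could be built as in the sketch below. The event name done, the payload field names other than finish_reason, and the continuation URL are assumptions for illustration; only the event:/data: framing with a blank-line terminator comes from the Server-Sent Events format itself.

```python
import json
from typing import Optional


def terminal_sse_event(run_id: str, continuation_url: Optional[str] = None) -> str:
    """Build the last SSE frame of a run that hit its hard cutoff."""
    payload = {
        "run_id": run_id,
        "finish_reason": "deadline",
        "error": "DEADLINE_EXCEEDED",
    }
    if continuation_url is not None:
        # Only profiles that allow webhook completion would include this link.
        payload["continuation_url"] = continuation_url
    # json.dumps emits a single line by default, so one data: field is enough.
    return f"event: done\ndata: {json.dumps(payload)}\n\n"


frame = terminal_sse_event("run-1", "https://example.com/runs/run-1/continuation")
print(frame, end="")
```

Sending an explicit terminal frame lets the client stop its spinner immediately instead of waiting for the connection to idle out.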

Degradation ladder

A practical ladder for interactive profiles:

50% budget — drop low-value tools (sentiment scoring, duplicate search); cap RAG chunks to top-3.
70% budget — route to faster model; disable reflection loops from self-critique pipelines.
85% budget (soft) — stream draft answer from current context; mark tools as skipped in the trace.
100% budget (hard) — stop all side effects; append honest limitation text; offer async follow-up for P1/P2.

Ladders should be deterministic per profile version, logged in the audit trail, and covered by golden tests so marketing SLAs match runtime behavior after prompt or tool registry changes.
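A deterministic ladder can be as simple as an ordered table plus a pure function, which also makes golden tests trivial. In this sketch the step names and the versioned constant are hypothetical; thresholds are integer percentages so that comparisons avoid floating-point edge cases.

```python
# (percent of budget consumed, step name), in ascending order.
INTERACTIVE_LADDER_V1 = (
    (50, "drop_low_value_tools"),
    (70, "faster_model"),
    (85, "soft_cutoff_stream_draft"),
    (100, "hard_cutoff"),
)


def active_steps(elapsed_ms: int, budget_ms: int,
                 ladder: tuple = INTERACTIVE_LADDER_V1) -> list:
    """Return every ladder step whose threshold has been reached."""
    return [step for pct, step in ladder if elapsed_ms * 100 >= pct * budget_ms]


print(active_steps(7_000, 8_000))
# ['drop_low_value_tools', 'faster_model', 'soft_cutoff_stream_draft']

# Golden-test style assertions that pin the behavior of this ladder version.
assert active_steps(3_000, 8_000) == []
assert active_steps(8_000, 8_000)[-1] == "hard_cutoff"
```

Because the function has no hidden state, the same elapsed time and profile version always yield the same steps, which is what makes the audit trail reproducible.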

Measuring SLA correctly

Harbor Desk's false 94% compliance came from measuring the wrong interval. Correct metrics:

SLA start — request received at edge (or job admitted for async), not when worker picks up the run.
SLA end — last user-visible token or terminal error frame, not last model log line.
Attribution — slice misses by queue wait, retrieval, tool name, model tier, and subagent depth.
Exclusions — document client disconnects and user-initiated cancels separately; do not count them as passes.

Emit OpenTelemetry spans with sla.profile, sla.budget_ms, sla.remaining_at_complete, and sla.degradation_step so SRE can distinguish “need more GPUs” from “need shorter tool chains.”
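The measurement rules can be sketched as two small pure functions. The function and parameter names are hypothetical; the attribute keys are the ones listed above. The returned dictionary is plain data that could be attached to a span by whatever tracing API is in use.

```python
def sla_outcome(edge_received_ms: int, last_visible_frame_ms: int,
                budget_ms: int, client_cancelled: bool = False) -> str:
    """Classify a run from edge receipt to the last user-visible frame."""
    if client_cancelled:
        return "excluded"  # tracked separately, never counted as a pass
    elapsed_ms = last_visible_frame_ms - edge_received_ms
    return "pass" if elapsed_ms <= budget_ms else "miss"


def sla_span_attributes(profile: str, budget_ms: int, edge_received_ms: int,
                        last_visible_frame_ms: int, degradation_step: str) -> dict:
    """Attributes for the end-to-end span, using the keys named in the text."""
    elapsed_ms = last_visible_frame_ms - edge_received_ms
    return {
        "sla.profile": profile,
        "sla.budget_ms": budget_ms,
        "sla.remaining_at_complete": max(0, budget_ms - elapsed_ms),
        "sla.degradation_step": degradation_step,
    }


print(sla_outcome(0, 7_200, 8_000))    # pass
print(sla_outcome(0, 34_000, 8_000))   # miss
print(sla_span_attributes("P0", 8_000, 0, 7_200, "soft_cutoff")["sla.remaining_at_complete"])  # 800
```

Keeping the classification separate from the span emission makes it straightforward to replay historical timestamps and recompute compliance under a corrected definition.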

Harbor Desk refactor

Root causes beyond the single-interval metric:

No parent deadline: tools used static 12 s timeouts.
Sequential tool chain: the planner always called all three CRM tools one after another, even when the first returned enough context.
No streaming until the final answer: users waited for the full chain before seeing text.
Queue wait excluded: P0 bursts added 4–9 s of invisible delay.

Shipped fixes:

Edge gateway mints deadline tokens; queue wait debits budget on admit.
Tool planner receives remaining_ms and selects a minimal tool set; parallel fan-out when budget allows.
Soft cutoff at 6 s streams draft reply; hard cutoff at 8 s for P0.
Async P1 webhook path, resumed via continuation_token, for runs that hit the hard cutoff.
Dashboard rebuilt on end-to-end span; model-only chart retained as diagnostic, not SLA source of truth.
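A budget-aware planner of the kind described in the fixes above might be sketched as follows. The function name and the per-tool latency estimates are hypothetical, and the sketch covers only the budget side of tool selection; deciding that the first tool already returned enough context would need a separate check.

```python
def plan_tools(candidates: list, remaining_ms: int, reserve_ms: int,
               parallel: bool) -> list:
    """Pick tools in priority order while the estimated cost fits the budget.

    candidates: (tool_name, estimated_latency_ms) pairs, most valuable first.
    """
    available_ms = remaining_ms - reserve_ms
    selected, spent_ms = [], 0
    for name, est_ms in candidates:
        # Parallel calls overlap, so the cost is the slowest call; sequential calls add up.
        cost_ms = max(spent_ms, est_ms) if parallel else spent_ms + est_ms
        if cost_ms <= available_ms:
            selected.append(name)
            spent_ms = cost_ms
    return selected


crm_tools = [  # latency estimates are illustrative, not measured values
    ("account_lookup", 1_500),
    ("ticket_history", 2_500),
    ("refund_eligibility", 3_000),
]
print(plan_tools(crm_tools, remaining_ms=6_000, reserve_ms=2_000, parallel=False))
# ['account_lookup', 'ticket_history']
print(plan_tools(crm_tools, remaining_ms=6_000, reserve_ms=2_000, parallel=True))
# ['account_lookup', 'ticket_history', 'refund_eligibility']
```

With 4 seconds available after the reserve, the sequential plan has room for two tools while the parallel plan fits all three, which illustrates why fan-out matters under a tight budget.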

Customer-measured p95 latency dropped from 34 s to 7.2 s. SLA miss rate fell from 51% to 6.4%. Partial streaming cut perceived wait by 40% even on runs that eventually used the full budget.

Technique decision table

Approach	Best for	Weak when
Per-call timeout only	Single-step chat, no tools	Multi-tool or multi-agent chains
Priority queue admission	Protecting P0 from batch floods	Runs start on time but exceed end budget
End-to-end deadline token	Contractual latency SLAs, tool-heavy agents	Offline jobs with no user wait
Soft cutoff + partial stream	Interactive UX under tight budgets	Regulated outputs requiring full verification
Async continuation webhook	P1 workflows that sometimes need depth	Users expect fully synchronous chat
Smaller model on degradation	Latency recovery without cancel	Quality floor is non-negotiable

Common pitfalls

SLA measured on model only — tools and queue wait hidden; false green dashboards.
Sum of per-step timeouts > SLA — each step passes locally while the user waits 30+ seconds.
No synthesis reserve — tools consume 100% budget; orchestrator has zero ms to draft.
Hard kill without cancel propagation — orphaned writes and duplicate retries.
Same budget for all tenants — enterprise SLA on free-tier clock.
Degradation ladders not in CI — prompt deploy re-enables expensive tools silently.
Ignoring client RTT — mobile regions need edge streaming earlier than datacenter lab tests suggest.

Production checklist

Define SLO profiles with total budget, queue-wait policy, and reserves.
Mint deadline tokens at ingress; propagate to every tool and subagent.
Set each step timeout to min(default, remaining - reserve).
Implement soft and hard cutoffs with streaming partial results.
Log degradation step and remaining budget on every run completion.
Measure SLA on end-to-end user-visible latency, not model latency alone.
Align profiles with tenant plan tiers and signed API metadata.
Pair hard cutoff with cooperative cancellation and idempotent tools.
Offer async continuation for profiles that allow webhook completion.
Alert when the p95 remaining budget at tool start drops below 20% for P0.
Load-test chained tools under P0 budget, not isolated model calls.

Key takeaways

SLA enforcement is an end-to-end wall-clock contract, not a per-model timeout.
Deadline tokens propagate remaining budget across tools, retrieval, and subagents.
Soft cutoffs improve perceived latency; hard cutoffs prevent hung clients.
Admission control and rate limits do not replace deadline enforcement.
Harbor Desk cut SLA misses from 51% to 6.4% by measuring and enforcing the full trajectory.