Guide
LLM agent streaming response delivery systems explained
Harbor Support shipped a tier-one agent that could look up orders, issue refunds, and draft replies — but the web client waited for the entire run to finish before rendering anything. Median time-to-first-byte was 8.2 seconds because the orchestrator blocked on three sequential tool calls before returning JSON. Session analytics showed 34% of users abandoned the chat during that silent window; mobile was worse at 41%. Product assumed the model was slow; tracing showed users simply could not tell whether anything was happening.
Streaming response delivery for LLM agents means pushing typed, incremental events to the client as the run progresses (token deltas for assistant text, explicit frames for tool start/end, heartbeat pings, and terminal status) instead of one opaque HTTP response at the end. This guide covers transport choice; event schemas; orchestration integration with cancellation, timeouts, and observability spans; the Harbor Support refactor; a technique decision table; pitfalls; and a production checklist.
Why agents need more than token streaming
Chat-completions streaming solves half the problem: users see words appear while the model decodes. Agent runs add long silent gaps while tools execute, subagents delegate, or retrieval indexes warm up. A UI that only streams final assistant tokens still feels broken when the model is not generating text at all.
Production agent streaming therefore multiplexes several event families on one connection:
- Run lifecycle: run.started, run.step, run.completed, run.failed with stable run and step IDs for tracing.
- Assistant text: message.delta chunks tied to a message ID so clients can append without repainting.
- Tool visibility: tool.started with sanitized name and argument summary; tool.completed with truncated observation preview (full payload stays server-side per observation budgets).
- Human gates: approval.required when a dangerous tool blocks on operator consent.
- Keepalive: comment pings or empty heartbeat events every 15–30s so proxies do not close idle connections during slow sandboxes.
The goal is perceived progress, not premature commitment. Clients
should render provisional UI; the server remains authoritative until
run.completed ships the final structured result.
Transport layer: SSE, WebSocket, and chunked HTTP
Server-Sent Events (SSE)
SSE over HTTP/1.1 or HTTP/2 is the default for browser-facing agents.
It offers a one-directional server→client stream, automatic reconnect in
the EventSource API, simple load-balancer compatibility, and native
support in many API gateways. Encode each logical event as an event: line
followed by a data: line carrying JSON. Set Cache-Control: no-cache,
disable buffering at nginx (X-Accel-Buffering: no), and cap
connection duration below proxy idle timeouts or rotate with resume tokens.
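The framing described above can be sketched in a few lines. This is a minimal illustration, not a full server: the helper names (format_sse, format_keepalive, SSE_HEADERS) are hypothetical. It also assumes the envelope's seq is reused as the SSE id: field, which is what lets a browser send Last-Event-ID on reconnect.

```python
import json


def format_sse(event: dict) -> str:
    """Serialize one envelope as an SSE frame: id, event, data, blank line."""
    # Compact JSON never contains a raw newline, so one data: line is enough.
    data = json.dumps(event, separators=(",", ":"))
    return f"id: {event['seq']}\nevent: {event['type']}\ndata: {data}\n\n"


def format_keepalive() -> str:
    """SSE comment line: clients ignore it, proxies still see traffic."""
    return ": ping\n\n"


# Response headers recommended in the text for an SSE route.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # asks nginx not to buffer this response
}

frame = format_sse({"v": 1, "seq": 1, "run_id": "run_8f3a", "type": "run.started"})
print(frame, end="")
```

The same dictionary serialized with json.dumps plus a trailing newline would give the NDJSON framing, so only this one function changes between transports.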
WebSocket
Prefer WebSockets when the client must send cancel signals, approval clicks, or mid-run corrections on the same socket without opening a second POST channel. Bidirectional traffic costs more operational complexity: sticky sessions, connection limits per pod, and explicit ping/pong heartbeats. Use for embedded copilots and desktop apps; SSE plus a separate cancel POST is often enough for read-mostly support bots.
Chunked JSON lines
NDJSON over chunked transfer works well for CLI and server-to-server consumers. Same event schema as SSE; only the framing differs. Avoid mixing formats per endpoint — one canonical envelope simplifies client SDKs and middleware hooks that tap the outbound stream.
Event schema and ordering guarantees
Define a versioned envelope every consumer understands:
{
  "v": 1,
  "seq": 42,
  "run_id": "run_8f3a",
  "type": "message.delta",
  "ts": "2026-06-12T12:00:01.234Z",
  "payload": { "message_id": "msg_1", "text": "Checking " }
}
Monotonic seq per run lets clients detect gaps after
reconnect. The orchestrator should emit from a single writer goroutine
per run to preserve order; parallel tool execution still serializes
outbound events through that writer. Never interleave unrelated runs on
one SSE connection — one run per stream, or multiplex with explicit
run_id filtering on a shared admin feed.
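One way to get the single-writer property is a per-run queue drained by one task. The sketch below uses Python's asyncio in place of a goroutine. The RunStream class and its sent list (standing in for the real connection write) are assumptions for illustration only.

```python
import asyncio
from datetime import datetime, timezone


class RunStream:
    """Serializes all events for one run through a single writer task."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self._queue: asyncio.Queue = asyncio.Queue()
        self._seq = 0
        self.sent: list[dict] = []  # stands in for the real connection write

    async def emit(self, type_: str, payload: dict) -> None:
        """Called by any producer (model stream, parallel tool tasks)."""
        await self._queue.put((type_, payload))

    async def close(self) -> None:
        await self._queue.put(None)  # sentinel: tells the writer to stop

    async def writer(self) -> None:
        """The only place seq is assigned, so seq order equals write order."""
        while (item := await self._queue.get()) is not None:
            type_, payload = item
            self._seq += 1
            self.sent.append({
                "v": 1, "seq": self._seq, "run_id": self.run_id, "type": type_,
                "ts": datetime.now(timezone.utc).isoformat(),
                "payload": payload,
            })


async def demo() -> list[dict]:
    stream = RunStream("run_8f3a")
    writer = asyncio.create_task(stream.writer())

    async def tool(name: str, delay: float) -> None:
        await stream.emit("tool.started", {"name": name})
        await asyncio.sleep(delay)
        await stream.emit("tool.completed", {"name": name})

    # Two tools run in parallel; their events still leave in one ordered stream.
    await asyncio.gather(tool("order_lookup", 0.02), tool("inventory", 0.01))
    await stream.close()
    await writer
    return stream.sent


events = asyncio.run(demo())
print([(e["seq"], e["type"]) for e in events])
```

Because producers only enqueue and never touch seq, adding more parallel tools does not require any locking around the sequence counter.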
Terminal events must be idempotent: repeat run.completed on
reconnect with the same payload hash so clients that missed the first
close still finalize state. Pair with durable run records from
checkpointing so late joiners can hydrate from storage if the stream
already ended.
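On the client side, monotonic seq and idempotent terminal events reduce to a small state machine. The following is a sketch under assumptions: RunView and its return values are hypothetical names, and a real client would react to "gap" by reconnecting with since_seq set to its last applied sequence number.

```python
from dataclasses import dataclass

TERMINAL = {"run.completed", "run.failed", "run.cancelled"}


@dataclass
class RunView:
    """Client-side state for one run, rebuilt from envelopes."""
    last_seq: int = 0
    text: str = ""
    status: str = "running"

    def apply(self, event: dict) -> str:
        """Apply one envelope; returns 'applied', 'duplicate', or 'gap'."""
        seq = event["seq"]
        if seq <= self.last_seq:
            return "duplicate"  # replayed after reconnect, e.g. a repeated terminal
        if seq != self.last_seq + 1:
            return "gap"  # state untouched; caller should request replay
        self.last_seq = seq
        if event["type"] == "message.delta":
            self.text += event["payload"]["text"]
        elif event["type"] in TERMINAL:
            self.status = event["type"]
        return "applied"


view = RunView()
stream = [
    {"seq": 1, "type": "message.delta", "payload": {"text": "Checking "}},
    {"seq": 2, "type": "message.delta", "payload": {"text": "order"}},
    {"seq": 3, "type": "run.completed", "payload": {}},
    {"seq": 3, "type": "run.completed", "payload": {}},  # repeated on reconnect
]
results = [view.apply(e) for e in stream]
print(results, view.text, view.status)
```

Refusing to apply events past a gap keeps the view consistent: the missing range can be replayed later without being mistaken for duplicates.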
Backpressure, buffering, and slow clients
Unbounded in-memory buffers per connection will OOM your API tier when mobile clients background the tab. Track writable bytes (or channel depth) per subscriber:
- Coalesce token deltas: batch sub-50ms text chunks server-side before write; this cuts event count 5–10× with negligible UX loss.
- Drop policy: for analytics-only taps, sample events; for user-visible streams, apply backpressure and stop reading from the model provider stream when the buffer exceeds a threshold. Provider streaming APIs rarely expose an explicit pause/resume call, so in practice this means not consuming further chunks until the buffer drains, or cancelling the run if the buffer stays full.
- Hard cap: close connections that fall more than N seconds behind; the client reconnects with Last-Event-ID or a since_seq query param.
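Delta coalescing can be illustrated as a pure function over timestamped chunks. This is a simplified sketch: the coalesce name and the (arrival_time_ms, text) input shape are assumptions. It also flushes only when the next chunk arrives, whereas a live server would add a timer so a trailing batch is not held while the model pauses.

```python
from typing import Iterable, Iterator


def coalesce(
    chunks: Iterable[tuple[float, str]], window_ms: float = 50.0
) -> Iterator[str]:
    """Merge text chunks whose arrival times fall within one window.

    A batch is flushed when a chunk arrives window_ms or more after the
    first chunk of that batch.
    """
    batch: list[str] = []
    batch_start = 0.0
    for arrived_ms, text in chunks:
        if batch and arrived_ms - batch_start >= window_ms:
            yield "".join(batch)
            batch = []
        if not batch:
            batch_start = arrived_ms
        batch.append(text)
    if batch:
        yield "".join(batch)  # flush the tail when the provider stream ends


tokens = [(0, "Check"), (12, "ing "), (31, "order "), (58, "#48"), (70, "291"), (140, ".")]
merged = list(coalesce(tokens))
print(merged)
```

Here six provider chunks become three writes. For the backpressure half, a bounded asyncio.Queue(maxsize=N) between the provider reader and the connection writer is one simple option, since await queue.put(...) blocks the reader once the subscriber falls N items behind.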
Rate-limit concurrent open streams per user and IP through your existing quota layers so scrapers cannot hold thousands of idle SSE sockets.
Cancellation, errors, and partial delivery
Users click Stop mid-stream. Wire cancel to the same run ID the SSE
connection carries: client sends POST /runs/{id}/cancel or a
WebSocket cancel frame; orchestrator aborts in-flight provider
streams and tool sandboxes, then emits run.cancelled with the
last committed sequence number. Do not leave connections hanging without a
terminal event — proxies and client SDKs depend on explicit closure.
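The cancel path can be sketched with asyncio task cancellation. Everything here is illustrative: run_agent, the emit callback, and the last_committed_seq payload key are hypothetical names, and the sleep stands in for an in-flight provider stream or tool sandbox.

```python
import asyncio


async def run_agent(run_id: str, emit) -> None:
    """Hypothetical run body that always ends with exactly one terminal event."""
    seq = 0

    async def send(type_: str, payload: dict) -> None:
        nonlocal seq
        seq += 1
        await emit({"v": 1, "seq": seq, "run_id": run_id, "type": type_, "payload": payload})

    await send("run.started", {})
    try:
        await send("tool.started", {"label": "Looking up order"})
        await asyncio.sleep(10)  # stands in for a slow tool or provider stream
        await send("run.completed", {})
    except asyncio.CancelledError:
        # Cancelled mid-tool: report the last committed seq, then propagate.
        await send("run.cancelled", {"last_committed_seq": seq})
        raise
    except Exception as exc:
        await send("run.failed", {"code": type(exc).__name__})


async def demo() -> list[dict]:
    out: list[dict] = []

    async def emit(event: dict) -> None:
        out.append(event)

    task = asyncio.create_task(run_agent("run_8f3a", emit))
    await asyncio.sleep(0.05)  # user clicks Stop while the tool is running
    task.cancel()  # what a POST /runs/{id}/cancel handler would trigger
    try:
        await task
    except asyncio.CancelledError:
        pass
    return out


events = asyncio.run(demo())
print([e["type"] for e in events])
```

Emitting the terminal event inside the cancellation handler, before re-raising, is what guarantees the stream never ends silently even when the run is aborted mid-tool.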
On tool failure, stream tool.failed with a redacted error
code before the model recovery turn so the UI shows which step broke.
If the model retries internally, emit a new run.step rather
than rewriting history; clients append a timeline instead of mutating
past rows.
Harbor Support refactor (worked example)
Harbor’s first agent API returned 200 with a complete
JSON blob. Support reps on hotel Wi-Fi saw frozen spinners through order
lookup (2.1s), inventory check (3.4s), and refund eligibility (2.7s)
before any assistant text. Abandon telemetry blamed “model
latency.”
The rebuild introduced:
- SSE endpoint GET /runs/{id}/events opened immediately after POST /runs returned run_id.
- Instant run.started plus tool.started frames with human labels (“Looking up order #48291”) instead of raw JSON arguments.
- Token streaming only on the final drafting step; earlier steps showed progress bars, not partial model output.
- Last-Event-ID replay for 60s after disconnect.
- Cancel button wired to the shared cancellation service.
Median time-to-first-visible-progress dropped from 8.2s to 180ms. Session abandon fell from 34% to 6%. Median handle time was unchanged: streaming did not make tools faster, it made waits tolerable.
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Browser support chat | SSE + cancel POST | Blocking JSON until run ends |
| IDE copilot with edits | WebSocket bidirectional | Polling run status |
| CLI / batch worker | NDJSON chunked HTTP | SSE without reconnect story |
| Long sandbox tools (30s+) | Tool progress + heartbeats | Silent gap then text dump |
| Regulated transcripts | Stream + durable event log | Ephemeral-only buffers |
| High fan-out dashboards | Sampled read-only SSE | Full token mirror per admin |
| Mobile flaky networks | Seq replay + terminal idempotency | Assume single uninterrupted socket |
Common pitfalls
- Streaming raw tool arguments: leaks PII and confuses users; sanitize labels server-side.
- No terminal event on error: clients spin forever; always emit run.failed or run.cancelled.
- Proxy buffering: nginx batches SSE until buffer fills; disable buffering on agent routes.
- Mixing runs on one connection: racy UI updates when tabs share a session.
- Unbounded seq replay: reconnect storms replay megabytes; cap replay window and fall back to snapshot fetch.
- Streaming before authZ: open SSE only after run ownership is verified.
- Treating deltas as commits: business actions still require tool success events, not partial text.
Production checklist
- Versioned event envelope with monotonic seq per run.
- SSE (or WebSocket) opens before first tool executes; immediate progress frame.
- Separate event types for text, tools, approvals, errors, and terminal status.
- Disable reverse-proxy buffering; send keepalive during long tools.
- Coalesce token deltas; apply per-connection backpressure caps.
- Cancel endpoint aborts provider stream and sandboxes; emit terminal event.
- Support Last-Event-ID / since_seq replay with TTL.
- Persist final run snapshot for clients that connect after stream ends.
- Redact tool args and observations in outbound events.
- Trace stream write latency and abandon rate alongside model TTFT.
Key takeaways
- Agent streaming is progress streaming — tools need visible frames, not just tokens.
- SSE is the browser default; WebSockets when bidirectional control is first-class.
- Ordering and terminal idempotency make reconnect safe on flaky networks.
- Backpressure protects the API tier from slow or abandoned tabs.
- Harbor Support cut abandon from 34% to 6% without speeding up tools — only by showing work earlier.
Related reading
- LLM agent cancellation, timeout and lifecycle explained — stop signals tied to run IDs
- LLM agent observability and tracing explained — span timing for TTFT and tool gaps
- LLM agent middleware hook pipeline explained — tap outbound events once
- LLM agent rate limiting and throttling explained — cap concurrent open streams