LLM agent streaming response delivery systems explained

Harbor Support shipped a tier-one agent that could look up orders, issue refunds, and draft replies — but the web client waited for the entire run to finish before rendering anything. Median time-to-first-byte was 8.2 seconds because the orchestrator blocked on three sequential tool calls before returning JSON. Session analytics showed 34% of users abandoned the chat during that silent window; mobile was worse at 41%. Product assumed the model was slow; tracing showed users simply could not tell whether anything was happening.

Streaming response delivery for LLM agents means pushing typed, incremental events to the client as the run progresses (token deltas for assistant text, explicit frames for tool start/end, heartbeat pings, and terminal status) instead of one opaque HTTP response at the end. This guide covers transport choice, event schemas, integration with orchestration (cancellation, timeouts, and observability spans), the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why agents need more than token streaming

Chat-completions streaming solves half the problem: users see words appear while the model decodes. Agent runs add long silent gaps while tools execute, subagents delegate, or retrieval indexes warm up. A UI that only streams final assistant tokens still feels broken when the model is not generating text at all.

Production agent streaming therefore multiplexes several event families on one connection:

  • Run lifecycle — run.started, run.step, run.completed, run.failed with stable run and step IDs for tracing.
  • Assistant text — message.delta chunks tied to a message ID so clients can append without repainting.
  • Tool visibility — tool.started with sanitized name and argument summary; tool.completed with truncated observation preview (full payload stays server-side per observation budgets).
  • Human gates — approval.required when a dangerous tool blocks on operator consent.
  • Keepalive — comment pings or empty heartbeat events every 15–30s so proxies do not close idle connections during slow sandboxes.

The goal is perceived progress, not premature commitment. Clients should render provisional UI; the server remains authoritative until run.completed ships the final structured result.

Transport layer: SSE, WebSocket, and chunked HTTP

Server-Sent Events (SSE)

SSE over HTTP/1.1 or HTTP/2 is the default for browser-facing agents. It offers a one-directional server→client stream, automatic reconnect through the EventSource API, simple load-balancer compatibility, and native support in many API gateways. Encode each logical event as an event: line followed by a data: line carrying JSON, terminated by a blank line. Set Cache-Control: no-cache, disable buffering at nginx (X-Accel-Buffering: no), and cap connection duration below proxy timeouts or rotate connections with resume tokens.
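As a rough illustration of the framing described above, the following sketch serializes one event into an SSE frame and produces a comment-style keepalive. The function names and the SSE_HEADERS constant are hypothetical; only the wire format (id:, event:, data: lines, a blank-line terminator, and comment lines starting with a colon) comes from the SSE specification.

```python
import json
from typing import Any, Optional


def format_sse(event_type: str, data: Any, event_id: Optional[int] = None) -> str:
    """Serialize one logical event as an SSE frame."""
    lines = []
    if event_id is not None:
        # The browser echoes this value back as Last-Event-ID on reconnect.
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event_type}")
    # json.dumps escapes newlines inside strings, so the payload stays on one data: line.
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def format_heartbeat() -> str:
    """SSE comment line: ignored by EventSource, but proxies see traffic."""
    return ": ping\n\n"


# Response headers the text recommends for SSE routes (hypothetical constant name).
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
}
```

Using the seq value as the SSE id is one possible design choice: it lets the browser's automatic reconnect carry the resume position without extra client code.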

WebSocket

Prefer WebSockets when the client must send cancel signals, approval clicks, or mid-run corrections on the same socket without opening a second POST channel. Bidirectional traffic costs more operational complexity: sticky sessions, connection limits per pod, and explicit ping/pong heartbeats. Use for embedded copilots and desktop apps; SSE plus a separate cancel POST is often enough for read-mostly support bots.

Chunked JSON lines

NDJSON over chunked transfer works well for CLI and server-to-server consumers. Same event schema as SSE; only the framing differs. Avoid mixing formats per endpoint — one canonical envelope simplifies client SDKs and middleware hooks that tap the outbound stream.
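To make the "same schema, different framing" point concrete, here is a minimal sketch of NDJSON encoding and of a consumer that reassembles envelopes from arbitrary chunk boundaries. The helper names are hypothetical, and the sketch assumes each envelope is a JSON object serialized onto exactly one line.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def to_ndjson_line(envelope: Dict[str, Any]) -> bytes:
    """One envelope per line; json.dumps never emits a raw newline."""
    return (json.dumps(envelope, separators=(",", ":")) + "\n").encode("utf-8")


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield envelopes from a chunked byte stream, whatever the chunk boundaries."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; a partial line waits for the next chunk.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
```

A CLI consumer could feed this from any HTTP client that exposes the response body as an iterator of byte chunks.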

Event schema and ordering guarantees

Define a versioned envelope every consumer understands:

{
  "v": 1,
  "seq": 42,
  "run_id": "run_8f3a",
  "type": "message.delta",
  "ts": "2026-06-12T12:00:01.234Z",
  "payload": { "message_id": "msg_1", "text": "Checking " }
}

A monotonic seq per run lets clients detect gaps after reconnect. The orchestrator should emit from a single writer per run (one goroutine or task) to preserve order; parallel tool execution still serializes outbound events through that writer. Do not interleave unrelated runs on a user-facing SSE connection: use one run per stream, and reserve multiplexing with explicit run_id filtering for a shared admin feed.
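A minimal sketch of the single-writer idea and of client-side gap detection might look like the following. RunEventWriter and missing_seqs are hypothetical names, and the sketch assumes seq starts at 1 and that all events for a run pass through one writer instance.

```python
import itertools
from datetime import datetime, timezone
from typing import Any, Dict, List


class RunEventWriter:
    """Single writer per run: every outbound event takes the next seq."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self._seq = itertools.count(1)

    def envelope(self, event_type: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        now = datetime.now(timezone.utc)
        return {
            "v": 1,
            "seq": next(self._seq),
            "run_id": self.run_id,
            "type": event_type,
            # Millisecond UTC timestamp with a Z suffix, matching the sample envelope.
            "ts": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "payload": payload,
        }


def missing_seqs(last_seen: int, incoming: int) -> List[int]:
    """Client side: seq values skipped between the last applied event and a new one."""
    return list(range(last_seen + 1, incoming))
```

When missing_seqs returns a non-empty list, a client would typically request a replay from last_seen rather than apply the new event out of order.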

Terminal events must be idempotent: repeat run.completed on reconnect with the same payload hash so clients that missed the first close still finalize state. Pair with durable run records from checkpointing so late joiners can hydrate from storage if the stream already ended.
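One way to picture terminal idempotency on the client is a finalizer that accepts repeated terminal events only when their payload hash matches the first. This is a sketch under assumptions: the RunView class, the SHA-256 choice, and the canonical-JSON hashing scheme are illustrative, not something the text prescribes.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def payload_hash(payload: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RunView:
    """Client-side run state that finalizes at most once."""

    def __init__(self) -> None:
        self.final_hash: Optional[str] = None
        self.finalize_count = 0

    def on_terminal(self, payload: Dict[str, Any]) -> bool:
        """Return True on the first terminal event, False on a harmless repeat."""
        digest = payload_hash(payload)
        if self.final_hash == digest:
            return False  # duplicate delivered after reconnect; nothing to do
        if self.final_hash is not None:
            raise ValueError("conflicting terminal payload for the same run")
        self.final_hash = digest
        self.finalize_count += 1
        return True
```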

Backpressure, buffering, and slow clients

Unbounded in-memory buffers per connection will OOM your API tier when mobile clients background the tab. Track writable bytes (or channel depth) per subscriber:

  • Coalesce token deltas — batch sub-50ms text chunks server-side before write; cuts event count 5–10× with negligible UX loss.
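  • Drop policy — for analytics-only taps, sample events; for user-visible streams, apply backpressure and stop reading from the model provider stream when the buffer exceeds a threshold, so flow control slows the upstream. Provider streaming APIs generally offer no explicit pause/resume, so bound how long a read may stay stalled and cancel the request past that point.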
  • Hard cap — close connections that fall more than N seconds behind; client reconnects with Last-Event-ID or since_seq query param.
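The coalescing and hard-cap ideas in the list above can be sketched as follows. Both coalesce_deltas and Subscriber are hypothetical; the sketch assumes deltas carry an arrival time in milliseconds and that the transport reports how many bytes it has flushed.

```python
from typing import List, Tuple


def coalesce_deltas(deltas: List[Tuple[int, str]], window_ms: int = 50) -> List[Tuple[int, str]]:
    """Merge deltas that arrive within window_ms of the first delta in a batch.

    Input items are (arrival_ms, text); output items are (batch_start_ms, merged_text).
    """
    batches: List[Tuple[int, str]] = []
    for arrival_ms, text in deltas:
        if batches and arrival_ms - batches[-1][0] < window_ms:
            start, merged = batches[-1]
            batches[-1] = (start, merged + text)
        else:
            batches.append((arrival_ms, text))
    return batches


class Subscriber:
    """Tracks unsent bytes for one connection against a fixed cap."""

    def __init__(self, max_pending_bytes: int) -> None:
        self.max_pending_bytes = max_pending_bytes
        self.pending_bytes = 0

    def enqueue(self, frame: bytes) -> bool:
        """Return False when the cap would be exceeded; the caller pauses or closes."""
        if self.pending_bytes + len(frame) > self.max_pending_bytes:
            return False
        self.pending_bytes += len(frame)
        return True

    def flushed(self, byte_count: int) -> None:
        """Called when the transport confirms bytes were written to the socket."""
        self.pending_bytes = max(0, self.pending_bytes - byte_count)
```

A batch window measured from the first delta keeps worst-case added latency at window_ms, which is why it is used here instead of a sliding window.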

Rate-limit concurrent open streams per user and IP through your existing quota layers so scrapers cannot hold thousands of idle SSE sockets.
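A concurrent-stream cap can be as small as a counter per key. The sketch below is an in-memory, single-process illustration with a hypothetical StreamLimiter class; a multi-pod deployment would presumably need a shared store behind the same interface.

```python
from collections import defaultdict
from typing import DefaultDict


class StreamLimiter:
    """Caps concurrently open streams per key (a user ID or an IP address)."""

    def __init__(self, max_per_key: int) -> None:
        self.max_per_key = max_per_key
        self._open: DefaultDict[str, int] = defaultdict(int)

    def try_open(self, key: str) -> bool:
        """Reserve a slot; return False when the key is already at its cap."""
        if self._open[key] >= self.max_per_key:
            return False
        self._open[key] += 1
        return True

    def close(self, key: str) -> None:
        """Release a slot when the stream ends, whatever the reason."""
        if self._open[key] > 0:
            self._open[key] -= 1
```

Calling close from a finally block around the stream handler keeps slots from leaking when a client disconnects abruptly.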

Cancellation, errors, and partial delivery

Users click Stop mid-stream. Wire cancel to the same run ID the SSE connection carries: client sends POST /runs/{id}/cancel or a WebSocket cancel frame; orchestrator aborts in-flight provider streams and tool sandboxes, then emits run.cancelled with the last committed sequence number. Do not leave connections hanging without a terminal event — proxies and client SDKs depend on explicit closure.
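The cancel path can be sketched with asyncio: the run races its work against a cancel signal, aborts the in-flight step if the signal wins, and always ends with a terminal event. The names drive_run, slow_steps, and demo are hypothetical, and the asyncio.Event stands in for whatever the cancel endpoint would set.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Emit = Callable[[str, Dict[str, Any]], int]


async def drive_run(
    steps: Callable[[Emit], Awaitable[None]],
    cancel: asyncio.Event,
    events: List[Dict[str, Any]],
) -> None:
    seq = 0

    def emit(event_type: str, payload: Dict[str, Any]) -> int:
        nonlocal seq
        seq += 1
        events.append({"seq": seq, "type": event_type, "payload": payload})
        return seq

    emit("run.started", {})
    work = asyncio.ensure_future(steps(emit))
    stop = asyncio.ensure_future(cancel.wait())
    await asyncio.wait({work, stop}, return_when=asyncio.FIRST_COMPLETED)

    if work.done():
        stop.cancel()
        work.result()  # re-raises step errors; a real server would emit run.failed
        emit("run.completed", {})
    else:
        work.cancel()  # raises CancelledError inside the in-flight step
        await asyncio.gather(work, return_exceptions=True)
        # Read seq only after the work has fully stopped emitting.
        emit("run.cancelled", {"last_committed_seq": seq})


async def slow_steps(emit: Emit) -> None:
    emit("tool.started", {"label": "Looking up order"})
    await asyncio.sleep(10)  # stand-in for a slow tool call
    emit("tool.completed", {})


async def demo() -> List[Dict[str, Any]]:
    events: List[Dict[str, Any]] = []
    cancel = asyncio.Event()
    runner = asyncio.ensure_future(drive_run(slow_steps, cancel, events))
    await asyncio.sleep(0.01)
    cancel.set()  # what POST /runs/{id}/cancel would trigger
    await runner
    return events
```

Running asyncio.run(demo()) yields run.started, tool.started, then run.cancelled; the slow tool never reports completion because it was aborted mid-sleep.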

On tool failure, stream tool.failed with a redacted error code before the model recovery turn so the UI shows which step broke. If the model retries internally, emit a new run.step rather than rewriting history; clients append a timeline instead of mutating past rows.

Harbor Support refactor (worked example)

Harbor’s first agent API returned 200 with a complete JSON blob. Support reps on hotel Wi-Fi saw frozen spinners through order lookup (2.1s), inventory check (3.4s), and refund eligibility (2.7s) before any assistant text. Abandon telemetry blamed “model latency.”

The rebuild introduced:

  1. SSE endpoint GET /runs/{id}/events opened immediately after POST /runs returned run_id.
  2. Instant run.started plus tool.started frames with human labels (“Looking up order #48291”) instead of raw JSON arguments.
  3. Token streaming only on the final drafting step; earlier steps showed progress bars, not partial model output.
  4. Last-Event-ID replay for 60s after disconnect.
  5. Cancel button wired to the shared cancellation service.

Median time-to-first-visible-progress dropped from 8.2s to 180ms. Session abandon fell from 34% to 6%. Median handle time was unchanged: streaming did not make the tools faster, it made the waits tolerable.
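Item 4 of the rebuild, Last-Event-ID replay with a 60s window, could be approximated by a bounded buffer like the one below. This is not Harbor's implementation; ReplayBuffer is a hypothetical class, the TTL is measured from each event's write time, and a None result means the client should fall back to a snapshot fetch.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional, Tuple


class ReplayBuffer:
    """Keeps recent events for one run so reconnecting clients can resume."""

    def __init__(self, ttl_seconds: float = 60.0, max_events: int = 1000) -> None:
        self.ttl_seconds = ttl_seconds
        # maxlen bounds memory even if the TTL is generous.
        self._events: Deque[Tuple[float, Dict[str, Any]]] = deque(maxlen=max_events)

    def append(self, event: Dict[str, Any], now: float) -> None:
        self._events.append((now, event))

    def replay_after(self, last_event_id: int, now: float) -> Optional[List[Dict[str, Any]]]:
        """Return events with seq > last_event_id, or None if the gap cannot be served."""
        # Drop events that have aged out of the replay window.
        while self._events and now - self._events[0][0] > self.ttl_seconds:
            self._events.popleft()
        if not self._events:
            return None
        oldest_seq = self._events[0][1]["seq"]
        if oldest_seq > last_event_id + 1:
            return None  # events the client needs were already evicted
        return [event for _, event in self._events if event["seq"] > last_event_id]
```

Passing the clock in as now keeps the buffer easy to test; production code would likely pass time.monotonic().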

Technique decision table

Scenario                  | Prefer                            | Avoid
Browser support chat      | SSE + cancel POST                 | Blocking JSON until run ends
IDE copilot with edits    | WebSocket bidirectional           | Polling run status
CLI / batch worker        | NDJSON chunked HTTP               | SSE without reconnect story
Long sandbox tools (30s+) | Tool progress + heartbeats        | Silent gap then text dump
Regulated transcripts     | Stream + durable event log        | Ephemeral-only buffers
High fan-out dashboards   | Sampled read-only SSE             | Full token mirror per admin
Mobile flaky networks     | Seq replay + terminal idempotency | Assume single uninterrupted socket

Common pitfalls

  • Streaming raw tool arguments — leaks PII and confuses users; sanitize labels server-side.
  • No terminal event on error — clients spin forever; always emit run.failed or run.cancelled.
  • Proxy buffering — nginx batches SSE until buffer fills; disable buffering on agent routes.
  • Mixing runs on one connection — racy UI updates when tabs share a session.
  • Unbounded seq replay — reconnect storms replay megabytes; cap replay window and fall back to snapshot fetch.
  • Streaming before authZ — open SSE only after run ownership is verified.
  • Treating deltas as commits — business actions still require tool success events, not partial text.
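For the first pitfall, a server-side sanitizer might turn raw tool arguments into a human label plus a redacted summary before anything reaches the stream. Everything named here is hypothetical: the SENSITIVE_KEYS set, the LABELS templates, and the tool names are placeholders, and the sketch only redacts top-level keys.

```python
from typing import Any, Dict

# Hypothetical deny-list; a real deployment would derive this from its data policy.
SENSITIVE_KEYS = {"email", "phone", "address", "card_number"}

# Hypothetical server-owned label templates keyed by tool name.
LABELS = {
    "lookup_order": "Looking up order #{order_id}",
    "issue_refund": "Issuing refund",
}


def redact_args(args: Dict[str, Any]) -> Dict[str, Any]:
    """Replace sensitive top-level values; nested structures are not inspected."""
    return {k: ("[redacted]" if k in SENSITIVE_KEYS else v) for k, v in args.items()}


def tool_started_payload(tool_name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Build the outbound tool.started payload from raw tool arguments."""
    safe_args = redact_args(args)
    template = LABELS.get(tool_name, "Running a tool")
    try:
        # Format from the redacted copy so a template can never surface raw PII.
        label = template.format_map(safe_args)
    except KeyError:
        label = "Running a tool"
    return {"tool": tool_name, "label": label, "args_summary": safe_args}
```

An allow-list of safe keys would be stricter than this deny-list and is probably the better default where argument shapes are known.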

Production checklist

  • Versioned event envelope with monotonic seq per run.
  • SSE (or WebSocket) opens before first tool executes; immediate progress frame.
  • Separate event types for text, tools, approvals, errors, and terminal status.
  • Disable reverse-proxy buffering; send keepalive during long tools.
  • Coalesce token deltas; apply per-connection backpressure caps.
  • Cancel endpoint aborts provider stream and sandboxes; emit terminal event.
  • Support Last-Event-ID / since_seq replay with TTL.
  • Persist final run snapshot for clients that connect after stream ends.
  • Redact tool args and observations in outbound events.
  • Trace stream write latency and abandon rate alongside model TTFT.

Key takeaways

  • Agent streaming is progress streaming — tools need visible frames, not just tokens.
  • SSE is the browser default; WebSockets when bidirectional control is first-class.
  • Ordering and terminal idempotency make reconnect safe on flaky networks.
  • Backpressure protects the API tier from slow or abandoned tabs.
  • Harbor Support cut abandon from 34% to 6% without speeding up tools — only by showing work earlier.
