Guide

LLM streaming responses explained

Harbor Support shipped a blocking chat API in early 2025: users typed a question, stared at a spinner for eight to twelve seconds, then received the full answer at once. Session abandonment on long replies hit 34%. The refactor switched every customer-facing surface to token streaming over Server-Sent Events (SSE), flushed the first delta within 400 ms p95, and piped partial text into a live markdown renderer. Abandonment dropped to 11% even though total generation time was unchanged. Perceived latency — not wall-clock completion — was the product bug.

Streaming delivers model output incrementally as tokens are decoded, rather than buffering the full completion server-side. It touches wire protocols, reverse-proxy buffering, client parsers, cancellation, and how tool calls interleave with visible text. This guide covers why streaming matters, SSE versus WebSocket versus chunked HTTP, server flush cadence and backpressure, client rendering patterns, ties to inference serving and observability, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why stream: time-to-first-token beats total latency

Human patience in chat UIs is measured in hundreds of milliseconds, not seconds. Time to first token (TTFT) — from request accepted to first visible character — dominates perceived responsiveness. A 2,000-token answer that starts rendering at 350 ms feels fast; the same answer delivered as one blob after 6 s feels broken even if both finish at the same instant.

Streaming also enables early user action: reading while the model continues, cancelling off-topic generations, and surfacing retrieval citations as soon as the model references them. For agent loops, partial tool-call JSON lets the UI show “Searching tickets…” before the function executes.

The trade-off is engineering complexity. You need idempotent chunk handling, reconnect semantics, and careful accounting so billing and logging still reflect the final token count when a user aborts mid-stream.

Wire protocols: SSE, WebSocket and chunked HTTP

Most production chat APIs expose streaming through one of three transports:

Transport	Direction	Typical use	Pros	Cons
Server-Sent Events (SSE)	Server to client	OpenAI-compatible chat completions, Anthropic messages	Simple HTTP, auto-reconnect when using EventSource, works through most proxies	Unidirectional; separate POST for input
WebSocket	Bidirectional	Voice agents, multi-turn on one socket	Low overhead for many round trips	Proxy and load-balancer sticky-session pain
Chunked HTTP body	Server to client	Custom NDJSON streams	Minimal framing overhead	No standard reconnect; you define schema

SSE frames are text lines: data: {"delta":"Hello"} terminated by a blank line. Clients read them with fetch and a readable stream, or with EventSource, which supports GET only and so cannot carry a POST body. OpenAI-style APIs emit data: [DONE] as the terminal event. Disable reverse-proxy buffering on the streaming route (on nginx, return the X-Accel-Buffering: no response header or set proxy_buffering off); otherwise tokens arrive in multi-second lumps that erase the TTFT gain.
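To make the framing concrete, here is a minimal Python sketch of how a server might serialize events. The helper names (format_sse, format_done, format_heartbeat) are hypothetical, and the payload shape simply follows the data: {"delta":"Hello"} example above; real providers define their own event schemas.

```python
import json


def format_sse(payload: dict) -> str:
    """Serialize one event as an SSE frame: a data line plus a blank line."""
    # json.dumps escapes newlines inside strings, so the payload always
    # fits on a single data line. Compact separators keep the frame small.
    body = json.dumps(payload, separators=(",", ":"))
    return f"data: {body}\n\n"


def format_done() -> str:
    """Terminal sentinel used by OpenAI-style APIs."""
    return "data: [DONE]\n\n"


def format_heartbeat() -> str:
    """SSE comment line; conforming clients ignore lines starting with ':'."""
    return ": ping\n\n"


wire = (
    format_sse({"delta": "Hello"})
    + format_sse({"delta": " world"})
    + format_done()
)
print(wire, end="")
```

Each frame ends with a blank line, so a client can split the byte stream on that boundary regardless of how the network chunks it.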

Server lifecycle: prefill, decode loop and flush cadence

A streaming request passes through the same stages as blocking inference, but the server yields after each decode step:

Accept and validate — auth, rate limits, prompt token count against context window.
Prefill — process the prompt in one or more forward passes; TTFT includes queue wait plus prefill time. Long RAG contexts dominate here; see KV cache sizing.
Decode loop — sample one token, append to sequence, serialize delta, flush to socket. Repeat until EOS or max tokens.
Terminal event — send finish reason, usage stats, and close stream.

Flush cadence balances syscall overhead against smooth UI updates. Emitting every token minimizes latency but raises per-token CPU and syscall cost; batching two to five tokens per flush is common for fast models. Never let a connection sit silent past a proxy's idle timeout: while an agent tool step is running, send SSE comment heartbeats (: ping) so the connection survives a 10 s database lookup.
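One way to implement a flush cadence is to coalesce tokens before each write. The sketch below is a hypothetical helper, not a serving-engine API. Flushing the very first token on its own is a design choice made here so that batching never delays TTFT; the batch size of three is only an illustrative default.

```python
from typing import Iterable, Iterator, List


def batch_deltas(tokens: Iterable[str], batch_size: int = 3) -> Iterator[str]:
    """Coalesce decoded tokens into flush-sized deltas.

    The first token is flushed immediately; later tokens are grouped
    batch_size at a time, and any remainder is flushed at the end.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending: List[str] = []
    first = True
    for token in tokens:
        pending.append(token)
        if first or len(pending) >= batch_size:
            yield "".join(pending)
            pending.clear()
            first = False
    if pending:
        # Flush whatever is left once generation stops.
        yield "".join(pending)


if __name__ == "__main__":
    for delta in batch_deltas(["a", "b", "c", "d", "e", "f"], batch_size=3):
        print(repr(delta))
```

With batch_size=3, the six tokens above are flushed as "a", "bcd", and "ef": the concatenated text is unchanged, only the number of writes drops.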

Backpressure appears when the client reads slower than the model generates. TCP buffers absorb short bursts; sustained mismatch means memory growth on the server. Honor AbortSignal / connection close: stop GPU decode promptly and free KV slots in the serving engine so cancelled chats do not block concurrent users.
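The interaction between backpressure and cancellation can be sketched with asyncio. This is a toy model under stated assumptions: decode stands in for a real decode loop, the bounded queue stands in for the socket buffer, and stream simulates a client that disconnects after read_limit tokens. None of these names come from a real serving engine.

```python
import asyncio
import contextlib


async def decode(queue: asyncio.Queue, n_tokens: int, stats: dict) -> None:
    """Stand-in for a decode loop feeding a bounded queue."""
    try:
        for i in range(n_tokens):
            # put() waits while the queue is full, so a slow reader
            # throttles the producer instead of growing server memory.
            await queue.put(f"tok{i}")
            stats["decoded"] += 1
        await queue.put(None)  # sentinel: generation finished normally
    except asyncio.CancelledError:
        # A real engine would stop decoding and free KV slots here.
        stats["cancelled"] = True
        raise


async def stream(n_tokens: int, read_limit: int, queue_size: int = 4):
    """Read up to read_limit tokens, then simulate a client disconnect."""
    stats = {"decoded": 0, "cancelled": False}
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    producer = asyncio.create_task(decode(queue, n_tokens, stats))
    received = []
    while len(received) < read_limit:
        token = await queue.get()
        if token is None:
            break
        received.append(token)
    producer.cancel()  # no-op if the producer already finished
    with contextlib.suppress(asyncio.CancelledError):
        await producer
    return received, stats


if __name__ == "__main__":
    tokens, stats = asyncio.run(stream(n_tokens=1000, read_limit=5))
    print(len(tokens), stats)
```

Because the queue is bounded, the producer can never run more than queue_size tokens ahead of the reader, and cancelling the task stops work at the next await rather than after all 1,000 tokens.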

Client patterns: parsers, markdown and cancellation

Browser clients typically:

Open a POST with stream: true, read response.body through a TextDecoderStream, and split on SSE event boundaries.
Accumulate delta.content strings into a buffer; re-render markdown on animation frames (requestAnimationFrame) to avoid layout thrash.
Wire the Stop button to AbortController.abort(), which closes the fetch and propagates cancel to the upstream inference worker when configured.
Handle out-of-order or duplicate chunks defensively — use monotonic sequence numbers if your API provides them.
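The parsing and de-duplication steps above can be sketched in Python, although a browser client would do the same in JavaScript. The event shape ({"seq": ..., "delta": ...}) is an assumption for illustration; use whatever sequence field your API actually provides. The parser also assumes LF or CRLF line endings.

```python
import json
from typing import Dict, Iterable, List


class SSEParser:
    """Incremental SSE parser that tolerates arbitrary chunk boundaries."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, chunk: str) -> List[str]:
        """Return the data payload of every event completed by this chunk."""
        self._buf = (self._buf + chunk).replace("\r\n", "\n")
        events: List[str] = []
        while "\n\n" in self._buf:
            raw, self._buf = self._buf.split("\n\n", 1)
            data_lines = []
            for line in raw.split("\n"):
                if line.startswith("data:"):
                    value = line[5:]
                    # The spec strips one optional space after the colon.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:  # comment-only frames such as ": ping" are skipped
                events.append("\n".join(data_lines))
        return events


def assemble(chunks: Iterable[str]) -> str:
    """Rebuild the final text, ignoring duplicates and reordering by seq."""
    parser = SSEParser()
    parts: Dict[int, str] = {}
    for chunk in chunks:
        for data in parser.feed(chunk):
            if data == "[DONE]":
                return "".join(parts[k] for k in sorted(parts))
            event = json.loads(data)
            parts.setdefault(event["seq"], event["delta"])
    return "".join(parts[k] for k in sorted(parts))
```

Keying deltas by sequence number means a replayed chunk is a no-op and late chunks still land in the right position, at the cost of holding the parts in memory until the stream ends.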

For tool and function calls, providers stream partial JSON in delta.tool_calls arrays. Buffer until arguments parse as valid JSON before execution; display a provisional label from the function name field as soon as it arrives. This pairs with structured outputs when you need schema-valid payloads before side effects.
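A minimal sketch of that buffering rule follows. ToolCallBuffer and its methods are hypothetical names, and the fragments are passed in as plain strings rather than a specific provider's delta.tool_calls structure. Requiring the parsed value to be a JSON object guards against a prefix such as 12 parsing as a valid number before the full payload has arrived.

```python
import json
from typing import Optional


class ToolCallBuffer:
    """Accumulate streamed tool-call fragments until the arguments parse."""

    def __init__(self) -> None:
        self.name: Optional[str] = None
        self._args = ""

    def add(self, name: Optional[str] = None, arguments: str = "") -> None:
        """Record a fragment; the name usually arrives before the arguments."""
        if name:
            self.name = name
        self._args += arguments

    def label(self) -> Optional[str]:
        """Provisional status text, available as soon as the name is known."""
        return f"Calling {self.name}" if self.name else None

    def arguments(self) -> Optional[dict]:
        """Return parsed arguments, or None while the JSON is incomplete."""
        try:
            parsed = json.loads(self._args)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, dict) else None


if __name__ == "__main__":
    buf = ToolCallBuffer()
    buf.add(name="search_tickets")
    buf.add(arguments='{"query": "ref')
    print(buf.label(), buf.arguments())  # label is ready, arguments are not
    buf.add(arguments='und"}')
    print(buf.arguments())
```

Only execute the tool once arguments() returns a dict, and validate that dict against the tool's schema before any side effect.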

Log final assembled text plus per-chunk timing on the client for debugging, but treat server-side usage records as the billing source of truth.
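Per-chunk timing reduces to simple arithmetic on arrival timestamps. The sketch below assumes the client records a monotonic timestamp (in seconds) when the request is sent and when each chunk arrives; the function name and the returned keys are illustrative only.

```python
from typing import Dict, List, Optional


def stream_timing(request_start: float, chunk_times: List[float]) -> Dict[str, Optional[object]]:
    """Derive TTFT and inter-chunk gaps from arrival timestamps."""
    if not chunk_times:
        return {"ttft": None, "gaps": [], "max_gap": None}
    ttft = chunk_times[0] - request_start
    # Gap between each chunk and the one before it.
    gaps = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
    return {"ttft": ttft, "gaps": gaps, "max_gap": max(gaps) if gaps else None}


if __name__ == "__main__":
    timing = stream_timing(0.0, [0.5, 0.75, 1.0, 2.0])
    print(timing)
```

A large max_gap alongside a healthy TTFT usually points at a stall mid-stream, such as a tool call or a buffering proxy, rather than slow prefill.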

Harbor Support refactor: from blocking JSON to live SSE

The Harbor Support migration followed four steps:

API gateway — enabled streaming on the chat route, disabled nginx buffering, set 120 s read timeout with SSE comment pings every 15 s during tool calls.
UI shell — replaced spinner with a typing cursor; throttled markdown re-parses to 60 fps; citations rendered inline when delta.annotations arrived.
Cancel path — Stop button aborted fetch; server mapped disconnect to inference cancel, cutting wasted tokens on abandoned threads by 28%.
Observability — traced TTFT, inter-token latency p95, and cancel rate per model version in the ops dashboard, feeding the canary rollout gate (“no promote if TTFT regresses > 15%”).
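The canary rule quoted in the last step can be expressed as a small check. This is a sketch, not Harbor Support's actual gate: it assumes the gate compares p95 TTFT between a baseline and a canary sample, and it uses the nearest-rank percentile definition, which is one of several reasonable choices.

```python
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def may_promote(
    baseline_ttft_ms: Sequence[float],
    canary_ttft_ms: Sequence[float],
    max_regression: float = 0.15,
) -> bool:
    """Allow promotion only if canary p95 TTFT regresses by at most 15%."""
    limit = p95(baseline_ttft_ms) * (1 + max_regression)
    return p95(canary_ttft_ms) <= limit


if __name__ == "__main__":
    baseline = list(range(301, 401))        # 100 samples, p95 = 395 ms
    slower = [x + 80 for x in baseline]     # p95 = 475 ms
    print(may_promote(baseline, slower))
```

In practice the samples would come from the traces described above, grouped per model version.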

Support CSAT rose 0.4 points with no model upgrade — evidence that delivery mechanics are product features, not infrastructure details.

Technique decision table

Approach	TTFT perception	Complexity	Best when	Avoid when
SSE token stream	Excellent	Medium	Chat UIs, OpenAI-compatible clients	Binary payloads, sub-10 ms bidirectional
Blocking JSON	Poor	Low	Batch jobs, short answers, webhooks	Interactive chat, long completions
WebSocket multiplex	Excellent	High	Voice, live editing, many turns per socket	Simple stateless REST behind CDN
Simulated typing (fake stream)	Good	Low	Cached or templated replies	Truthful latency SLAs, agent tool steps
Server push + client poll hybrid	Moderate	Medium	Legacy mobile clients without streams	Greenfield web apps

Common pitfalls

Proxy buffering — nginx, Cloudflare, and similar proxies can hold SSE output until a buffer fills; TTFT looks healthy in server logs while users wait seconds for the first token.
No cancel propagation — client closes tab but GPU keeps decoding; concurrency collapses under load.
Markdown re-parse every token — main-thread jank on long answers; throttle renders.
Executing partial tool JSON — malformed arguments cause spurious API calls; validate before side effects.
Missing terminal event — clients hang if neither [DONE] nor an error event arrives on error paths; always end a failed stream with a structured error payload before closing.
Ignoring reconnect — mobile networks drop SSE; decide whether to resume (hard) or show retry (simpler).
Streaming PII — tokens hit browser memory immediately; apply the same redaction policy as blocking responses.
Billing on partial streams — charge for tokens actually generated, including any decoded after a user abort if your policy requires it; bill from server-side usage records and document the policy clearly.
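The missing-terminal-event pitfall in particular is easy to guard against in code. The sketch below wraps a token source so that a failure mid-stream still produces a structured error frame followed by a terminal frame. The error payload shape and the choice to send [DONE] after the error are assumptions for illustration, not a provider standard.

```python
import json
from typing import Iterable, Iterator


def frame(payload: dict) -> str:
    """Serialize one payload as an SSE data frame."""
    return f"data: {json.dumps(payload)}\n\n"


def safe_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield SSE frames and always finish with a terminal frame."""
    try:
        for token in tokens:
            yield frame({"delta": token})
    except Exception as exc:
        # Surface the failure to the client instead of going silent.
        yield frame({"error": {"type": type(exc).__name__, "message": str(exc)}})
    yield "data: [DONE]\n\n"


def failing_tokens() -> Iterator[str]:
    """Toy token source that crashes after one token."""
    yield "Hi"
    raise RuntimeError("engine crashed")


if __name__ == "__main__":
    for line in safe_stream(failing_tokens()):
        print(line, end="")
```

A client disconnect raises GeneratorExit inside the generator, which is deliberately not caught here, so an aborted stream simply stops without emitting further frames.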

Production checklist

Set a p95 TTFT SLO (e.g. < 500 ms) and alert on regression per model version.
Disable reverse-proxy buffering on streaming routes; verify with curl -N.
Emit SSE heartbeats during long tool-call gaps.
Map client disconnect and AbortSignal to inference cancel within 200 ms.
Return usage block in terminal event; reconcile with observability traces.
Throttle client markdown renders; measure main-thread long tasks.
Buffer tool-call JSON until schema-valid; show provisional status labels.
Handle mid-stream errors with structured error events, not a silent hang.
Load-test concurrent streams against KV cache capacity, not just QPS.
Include streaming routes in canary deploy gates alongside quality metrics.

Key takeaways

Streaming optimizes perceived latency via low TTFT — total generation time can stay the same while UX improves dramatically.
SSE over HTTP is the default for chat APIs; disable proxy buffering and flush decode steps promptly.
Cancel propagation frees GPU memory; treat abort as a first-class inference signal, not a client-only nicety.
Tool-call streams need JSON buffering before execution; structured output modes reduce parse risk.
Harbor Support cut session abandonment by streaming tokens and measuring TTFT in observability-driven canary gates.