LLM streaming responses explained
Harbor Support shipped a blocking chat API in early 2025: users typed a question, stared at a spinner for eight to twelve seconds, then received the full answer at once. Session abandonment on long replies hit 34%. The refactor switched every customer-facing surface to token streaming over Server-Sent Events (SSE), flushed the first delta within 400 ms p95, and piped partial text into a live markdown renderer. Abandonment dropped to 11% even though total generation time was unchanged. Perceived latency — not wall-clock completion — was the product bug.
Streaming delivers model output incrementally as tokens are decoded, rather than buffering the full completion server-side. It touches wire protocols, reverse-proxy buffering, client parsers, cancellation, and how tool calls interleave with visible text. This guide covers why streaming matters, SSE versus WebSocket versus chunked HTTP, server flush cadence and backpressure, client rendering patterns, ties to inference serving and observability, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
Why stream: time-to-first-token beats total latency
Human patience in chat UIs is measured in hundreds of milliseconds, not seconds. Time to first token (TTFT) — from request accepted to first visible character — dominates perceived responsiveness. A 2,000-token answer that starts rendering at 350 ms feels fast; the same answer delivered as one blob after 6 s feels broken even if both finish at the same instant.
Streaming also enables early user action: reading while the model continues, cancelling off-topic generations, and surfacing retrieval citations as soon as the model references them. For agent loops, partial tool-call JSON lets the UI show “Searching tickets…” before the function executes.
The trade-off is engineering complexity. You need idempotent chunk handling, reconnect semantics, and careful accounting so billing and logging still reflect the final token count when a user aborts mid-stream.
Wire protocols: SSE, WebSocket and chunked HTTP
Most production chat APIs expose streaming through one of three transports:
| Transport | Direction | Typical use | Pros | Cons |
|---|---|---|---|---|
| Server-Sent Events (SSE) | Server to client | OpenAI-compatible chat completions, Anthropic messages | Simple HTTP, auto-reconnect, works through most proxies | Unidirectional; separate POST for input |
| WebSocket | Bidirectional | Voice agents, multi-turn on one socket | Low overhead for many round trips | Proxy and load-balancer sticky-session pain |
| Chunked HTTP body | Server to client | Custom NDJSON streams | Minimal framing overhead | No standard reconnect; you define schema |
An SSE event is one or more text lines, such as `data: {"delta":"Hello"}`, terminated by a blank line. Clients use `fetch` with a readable stream or `EventSource` (GET-only). OpenAI-style APIs emit `data: [DONE]` as the terminal event. Disable reverse-proxy buffering (for nginx, send the `X-Accel-Buffering: no` response header) or tokens batch into multi-second lumps that destroy TTFT gains.
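The framing described above can be sketched as a small parser. This is a minimal illustration, not a full SSE implementation: it handles only `data:` fields and comment lines, and it assumes the `{"delta": "..."}` payload shape used in the example above. The function names `iter_sse_data` and `collect_text` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            # Comment line, for example a ": ping" heartbeat. Ignore it.
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate delta strings until the OpenAI-style [DONE] sentinel."""
    pieces = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        # Assumed payload shape: {"delta": "<text>"}
        pieces.append(json.loads(payload)["delta"])
    return "".join(pieces)
```

In a browser the same logic would run over decoded chunks of the response body; the line-based input here keeps the sketch independent of any HTTP library.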
Server lifecycle: prefill, decode loop and flush cadence
A streaming request passes through the same stages as blocking inference, but the server yields after each decode step:
- Accept and validate — auth, rate limits, prompt token count against context window.
- Prefill — process the prompt in one or more forward passes; TTFT includes queue wait plus prefill time. Long RAG contexts dominate here; see KV cache sizing.
- Decode loop — sample one token, append to sequence, serialize delta, flush to socket. Repeat until EOS or max tokens.
- Terminal event — send finish reason, usage stats, and close stream.
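As a rough sketch of the decode loop and terminal event, the generator below emits one frame per token and then a final frame carrying the finish reason and usage. The token source is a plain iterable standing in for a real inference engine, and the field names `finish_reason`, `usage`, `prompt_tokens`, and `completion_tokens` are assumptions modeled on OpenAI-style responses rather than a confirmed schema.

```python
import json
from typing import Iterable, Iterator


def sse_frame(payload: dict) -> str:
    """Serialize one event as an SSE data line followed by a blank line."""
    return "data: " + json.dumps(payload) + "\n\n"


def stream_completion(decoded: Iterable[str], prompt_tokens: int, max_tokens: int) -> Iterator[str]:
    """Yield one frame per decoded token, then a terminal frame and [DONE]."""
    completion_tokens = 0
    finish_reason = "stop"  # assumes the token source is exhausted at EOS
    for token in decoded:
        if completion_tokens >= max_tokens:
            # The source still had tokens to give, so the limit cut it short.
            finish_reason = "length"
            break
        completion_tokens += 1
        # A real server would flush the socket after this yield.
        yield sse_frame({"delta": token})
    yield sse_frame({
        "finish_reason": finish_reason,
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
        },
    })
    yield "data: [DONE]\n\n"
```

Validation and prefill are omitted; in this sketch they would happen before the first `yield`, which is why queue wait and prefill time both count toward TTFT.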
Flush cadence balances syscall overhead against smooth UI updates. Emitting every single token minimizes latency but increases CPU; batching two to five tokens per flush is common for fast models. Never hold buffers across proxy timeouts: send comment heartbeats (`: ping` in SSE) on idle agent tool steps so connections stay alive during 10 s database lookups.
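One way to picture the batching and heartbeat behavior is the sketch below. It groups tokens into flushes of a fixed size and emits a comment ping whenever the source reports an idle tick. Using `None` to signal an idle tick is an assumption made for illustration; a real server would more likely drive heartbeats from a timer.

```python
import json
from typing import Iterable, Iterator, Optional

HEARTBEAT = ": ping\n\n"  # SSE comment frame; clients ignore it


def delta_frame(text: str) -> str:
    """Wrap accumulated token text in one SSE data frame."""
    return "data: " + json.dumps({"delta": text}) + "\n\n"


def batch_flushes(tokens: Iterable[Optional[str]], batch_size: int = 3) -> Iterator[str]:
    """Group tokens into flushes. A None entry marks an idle tick with no token."""
    pending = []
    for token in tokens:
        if token is None:
            # Idle (for example waiting on a tool call): flush what is
            # buffered so nothing is held back, then keep the connection alive.
            if pending:
                yield delta_frame("".join(pending))
                pending = []
            yield HEARTBEAT
            continue
        pending.append(token)
        if len(pending) >= batch_size:
            yield delta_frame("".join(pending))
            pending = []
    if pending:
        # Flush the remainder at end of stream.
        yield delta_frame("".join(pending))
```

Flushing the partial batch before each heartbeat is a deliberate choice here: it avoids holding visible text in the buffer while a tool call runs.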
Backpressure appears when the client reads slower than the model generates. TCP buffers absorb short bursts; sustained mismatch means memory growth on the server. Honor `AbortSignal` / connection close: stop GPU decode promptly and free KV slots in the serving engine so cancelled chats do not block concurrent users.
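The sketch below illustrates both ideas with `asyncio`: a bounded queue makes the producer wait when the consumer falls behind, and the consumer cancels the producer task when it stops reading. The producer is a stand-in for a decode loop, and `run_stream`, `read_limit`, and `queue_size` are hypothetical names; a real serving engine would release KV cache slots at the point where the task is cancelled.

```python
import asyncio
from typing import List, Tuple


async def run_stream(total_tokens: int, read_limit: int, queue_size: int = 2) -> Tuple[List[str], List[str]]:
    """Return (tokens the client received, tokens the producer generated)."""
    generated = []
    queue = asyncio.Queue(maxsize=queue_size)

    async def producer() -> None:
        for i in range(total_tokens):
            await asyncio.sleep(0)  # yield control, as a decode step would
            token = "t%d" % i
            generated.append(token)
            # put() suspends while the queue is full: this is the backpressure.
            await queue.put(token)
        await queue.put(None)  # sentinel marking end of stream

    task = asyncio.create_task(producer())
    received = []
    try:
        # Simulate a client that disconnects after read_limit tokens.
        while len(received) < read_limit:
            token = await queue.get()
            if token is None:
                break
            received.append(token)
    finally:
        # Disconnect or normal end: stop the producer promptly.
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
    return received, generated
```

With a client that reads three tokens of a thousand, the producer generates only a handful more than were read, bounded by the queue size, instead of running to completion.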
Client patterns: parsers, markdown and cancellation
Browser clients typically:
- Open a POST with `stream: true`, read `response.body` through a `TextDecoderStream`, and split on SSE event boundaries.
- Accumulate `delta.content` strings into a buffer; re-render markdown on animation frames (`requestAnimationFrame`) to avoid layout thrash.
- Wire the Stop button to `AbortController.abort()`, which closes the fetch and propagates cancel to the upstream inference worker when configured.
- Handle out-of-order or duplicate chunks defensively: use monotonic sequence numbers if your API provides them.
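The defensive chunk handling in the last item can be sketched as a small assembler. It is written in Python for consistency with the other examples, though the same logic ports directly to a browser client. The per-chunk `seq` number is an assumption: many APIs do not provide one, in which case chunks can only be appended in arrival order.

```python
class DeltaAssembler:
    """Reassemble streamed text from chunks tagged with a sequence number."""

    def __init__(self) -> None:
        self._pending = {}   # seq -> text, for chunks that arrived early
        self._next_seq = 0   # next sequence number expected
        self._parts = []     # in-order text accepted so far

    def add(self, seq: int, text: str) -> None:
        # Drop duplicates: already applied, or already waiting in the buffer.
        if seq < self._next_seq or seq in self._pending:
            return
        self._pending[seq] = text
        # Apply every chunk that is now contiguous with the accepted text.
        while self._next_seq in self._pending:
            self._parts.append(self._pending.pop(self._next_seq))
            self._next_seq += 1

    @property
    def text(self) -> str:
        return "".join(self._parts)
```

A renderer would read `text` once per animation frame rather than on every `add` call, which keeps markdown re-parsing off the per-token path.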
For tool and function calls, providers stream partial JSON in `delta.tool_calls` arrays. Buffer until arguments parse as valid JSON before execution; display a provisional label from the function name field as soon as it arrives. This pairs with structured outputs when you need schema-valid payloads before side effects.
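A minimal sketch of that buffering follows. The class name, the `feed` signature, and the label text are hypothetical; real providers deliver name and argument fragments in their own delta shapes, and waiting for the provider's finish signal before executing remains the safer rule.

```python
import json
from typing import Optional


class ToolCallBuffer:
    """Accumulate streamed tool-call fragments until the arguments parse."""

    def __init__(self) -> None:
        self.name = None          # function name, once it has arrived
        self._arg_fragments = []  # raw JSON text pieces, in arrival order

    def feed(self, name: Optional[str], args_fragment: str) -> None:
        if name:
            self.name = name
        self._arg_fragments.append(args_fragment)

    def label(self) -> Optional[str]:
        """Provisional status text, available as soon as the name is known."""
        return "Running " + self.name if self.name else None

    def arguments(self) -> Optional[dict]:
        """Return parsed arguments, or None while the JSON is incomplete."""
        try:
            parsed = json.loads("".join(self._arg_fragments))
        except json.JSONDecodeError:
            return None
        # Require an object: a bare number or string could parse while
        # still being only a prefix of the final payload.
        return parsed if isinstance(parsed, dict) else None
```

Execution should happen only after `arguments()` returns a dict, and ideally after that dict also passes schema validation.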
Log final assembled text plus per-chunk timing on the client for debugging, but treat server-side usage records as the billing source of truth.
Harbor Support refactor: from blocking JSON to live SSE
The Harbor Support migration followed four steps:
- API gateway: enabled streaming on the chat route, disabled nginx buffering, set 120 s read timeout with SSE comment pings every 15 s during tool calls.
- UI shell: replaced spinner with a typing cursor; throttled markdown re-parses to 60 fps; citations rendered inline when `delta.annotations` arrived.
- Cancel path: Stop button aborted fetch; server mapped disconnect to inference cancel, cutting wasted tokens on abandoned threads by 28%.
- Observability: traced TTFT, inter-token latency p95, and cancel rate per model version in the ops dashboard, feeding the canary rollout gate (“no promote if TTFT regresses > 15%”).
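The canary gate quoted in the last step could be expressed roughly as follows. This is an illustrative sketch, not Harbor Support's actual gate: it assumes the comparison is made on p95 TTFT, uses a nearest-rank percentile, and the function names are hypothetical.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def may_promote(baseline_ttft_ms: Sequence[float],
                canary_ttft_ms: Sequence[float],
                max_regression: float = 0.15) -> bool:
    """Allow promotion only if canary p95 TTFT is within the regression budget."""
    limit = p95(baseline_ttft_ms) * (1 + max_regression)
    return p95(canary_ttft_ms) <= limit
```

With a baseline p95 of 1900 ms, for example, the budget allows a canary p95 of up to 2185 ms before the gate blocks promotion.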
Support CSAT rose 0.4 points with no model upgrade — evidence that delivery mechanics are product features, not infrastructure details.
Technique decision table
| Approach | TTFT perception | Complexity | Best when | Avoid when |
|---|---|---|---|---|
| SSE token stream | Excellent | Medium | Chat UIs, OpenAI-compatible clients | Binary payloads, sub-10 ms bidirectional |
| Blocking JSON | Poor | Low | Batch jobs, short answers, webhooks | Interactive chat, long completions |
| WebSocket multiplex | Excellent | High | Voice, live editing, many turns per socket | Simple stateless REST behind CDN |
| Simulated typing (fake stream) | Good | Low | Cached or templated replies | Truthful latency SLAs, agent tool steps |
| Server push + client poll hybrid | Moderate | Medium | Legacy mobile clients without streams | Greenfield web apps |
Common pitfalls
- Proxy buffering: nginx and Cloudflare can hold SSE output until a buffer fills; TTFT looks healthy in server logs while users wait seconds.
- No cancel propagation: client closes tab but GPU keeps decoding; concurrency collapses under load.
- Markdown re-parse every token: main-thread jank on long answers; throttle renders.
- Executing partial tool JSON: malformed arguments cause spurious API calls; validate before side effects.
- Missing terminal event: clients hang if `[DONE]` never arrives on error paths; always close with an error payload.
- Ignoring reconnect: mobile networks drop SSE; decide whether to resume (hard) or show retry (simpler).
- Streaming PII: tokens hit browser memory immediately; apply the same redaction policy as blocking responses.
- Billing on partial streams: charge for tokens actually generated, including after user abort if your policy requires it; document the policy clearly.
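The missing-terminal-event pitfall can be guarded against with a wrapper like the one below. It is a sketch under assumptions: the `error` payload shape is invented for illustration, and sending `[DONE]` after the error frame is one possible convention, not a documented provider behavior.

```python
import json
from typing import Iterable, Iterator


def sse_frame(payload: dict) -> str:
    """Serialize one event as an SSE data line followed by a blank line."""
    return "data: " + json.dumps(payload) + "\n\n"


def with_terminal_event(deltas: Iterable[str]) -> Iterator[str]:
    """Stream deltas and always finish with a terminal frame, even on failure."""
    try:
        for delta in deltas:
            yield sse_frame({"delta": delta})
    except Exception as exc:
        # Surface the failure to the client instead of going silent.
        yield sse_frame({"error": {"type": type(exc).__name__, "message": str(exc)}})
    # Reached on both the success and the error path.
    yield "data: [DONE]\n\n"
```

A client disconnect closes the generator with `GeneratorExit`, which is not an `Exception` subclass, so the wrapper does not try to write to a connection that is already gone.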
Production checklist
- Set p95 TTFT SLO (e.g. < 500 ms) and alert on regression per model version.
- Disable reverse-proxy buffering on streaming routes; verify with curl -N.
- Emit SSE heartbeats during long tool-call gaps.
- Map client disconnect and AbortSignal to inference cancel within 200 ms.
- Return usage block in terminal event; reconcile with observability traces.
- Throttle client markdown renders; measure main-thread long tasks.
- Buffer tool-call JSON until schema-valid; show provisional status labels.
- Handle error mid-stream with structured error events, not silent hang.
- Load-test concurrent streams against KV cache capacity, not just QPS.
- Include streaming routes in canary deploy gates alongside quality metrics.
Key takeaways
- Streaming optimizes perceived latency via low TTFT — total generation time can stay the same while UX improves dramatically.
- SSE over HTTP is the default for chat APIs; disable proxy buffering and flush decode steps promptly.
- Cancel propagation frees GPU memory; treat abort as a first-class inference signal, not a client-only nicety.
- Tool-call streams need JSON buffering before execution; structured output modes reduce parse risk.
- Harbor Support cut session abandonment by streaming tokens and measuring TTFT in observability-driven canary gates.
Related reading
- LLM inference serving explained — continuous batching, queues, and decode scheduling behind streams
- LLM observability explained — trace TTFT, token latency, and cancel rates in production
- LLM canary and shadow deployment explained — gate rollouts on streaming SLO regressions
- LLM structured outputs explained — safe parsing for streamed tool arguments