LLM voice agent pipeline explained
Harbor Clinic launched a phone intake bot in early 2026: patients described symptoms, the system transcribed speech, an LLM drafted triage questions, and a neural voice read replies aloud. Analytics showed a brutal pattern — 23% of callers hung up during the first silence gap before the bot spoke. Median round-trip latency was 4.2 seconds (end of user speech to first audible syllable). Patients interpreted the pause as a dropped call. After rebuilding the pipeline with streaming speech recognition (ASR), predictive end-of-turn detection, parallel LLM pre-generation, and chunked text-to-speech (TTS) playback with barge-in, median latency fell to 1.8 seconds and abandonment dropped to 7%. Voice agents are not chat bots with a microphone attached; they are real-time control systems where timing is the product.
A production LLM voice agent pipeline chains audio capture, voice activity detection (VAD), automatic speech recognition, dialog management, LLM reasoning (often with tools), and speech synthesis, all under the hard latency and interruption constraints that humans enforce unconsciously in conversation. This guide covers the end-to-end architecture, turn-taking and barge-in, streaming versus batch ASR and TTS, latency budgeting, the Harbor Clinic refactor, a technique decision table that includes text-only fallback, common pitfalls, and an engineering checklist. Pair it with streaming LLM responses and multimodal AI models for broader context.
Pipeline stages from microphone to speaker
Most voice agents run a loop: listen → understand → think → speak. Each stage can block the next if designed naively.
- Capture and VAD — raw PCM or Opus frames arrive from WebRTC, telephony SIP, or mobile SDK. VAD classifies speech vs silence/noise so downstream ASR is not fed endless silence.
- ASR (STT) — audio becomes text, ideally with partial hypotheses while the user still talks.
- End-of-turn (EOT) detection — decides when the user finished a thought vs paused mid-sentence. Too eager triggers interruptions; too slow adds dead air.
- Dialog / LLM — system prompt, conversation history, retrieved context, and tool calls produce the next assistant message. Voice agents often use smaller models or speculative drafts for the first sentence.
- TTS — text becomes audio, streamed in clauses so the first phoneme plays before the full reply is synthesized.
- Playback and barge-in — while TTS plays, VAD listens for user speech that should cancel playback and restart the loop.
Unified “realtime” APIs collapse some boundaries by accepting audio tokens directly, but the same logical stages exist — only the wire format changes.
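One way to picture the loop is as a small turn-taking state machine. The sketch below is illustrative only: the state names, the event names, and the rule that user speech during THINKING returns to LISTENING are assumptions made for this example and do not come from any particular SDK.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()  # capture, VAD, and ASR are active
    THINKING = auto()   # dialog manager / LLM / tools are running
    SPEAKING = auto()   # TTS audio is playing


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.LISTENING, "end_of_turn"): State.THINKING,
    (State.THINKING, "first_audio_ready"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.LISTENING,
    (State.SPEAKING, "user_speech"): State.LISTENING,  # barge-in
    (State.THINKING, "user_speech"): State.LISTENING,  # assumption: user resumed talking
}


def step(state: State, event: str) -> State:
    """Apply one event; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.LISTENING) -> State:
    """Fold a sequence of events into a final state."""
    for event in events:
        state = step(state, event)
    return state


# A normal turn followed by a barge-in ends back in LISTENING.
final_state = run(["end_of_turn", "first_audio_ready", "user_speech"])
```

Keeping the transitions in one table makes it easier to log and audit every interruption path, which matters once barge-in is enabled.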
Voice activity detection and end-of-turn detection
VAD: what counts as speech?
Classical WebRTC VAD is fast but brittle on coughs, hold music, and overlapping background talk. Neural VAD models (Silero, proprietary telephony classifiers) trade a few milliseconds for fewer false starts. Production systems tune hangover (keep recording N ms after energy drops) and pre-roll (buffer audio before VAD fires) to avoid clipping word onsets.
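To make hangover and pre-roll concrete, here is a minimal sketch over per-frame energy values. It uses a plain energy threshold rather than WebRTC VAD or Silero, and the function name, the threshold, and the frame counts are all assumptions chosen for illustration.

```python
def segment_speech(frame_energies, threshold=0.5, pre_roll=2, hangover=3):
    """Return (start, end) frame-index pairs with end exclusive.

    pre_roll: frames kept before the first speech frame, to protect word onsets.
    hangover: silent frames kept after speech before the segment is closed.
    """
    segments = []
    start = None      # start index of the open segment, if any
    silent_run = 0    # consecutive silent frames inside the open segment
    last_end = 0      # pre-roll never reaches back into a closed segment

    for i, energy in enumerate(frame_energies):
        is_speech = energy >= threshold
        if start is None:
            if is_speech:
                start = max(last_end, i - pre_roll)
                silent_run = 0
        elif is_speech:
            silent_run = 0
        else:
            silent_run += 1
            if silent_run > hangover:
                # Close the segment; frames up to i-1 include the hangover.
                segments.append((start, i))
                last_end = i
                start = None

    if start is not None:
        segments.append((start, len(frame_energies)))
    return segments


# Speech at frames 3, 4, and 6; the one-frame dip at 5 does not split the segment.
energies = [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
segments = segment_speech(energies)
```

With the defaults, the single segment spans frames 1 through 9: two pre-roll frames, the speech itself, and three hangover frames.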
End-of-turn: harder than VAD
Silence alone is a weak signal. “I need to schedule…” followed by two seconds of thought is not end-of-turn; “Yes.” followed by 400 ms often is. Modern pipelines combine:
- Silence duration thresholds — adaptive by utterance length.
- Partial ASR stability — if the last tokens stop changing, EOT confidence rises.
- Prosody / punctuation predictors — small classifiers trained on labeled call-center data.
- LLM “is the user done?” probes — expensive; use only on long ambiguities.
Harbor's first bot waited for 1.2 s of silence after every VAD segment — fine for IVR menus, lethal for natural speech. The refactor used 300–600 ms adaptive thresholds plus partial-transcript stability, cutting false EOT by 41%.
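The combination of an adaptive silence threshold and partial-transcript stability can be sketched as follows. The 300–600 ms range comes from the text above; the linear scaling by word count, the 10-word cap, and the three-partial stability window are assumptions for this example, not Harbor's actual tuning.

```python
def silence_threshold_ms(word_count, min_ms=300, max_ms=600, cap_words=10):
    """Short utterances get a short threshold; it grows linearly up to cap_words."""
    words = min(max(word_count, 0), cap_words)
    return min_ms + (max_ms - min_ms) * words // cap_words


def partial_is_stable(partials, window=3):
    """True when the last `window` partial transcripts are identical."""
    if len(partials) < window:
        return False
    tail = partials[-window:]
    return all(p == tail[0] for p in tail)


def end_of_turn(partials, silence_ms):
    """Fire EOT only when the transcript stopped changing AND silence is long enough."""
    if not partials or not partials[-1].strip():
        return False
    word_count = len(partials[-1].split())
    return partial_is_stable(partials) and silence_ms >= silence_threshold_ms(word_count)


# "yes" with 400 ms of silence ends the turn; a four-word utterance needs 420 ms here.
short_answer_done = end_of_turn(["yes", "yes", "yes"], silence_ms=400)
longer_utterance_done = end_of_turn(["i need to schedule"] * 3, silence_ms=400)
```

A prosody or punctuation classifier would slot in as a third condition; logging each decision alongside the eventual transcript is what makes false positives and negatives measurable.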
Streaming ASR and partial transcripts
Batch ASR (upload whole utterance, receive final transcript) is simpler but forces the LLM to wait. Streaming ASR emits token hypotheses every 100–300 ms. The dialog manager can:
- Display captions for accessibility.
- Start LLM context assembly before EOT fires.
- Pre-fetch tool arguments when intent is obvious (“cancel my appointment on…”).
Word error rate (WER) on partials is worse than on finals, so do not commit irreversible tool calls on unstable text. Harbor staged tools: read-only lookups on partials, writes only after EOT confirmation. See speech recognition for CTC, RNN-T, and Whisper-style encoder-decoder trade-offs.
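The staging rule (read-only lookups on partials, writes only on final transcripts) can be expressed as a small gate in front of the tool dispatcher. The tool names and the `Transcript` type below are hypothetical, and denying unknown tools by default is a design assumption of this sketch.

```python
from dataclasses import dataclass

# Hypothetical tool names, grouped by whether they change state.
READ_ONLY_TOOLS = {"lookup_appointment", "search_slots"}
WRITE_TOOLS = {"cancel_appointment", "book_slot"}


@dataclass
class Transcript:
    text: str
    is_final: bool  # True only after end-of-turn confirmation


def allowed_to_run(tool_name: str, transcript: Transcript) -> bool:
    """Read-only tools may run on partials; writes require a final transcript."""
    if tool_name in READ_ONLY_TOOLS:
        return True
    if tool_name in WRITE_TOOLS:
        return transcript.is_final
    return False  # assumption: unknown tools are denied by default


partial = Transcript("cancel my appointment on", is_final=False)
final = Transcript("cancel my appointment on tuesday", is_final=True)
```

Here `allowed_to_run("lookup_appointment", partial)` passes, while `allowed_to_run("cancel_appointment", partial)` is refused until the final transcript arrives.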
Domain adaptation
Medical, legal, and product-name vocabularies crush generic ASR. Custom vocabularies, shallow fusion with n-gram language models, or fine-tuned adapters routinely cut WER 30–50% on domain terms — often more impactful than upgrading the LLM.
LLM orchestration for voice
Text chat tolerates 3 s to first token; voice does not. Strategies:
- First-sentence mode — generate a short acknowledgment (“Got it, checking your chart”) while tools run.
- Parallel tool fan-out — schedule lookups during user speech when intent classifiers fire early.
- Smaller dialog model — route complex reasoning to a background tier; voice layer stays sub-200 ms TTFT.
- Structured prompts — cap reply length; spoken answers above ~25 words feel like lectures.
Tool latency dominates many flows. A voice agent that calls five sequential APIs will feel broken regardless of ASR speed. Harbor parallelized eligibility + slot search and moved insurance verification to async SMS follow-up, saving 1.1 s per turn. For agent loops with branching tools, see ReAct agent loops — but cap max steps per voice turn.
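Running independent lookups concurrently is the simplest way to cut tool time. The sketch below uses `asyncio.gather` with two stub coroutines; the function names, the return shapes, and the 200 ms simulated delay are placeholders, not Harbor's real APIs or timings.

```python
import asyncio
import time

TOOL_DELAY_S = 0.2  # simulated network latency per tool call (assumption)


async def check_eligibility(patient_id):
    """Stub for an eligibility API call."""
    await asyncio.sleep(TOOL_DELAY_S)
    return {"patient_id": patient_id, "eligible": True}


async def search_slots(day):
    """Stub for a slot-search API call."""
    await asyncio.sleep(TOOL_DELAY_S)
    return ["09:00", "14:30"]


async def gather_context(patient_id, day):
    """Run both lookups concurrently; total time is about the slower call."""
    eligibility, slots = await asyncio.gather(
        check_eligibility(patient_id),
        search_slots(day),
    )
    return eligibility, slots


def timed_run(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


(eligibility, slots), elapsed = timed_run(gather_context("patient-1", "tuesday"))
```

Called sequentially, the two stubs would take at least 0.4 s; gathered, they finish in roughly 0.2 s. Only tools with no data dependency on each other can be fanned out this way.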
Streaming TTS, prosody, and barge-in
TTS latency is often the hidden bottleneck. Sentence-chunked synthesis starts playback while later clauses still render. SSML pause tags and comma boundaries become chunk split points. Neural vocoders add 50–150 ms per chunk; cloud APIs hide this behind websockets.
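Clause-level chunking can be sketched as a generator that consumes LLM tokens and emits text as soon as a punctuation boundary appears. The boundary regex and the `min_chars` guard (which stops very short fragments from being synthesized on their own) are assumptions for illustration; a production splitter would also need to handle abbreviations such as "e.g." and any SSML tags.

```python
import re

# A clause ends at punctuation followed by whitespace, so "3.5 mg" is not split.
CLAUSE_END = re.compile(r"[.!?;,]\s")


def clause_chunks(token_stream, min_chars=12):
    """Yield speakable chunks from a stream of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            # Find the first boundary that leaves a chunk of at least min_chars.
            match = next(
                (m for m in CLAUSE_END.finditer(buffer) if m.end() >= min_chars),
                None,
            )
            if match is None:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


tokens = ["Got it, ", "checking ", "your chart. ", "One ", "moment."]
chunks = list(clause_chunks(tokens))
```

With the default `min_chars`, "Got it," is too short to stand alone, so it is merged into the first sentence. Lowering `min_chars` trades smoother prosody for an earlier first chunk.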
Barge-in (user interruption)
When the user talks over the bot, playback must stop within ~150 ms and the ASR stream must not include echo from the bot's own voice. Approaches:
- Acoustic echo cancellation (AEC) — mandatory on speakerphone and kiosk hardware.
- Ducking / mute TTS reference — subtract known playback signal from the mic path.
- Full-duplex mode — always-on ASR with playback-aware masking.
Without barge-in, callers yell “No, I said Tuesday” into a monologue. Harbor enabled barge-in on clause boundaries first, then moved to sample-accurate cancel after AEC tuning.
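At the control-flow level, barge-in amounts to racing playback against a "user started speaking" signal and cancelling whichever loses. The sketch below models that with `asyncio`; playback is simulated with sleeps, the `asyncio.Event` stands in for a VAD callback, and all names and timings are assumptions. Real stop latency also depends on audio device buffers, which this example does not model.

```python
import asyncio


async def play_chunks(chunks, played, chunk_seconds):
    """Stand-in for audio output: 'plays' each chunk by sleeping."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(chunk_seconds)


async def speak_with_barge_in(chunks, user_speech, chunk_seconds=0.05):
    """Play chunks until done or until user_speech is set; return (played, interrupted)."""
    played = []
    playback = asyncio.create_task(play_chunks(chunks, played, chunk_seconds))
    barge = asyncio.create_task(user_speech.wait())
    _, pending = await asyncio.wait(
        {playback, barge}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = playback in pending  # playback still running means the user cut in
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    return played, interrupted


async def demo(interrupt_after=None):
    """Speak ten clauses; optionally simulate the caller talking after a delay."""
    user_speech = asyncio.Event()

    async def caller():
        await asyncio.sleep(interrupt_after)
        user_speech.set()

    talker = asyncio.create_task(caller()) if interrupt_after is not None else None
    result = await speak_with_barge_in([f"clause-{i}" for i in range(10)], user_speech)
    if talker is not None:
        await talker
    return result


played, interrupted = asyncio.run(demo(interrupt_after=0.12))
```

In the interrupted run only the first few clauses are played before cancellation. The list of played chunks is worth keeping, because the dialog history should record what the caller actually heard rather than the full drafted reply.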
Latency budget template
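Target < 800 ms to first audible response for transactional flows and < 1.5 s for complex answers. Illustrative budget after the Harbor refactor (run strictly in series, the stages below sum to roughly 0.9–1.55 s, so reaching the 800 ms target requires overlapping stages or tightening them further):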
| Stage | Budget | Notes |
|---|---|---|
| EOT detection | 300–500 ms | Adaptive silence; not fixed 1.2 s |
| ASR finalize | 150–250 ms | Streaming; partials already in flight |
| LLM TTFT | 200–400 ms | Small model or cached opener |
| TTS first chunk | 200–350 ms | Clause-level synthesis |
| Playback buffer | 50 ms | Jitter on mobile networks |
Measure perceived latency (user stops talking → hears bot) separately from component latencies. Play a subtle earcon during tool calls longer than 600 ms so silence feels intentional.
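A budget like the one in the table is easy to keep honest in code. The sketch below encodes the table's ranges, sums them, and flags stages whose measured time exceeds the upper bound; the stage keys and the sample measurements are made up for illustration.

```python
# (low_ms, high_ms) per stage, copied from the budget table.
BUDGET_MS = {
    "eot_detection": (300, 500),
    "asr_finalize": (150, 250),
    "llm_ttft": (200, 400),
    "tts_first_chunk": (200, 350),
    "playback_buffer": (50, 50),
}


def total_budget(budget):
    """Sum the low and high ends, assuming the stages run strictly in series."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms, budget):
    """Return {stage: overshoot_ms} for stages above their upper bound."""
    return {
        stage: ms - budget[stage][1]
        for stage, ms in measured_ms.items()
        if ms > budget[stage][1]
    }


low_ms, high_ms = total_budget(BUDGET_MS)

# Hypothetical measurements: a fixed 1.2 s silence timeout blows the EOT budget.
overshoot = over_budget({"eot_detection": 1200, "llm_ttft": 350}, BUDGET_MS)
```

The serial totals come to 900 ms and 1,550 ms, and the sample measurement overshoots the EOT budget by 700 ms. A check like this can run against production traces to catch regressions per stage.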
Harbor Clinic refactor: 4.2 s to 1.8 s median round-trip
Week 1: instrumented waterfall traces — 38% of delay was post-EOT batch ASR, 29% sequential tool calls, 21% waiting for full TTS files. Week 2: deployed streaming ASR with medical hotwords; partial transcripts fed a triage intent classifier. Week 3: replaced fixed silence EOT with stability + adaptive thresholds. Week 4: clause-chunked TTS with barge-in on AEC-enabled telephony. Week 5: parallel eligibility API + canned openers for slot-search turns.
Outcomes: median round-trip 1.8 s (p95 2.9 s), abandonment 7% (from 23%), completed intake rate 81% (from 64%). WER on medication names improved 18% with a custom vocabulary — fewer repair loops meant fewer perceived delays.
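A waterfall trace of the kind used in Week 1 can be reduced to two numbers per call: perceived latency and each stage's share of the delay. The sketch below shows one possible shape for that calculation; the `Span` type, the stage names, and the timings are invented for the example and are not Harbor's data.

```python
from dataclasses import dataclass


@dataclass
class Span:
    stage: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def perceived_latency_ms(user_stopped_ms: int, first_audio_ms: int) -> int:
    """Time from the end of user speech to the first audible syllable."""
    return first_audio_ms - user_stopped_ms


def stage_shares(spans):
    """Percentage of summed stage time per stage (assumes non-overlapping spans)."""
    total = sum(span.duration_ms for span in spans)
    if total == 0:
        return {}
    return {span.stage: round(100 * span.duration_ms / total) for span in spans}


# Invented trace for one call, in milliseconds since the user stopped talking.
trace = [
    Span("asr_finalize", 0, 400),
    Span("tools", 400, 700),
    Span("tts_first_chunk", 700, 1000),
]
shares = stage_shares(trace)
latency = perceived_latency_ms(user_stopped_ms=0, first_audio_ms=1000)
```

Once stages overlap, as they do after a streaming refactor, the shares no longer sum to the perceived latency, which is one reason to track the end-to-end number separately.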
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Batch ASR + chat LLM + file TTS | Prototypes, voicemail summarization | Live phone or kiosk conversation |
| Streaming STT/TTS with text LLM | Most production voice agents today | Ultra-low-latency duplex without tuning |
| Unified audio-native realtime API | Sub-second duplex after vendor integration | Tool-heavy workflows needing inspectable text |
| Text chat fallback | Accessibility, noisy environments | Callers expecting natural phone dialog |
| Human handoff | High-stakes triage, angry callers | Cost-sensitive high-volume IVR replacement |
Common pitfalls
- Treating voice as slow chat: long markdown replies are unreadable aloud.
- Fixed silence timeouts: they punish thoughtful speakers and fast yes/no answers alike.
- No barge-in: users repeat themselves loudly, and ASR sees overlapping garbage.
- Committing tools on partial ASR: wrong patient IDs and canceled appointments.
- Ignoring telephony codecs: 8 kHz narrowband audio carries nothing above 4 kHz, which blurs consonants such as /s/ and /f/; use ASR models trained or tuned on telephony audio.
- Missing AEC on speakers: the bot hears itself and loops.
- Skipping consent and recording law: two-party (all-party) consent jurisdictions require disclosure before capture.
- No hallucination guard on spoken output: users trust voice more than text, so errors feel authoritative.
Engineer checklist
- Instrument end-to-end perceived latency, not just LLM TTFT.
- Use streaming ASR with domain vocabularies for your vertical.
- Implement adaptive end-of-turn; log false positives and negatives.
- Cap spoken reply length; split tools into parallel and async follow-ups.
- Stream TTS by clause; play earcons during long tool calls.
- Enable barge-in with AEC on all speaker playback paths.
- Stage tool writes after EOT-final transcripts only.
- Offer text fallback and clear human escalation phrases.
- Record consent where law requires; redact PHI in logs.
- A/B test abandonment rate, not just WER or BLEU scores.
Key takeaways
- Voice agents are real-time systems: silence reads as failure faster than a wrong answer does.
- End-of-turn detection is its own ML problem, not the same thing as VAD silence.
- Stream every stage: ASR partials, LLM tokens, and TTS clauses let work overlap.
- Harbor cut abandonment from 23% to 7% by fixing timing, not by swapping to a larger LLM.
- Barge-in and AEC are non-negotiable for speaker-based deployments.
Related reading
- Speech recognition explained — ASR pipelines, Whisper, and WER evaluation
- Text-to-speech explained — neural TTS, vocoders, and streaming synthesis
- Streaming LLM responses explained — TTFT, SSE, and websocket patterns
- Human-in-the-loop explained — escalation when automation should stop talking