Text-to-speech explained

A navigation app reads the next turn aloud. An audiobook player voices a chapter while you commute. A support bot confirms your refund in a calm, branded tone. Each scenario runs the inverse of automatic speech recognition: text-to-speech (TTS) converts written language into intelligible, natural-sounding audio. Early systems stitched together short recorded speech units such as diphones; modern neural TTS learns pronunciation, rhythm, and timbre end-to-end from hours of speech data. This guide covers the TTS pipeline from text normalization through acoustic models and vocoders, prosody and voice control, batch versus streaming deployment, zero-shot cloning risks, evaluation with mean opinion score (MOS), a Harbor Fleet driver-navigation worked example, an approach decision table, common pitfalls, and a practitioner checklist. It complements the guides on NLP fundamentals and multimodal AI.

What TTS must solve

Written text is ambiguous. “St.” might be “saint” or “street.” “2026-06-08” should sound like a date, not a subtraction problem. “SOL” in a crypto wallet context is a ticker, not a Latin sun god. TTS front ends perform text normalization (TN) and grapheme-to-phoneme (G2P) conversion so downstream models receive speakable token sequences. The acoustic stage then predicts how those units sound over time — pitch contour, duration, energy — and a vocoder renders the final waveform listeners hear through speakers or earbuds.

Quality dimensions beyond “it talks”

  • Intelligibility — every word must be understandable at target playback speed.
  • Naturalness — rhythm, coarticulation, and breath pauses feel human, not robotic.
  • Speaker consistency — the same voice identity across sessions and sentence lengths.
  • Prosody control — emphasis, questions, and list intonation match intent.
  • Latency — streaming TTS must start audio within hundreds of milliseconds for live UX.

Product teams often optimize one dimension and break another: ultra-low-latency streaming models may sacrifice emotional range; studio-quality audiobook voices may need seconds of compute per sentence. Define your latency and quality bar before choosing an architecture.

The TTS pipeline: normalization, acoustics, vocoding

Production TTS rarely feeds raw Unicode straight into a neural net. A typical stack has four layers: text processing, linguistic feature extraction, acoustic modeling, and vocoding.

Text normalization and G2P

Normalization expands abbreviations, reads numbers and currencies aloud, and disambiguates heteronyms from context (“read” in the past versus present tense). Rule engines handle predictable patterns; small classifiers or LLM-assisted TN cover messy user-generated text. G2P maps spellings to phonemes, which is critical for names and loanwords missing from lexicons. Multilingual systems may skip explicit phonemes and operate on subword tokens learned jointly with audio, but explicit G2P still helps with rare words in English and in morphologically rich languages.
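A rule-based normalization pass can be sketched in a few lines. The example below is a minimal illustration, not a production normalizer: the function names, the "St." heuristic, and the tiny unit list are assumptions made for this sketch, and it only handles integers from 0 to 999.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 in US English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_st(text: str) -> str:
    """Resolve "St." with a simple positional heuristic (an assumption)."""
    # Preceded by a capitalized word: treat as a street name ("Oak St.").
    text = re.sub(r"\b([A-Z][a-z]+) St\.", r"\1 Street", text)
    # Otherwise, followed by a capitalized word: treat as "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Anything left defaults to "Street".
    return re.sub(r"\bSt\.", "Street", text)


def normalize(text: str) -> str:
    """Expand "St.", short integers, and the unit "ft" into speakable words."""
    text = expand_st(text)
    # Dates, currencies, and decimals need dedicated rules that run
    # before this generic integer pass; they are out of scope here.
    text = re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # The abbreviation's trailing period is dropped along with it.
    return re.sub(r"\bft\b\.?", "feet", text)
```

For instance, `normalize("Turn right on Oak St. in 200 ft")` yields a fully spelled-out instruction, while "St." before a capitalized name such as "Louis" is read as "Saint". Real systems layer many more rules and usually add a classifier for cases the heuristics get wrong.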

Acoustic models: from spectrograms to mel codes

Classical concatenative TTS selected short recorded units (diphones or half-phones) and blended them — intelligible but brittle on out-of-vocabulary words. Parametric systems (HMM-based) predicted spectral envelopes; naturalness lagged behind human recordings. Neural TTS changed the tradeoff:

  • Autoregressive seq2seq (Tacotron, Tacotron 2): an encoder reads text; a decoder predicts mel spectrogram frames one step at a time. High quality, slower inference.
  • Non-autoregressive (FastSpeech, FastSpeech 2): a duration predictor assigns each input token a frame count, and a length regulator repeats its embedding that many times so the whole mel sequence can be generated in parallel at lower latency.
  • End-to-end flow / VITS-style models: combine acoustic and vocoder training with adversarial losses and stochastic duration, producing strong single-speaker voices from modest data.
  • Large-scale zero-shot and multilingual TTS (VALL-E, XTTS, commercial APIs): condition on speaker embeddings or short reference clips for zero-shot timbre transfer, in some systems across languages.
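The duration-based expansion used by non-autoregressive models such as FastSpeech can be illustrated without any deep-learning framework. The sketch below uses plain Python lists in place of tensors; the function names and the `alpha` speed factor are illustrative assumptions, and a real model would predict the durations rather than receive them as input.

```python
from typing import List, Sequence


def scale_durations(durations: Sequence[int], alpha: float) -> List[int]:
    """Scale per-token frame counts; alpha > 1 slows speech, alpha < 1 speeds it up."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Round half up so results do not depend on banker's rounding.
    return [int(d * alpha + 0.5) for d in durations]


def length_regulate(token_vectors: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each token vector by its duration to get frame-level inputs."""
    if len(token_vectors) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames: List[List[float]] = []
    for vector, frames_for_token in zip(token_vectors, durations):
        if frames_for_token < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector for every frame so frames can be edited independently.
        frames.extend(list(vector) for _ in range(frames_for_token))
    return frames
```

The output length equals the sum of the durations, so it varies per utterance. Because every frame position is known up front, a decoder can process all frames in parallel instead of one step at a time.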

Vocoders: mel spectrograms to waveforms

Acoustic models usually output mel spectrograms — compressed time-frequency representations aligned with human hearing. A vocoder inverts mel to audio: WaveNet and WaveGlow offered high fidelity at high cost; HiFi-GAN and BigVGAN deliver real-time neural vocoding on GPUs and some NPUs. Mismatch between acoustic-model mel statistics and vocoder training data causes metallic buzzing — always pair or jointly fine-tune stacks.
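One inexpensive safeguard against acoustic-vocoder mismatch is to compare the mel extraction settings of both components before deployment. The sketch below assumes a hypothetical `MelConfig` record; the field names and the sample values are illustrative and do not come from any specific toolkit.

```python
from dataclasses import dataclass, fields
from typing import Dict, Tuple


@dataclass(frozen=True)
class MelConfig:
    """Settings that must agree between acoustic model and vocoder (illustrative)."""
    sample_rate: int
    n_fft: int
    hop_length: int
    win_length: int
    n_mels: int
    fmin: float
    fmax: float
    log_base: str  # for example "e" or "10"


def mel_mismatches(acoustic: MelConfig,
                   vocoder: MelConfig) -> Dict[str, Tuple[object, object]]:
    """Return every field whose value differs, as (acoustic, vocoder) pairs."""
    return {
        f.name: (getattr(acoustic, f.name), getattr(vocoder, f.name))
        for f in fields(MelConfig)
        if getattr(acoustic, f.name) != getattr(vocoder, f.name)
    }


# Hypothetical values for demonstration only.
acoustic_cfg = MelConfig(22050, 1024, 256, 1024, 80, 0.0, 8000.0, "e")
vocoder_cfg = MelConfig(22050, 1024, 300, 1024, 80, 0.0, 8000.0, "e")
problems = mel_mismatches(acoustic_cfg, vocoder_cfg)
```

Here `problems` flags only `hop_length`. An empty result does not guarantee clean audio, since value normalization and training data also matter, but a non-empty one is a reliable reason to stop and fix the stack.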

Prosody, SSML, and controllable speech

Speech Synthesis Markup Language (SSML) tags let authors insert pauses, spell out acronyms, or shift emphasis (<emphasis>, <break time="500ms"/>). Neural models also accept prosody embeddings — pitch and energy curves, speaking rate scalars, or style tokens (“cheerful”, “news anchor”). Controllability matters for IVR menus (consistent pacing) and games (bark lines with varied emotion) but increases annotation cost during training.
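SSML is plain XML, so a few helper functions are enough to assemble it safely. The helpers below are a sketch with hypothetical names. They emit a bare `<speak>` root, whereas some engines expect version or namespace attributes on it, and `interpret-as="characters"` is a widely supported but engine-dependent value.

```python
from xml.sax.saxutils import escape

EMPHASIS_LEVELS = {"strong", "moderate", "reduced", "none"}


def plain(text: str) -> str:
    """Escape literal text so characters like & and < stay valid XML."""
    return escape(text)


def pause(milliseconds: int) -> str:
    """Insert a silent break of the given length."""
    return f'<break time="{int(milliseconds)}ms"/>'


def emphasize(text: str, level: str = "moderate") -> str:
    """Wrap text in an emphasis element with one of the standard levels."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def spell_out(text: str) -> str:
    """Ask the engine to read the text character by character."""
    return f'<say-as interpret-as="characters">{escape(text)}</say-as>'


def speak(*parts: str) -> str:
    """Join already-built fragments under a single root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    plain("Turn right on "),
    emphasize("Oak Street", "strong"),
    pause(500),
    plain(" Gate code "),
    spell_out("B7"),
)
```

Building fragments through functions rather than string templates keeps user-supplied text escaped and makes it harder to emit unbalanced tags.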

Batch vs streaming deployment

Batch TTS synthesizes entire documents or chapters offline — audiobooks, podcast generation, nightly report narration. You can afford autoregressive quality and heavy vocoders because latency is measured in minutes, not milliseconds.

Streaming TTS feeds clauses or phoneme chunks to the client as they are generated — voice assistants, live caption reading, multiplayer game callouts. Techniques include chunked inference, distilled student models, and vocoders with causal convolutions. Measure time-to-first-audio (TTFA) and real-time factor (RTF) alongside MOS: a voice that sounds great but starts 800 ms late feels broken in conversation.
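TTFA and RTF are straightforward to measure around any chunked audio stream. The sketch below assumes the stream yields raw mono 16-bit PCM bytes; `measure_stream` and its parameters are hypothetical names, and the injectable `clock` exists only to make the measurement testable.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(chunks: Iterable[bytes],
                   sample_rate: int,
                   bytes_per_sample: int = 2,
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Dict[str, Optional[float]]:
    """Consume an audio stream and report TTFA, audio duration, and RTF."""
    start = clock()
    ttfa: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if ttfa is None and chunk:
            # Time until the first non-empty chunk arrives.
            ttfa = clock() - start
        total_bytes += len(chunk)
    elapsed = clock() - start
    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    # RTF below 1.0 means audio is produced faster than it plays back.
    rtf = elapsed / audio_seconds if audio_seconds else float("inf")
    return {"ttfa_s": ttfa, "audio_s": audio_seconds, "rtf": rtf}
```

In practice the generator passed as `chunks` would wrap the synthesis call, so the time spent producing each chunk is included in both numbers. Reporting percentiles over many utterances is more informative than a single run.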

Hybrid architectures run a fast streaming backbone for the first phrase, then refine prosody on a second pass — useful when UX demands immediate feedback but marketing wants polished timbre on longer utterances.

Voice cloning, consent, and misuse

Few-shot and zero-shot TTS can mimic a speaker from 3–30 seconds of reference audio. Legitimate uses include personalized accessibility voices, dubbing with actor consent, and brand-consistent agents. Misuse spans deepfake fraud, non-consensual impersonation, and bypassing voice biometric authentication.

Production guardrails: verify speaker consent and identity before enrollment; watermark synthesized audio; rate-limit cloning APIs; log synthesis requests; refuse public-figure voices without contractual rights; disclose synthetic speech in customer-facing products. Regulation (EU AI Act transparency rules, state deepfake laws) is tightening — treat voice data like biometrics, not disposable training fodder.
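Three of these guardrails (consent checks, rate limiting, and request logging) can share one enforcement point in front of the cloning API. The class below is a simplified in-memory sketch under stated assumptions: `CloningGate`, its method names, and the decision strings are hypothetical, and a real deployment would persist consent records and logs and verify identity out of band.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Set, Tuple


class CloningGate:
    """Allow a cloning request only with recorded consent and under a rate limit."""

    def __init__(self, max_requests: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._consented: Set[str] = set()
        self._history: Dict[str, Deque[float]] = defaultdict(deque)
        self.audit_log: List[Tuple[float, str, str, str]] = []

    def record_consent(self, speaker_id: str) -> None:
        """Mark a speaker as having given verified consent."""
        self._consented.add(speaker_id)

    def authorize(self, user_id: str, speaker_id: str) -> bool:
        """Decide on one request and append the decision to the audit log."""
        now = self._clock()
        recent = self._history[user_id]
        # Drop timestamps that have aged out of the sliding window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if speaker_id not in self._consented:
            decision = "denied:no_consent"
        elif len(recent) >= self.max_requests:
            decision = "denied:rate_limited"
        else:
            recent.append(now)
            decision = "allowed"
        self.audit_log.append((now, user_id, speaker_id, decision))
        return decision == "allowed"
```

Denied requests are logged as well as allowed ones, because repeated denials are often the earliest signal of attempted abuse.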

Evaluating TTS quality

Unlike ASR’s word error rate, TTS lacks a single objective gold standard. Teams combine subjective listening tests with proxy metrics:

  • Mean Opinion Score (MOS): human raters score naturalness from 1 to 5 on held-out sentences; the industry benchmark, but expensive and panel-dependent.
  • MOSNet / UTMOS: neural estimators that predict MOS from audio; useful for regression testing, not a full substitute for human ears.
  • Mel cepstral distortion (MCD): frame-level distance between the mel-cepstral coefficients of synthesized and reference audio, usable when ground-truth recordings exist (voice conversion, resynthesis tasks).
  • Intelligibility probes: feed synthesized audio into a strong ASR model; high WER on TTS output signals garbled consonants or dropped endings.
  • Speaker similarity: cosine similarity between speaker embeddings of synthesized and reference audio for cloning tasks; confirms the right voice, not just smooth audio.
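Three of these metrics reduce to short formulas once the ratings, transcripts, and embeddings are in hand. The sketch below uses only the standard library; the function names are illustrative, the confidence interval uses a normal approximation, and the ASR transcript and speaker embeddings are assumed to come from models of your choosing.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return the mean rating and the half-width of an approximate 95% interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings for an interval")
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mean(ratings), half_width


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

For the ASR round trip, `reference` is the text sent to TTS and `hypothesis` is what the recognizer heard. Text should be normalized the same way on both sides, otherwise "200" versus "two hundred" counts as an error that listeners would never notice.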

Evaluate on diverse sentence types: short commands, long compound sentences, numbers, URLs, emoji-adjacent symbols, and domain jargon. A model that nails news sentences may stumble on warehouse SKUs or crypto addresses.
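A quick way to check that an evaluation set covers these sentence types is to tag each sentence with coarse categories and look for gaps. The patterns and thresholds below are rough heuristics chosen for this sketch, not a standard taxonomy, and they will produce some false positives (for example, an ISO date also matches the SKU-like pattern).

```python
import re
from typing import Iterable, Set

CATEGORY_PATTERNS = {
    "number": re.compile(r"\d"),
    "url": re.compile(r"https?://|www\."),
    # Short capitalized token ending in a period, such as "Dr." or "St.".
    "abbreviation": re.compile(r"\b[A-Z][a-z]{1,3}\."),
    # Hyphenated uppercase or digit groups, such as "WH-4471-B".
    "sku_code": re.compile(r"\b[A-Z0-9]{2,}(?:-[A-Z0-9]+)+\b"),
}


def categorize(sentence: str) -> Set[str]:
    """Return the set of coarse categories a sentence falls into."""
    tags = {name for name, pattern in CATEGORY_PATTERNS.items()
            if pattern.search(sentence)}
    word_count = len(sentence.split())
    if word_count <= 4:
        tags.add("short_command")
    if word_count >= 25:
        tags.add("long_sentence")
    return tags


def missing_categories(sentences: Iterable[str], required: Set[str]) -> Set[str]:
    """Return required categories that no sentence in the set exercises."""
    seen: Set[str] = set()
    for sentence in sentences:
        seen |= categorize(sentence)
    return required - seen
```

Running `missing_categories` in CI against the evaluation list gives an early warning when someone trims the set down to easy sentences.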

Worked example: Harbor Fleet turn-by-turn navigation voice

Harbor Fleet’s driver app needed spoken turn instructions over cab noise without recording every street name manually. Constraints: English US primary, Spanish secondary, TTFA under 300 ms on mid-range Android, offline fallback when tunnels kill connectivity.

Pipeline design

  1. Text source: the routing engine emits structured strings such as “Turn right on Oak Street in two hundred feet.”
  2. Normalization: expand abbreviations (St. → Street), read distances with locale rules, and pronounce harbor zone codes via a custom lexicon.
  3. Voice: single-speaker FastSpeech 2 + HiFi-GAN trained on 12 hours of approved talent audio; Spanish via a multilingual XTTS fine-tune that shares the prosody style.
  4. Streaming: synthesize the first clause while GPS fetches the next; cache the top 200 street names per metro as pre-rendered WAV chunks so cache hits skip synthesis entirely.
  5. Fallback: on-device Piper model when the GPU path is unavailable; slightly lower MOS but deterministic offline behavior.
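The cache, online, and offline paths in the design above can be expressed as one small dispatch routine. The sketch below is an assumption about how such a layer might look rather than Harbor Fleet's actual code: `NavigationVoice`, the callables it receives, and the exception types treated as connectivity failures are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Synthesizer = Callable[[str], bytes]


def cache_key(text: str) -> str:
    """Normalize case and whitespace so equivalent phrases share one entry."""
    return " ".join(text.lower().split())


class NavigationVoice:
    """Try pre-rendered audio first, then the online model, then the offline one."""

    def __init__(self, cache: Dict[str, bytes],
                 online: Synthesizer, offline: Synthesizer) -> None:
        self._cache = cache
        self._online = online
        self._offline = offline

    def speak(self, text: str) -> Tuple[str, bytes]:
        """Return (source, audio), where source names the path that produced it."""
        key = cache_key(text)
        if key in self._cache:
            return "cache", self._cache[key]
        try:
            return "online", self._online(text)
        except (ConnectionError, TimeoutError):
            # Connectivity loss, for example in a tunnel: use the on-device model.
            return "offline", self._offline(text)
```

Returning the source alongside the audio makes it easy to log how often each path is used, which in turn shows whether the cached phrase list matches what drivers actually hear.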

Results

Human MOS on held-out navigation sentences rose from 3.6 (stock cloud API) to 4.1 (custom FastSpeech 2 stack). Median TTFA was 210 ms online and 90 ms on cache hits. ASR-roundtrip WER was 6.1% at 65 dB cab noise, which the team judged acceptable for safety-critical directions. The team rejected full zero-shot cloning: the legal team required talent contracts, and cloned voices without per-session disclosure failed compliance review.

Approach decision table

Scenario | Recommended approach | Why
Fast prototype, many languages | Cloud TTS API (Google, Amazon Polly, Azure) | Low setup; pay per character; SSML support built in
Branded single-speaker product voice | Record talent; train FastSpeech 2 + HiFi-GAN or VITS | Consistent identity; no per-character API cost at scale
Live voice assistant | Streaming non-autoregressive model + causal vocoder | TTFA and RTF beat autoregressive Tacotron-class latency
Audiobook or long-form narration | Autoregressive or LLM-augmented TTS with batch vocoder | Prioritize MOS and breath pacing over milliseconds
Offline mobile / embedded | Distilled Piper or on-device edge models | No network; smaller footprints; predictable battery use
Personalized cloning with consent | Embedding-based adapter on frozen base TTS | Few minutes of enrollment audio; isolate per-user weights
Regulated health or finance IVR | On-prem model + SSML templates + disclosure prefix | Data residency; scripted prosody reduces hallucinated wording

Common pitfalls

  • Skipping normalization — raw “Dr.” and “3/4” tokens produce embarrassing misreads; invest in TN before tuning acoustics.
  • Acoustic–vocoder mismatch — training mels with different hop sizes or normalization than the vocoder expects yields persistent hiss.
  • Judging quality on three demo sentences — models overfit cheerful marketing copy; test lists, numbers, and rare proper nouns.
  • Ignoring speaking rate drift — long paragraphs accelerate or flatten without duration controls; use SSML breaks or duration predictors.
  • Cloning without consent — legal and reputational risk; implement enrollment verification and abuse monitoring from day one.
  • Optimizing MOS only in quiet labs — deploy with cab noise, Bluetooth compression, and cheap phone speakers in the eval loop.
  • Monolingual TN in multilingual products — date and currency rules differ; Spanish prosody on English TN output sounds foreign immediately.
  • No synthetic speech disclosure — users trust voice interfaces; hidden TTS in outbound calls erodes trust and may violate law.

Practitioner checklist

  • Define TTFA, RTF, and MOS targets for your UX mode (streaming vs batch).
  • Build a normalization + G2P layer with custom lexicons for product vocabulary.
  • Pair acoustic models and vocoders trained on compatible mel statistics.
  • Curate eval sets covering numbers, abbreviations, names, and edge-case symbols.
  • Run ASR-roundtrip intelligibility tests alongside human MOS panels.
  • Cache or pre-render high-frequency phrases for latency-critical paths.
  • Document talent consent, retention, and cloning enrollment policies.
  • Watermark or log synthetic audio for fraud investigation workflows.
  • Test playback on target devices and codecs, not only studio headphones.
  • Disclose synthetic speech in user-facing experiences where regulation or trust requires it.

Key takeaways

  • TTS inverts ASR: normalize text, predict acoustic features, vocode to waveform.
  • Neural stacks trade autoregressive quality against non-autoregressive speed — pick based on streaming vs batch needs.
  • MOS and ASR-roundtrip tests complement each other; neither alone captures production readiness.
  • Voice cloning demands consent, disclosure, and abuse controls — not just better embeddings.
  • Domain lexicons and normalization often beat bigger models for SKU-heavy or address-heavy products.
