
Speech recognition explained

A customer leaves a voicemail asking to change a shipping address. A clinician dictates notes between appointments. A voice assistant hears “set a timer for ten minutes.” In every case, software must convert raw audio waveforms into accurate text. Automatic speech recognition (ASR) is that conversion pipeline: it segments sound, extracts acoustic features, maps them to phonemes or characters, and applies language constraints so “recognize speech” does not become “wreck a nice beach.” ASR approaches span classical hidden Markov models, end-to-end neural networks trained with CTC and RNN-T losses, and large transformer models like Whisper that handle dozens of languages from a single checkpoint. This guide covers the ASR signal path, evaluation with word error rate (WER), batch versus streaming deployment, a Harbor Support call-transcription worked example, an approach decision table, common pitfalls, a practitioner checklist, and key takeaways. It sits alongside the related guides on NLP fundamentals and multimodal AI.

The ASR pipeline: from waveform to transcript

At a high level, every ASR system performs three jobs: front-end processing turns the microphone signal into a compact representation the model can learn from; acoustic modeling maps those features to linguistic units (phonemes, characters, or byte-pair tokens); and decoding chooses the most likely word sequence, often with help from a language model that knows “close the account” is more probable than “clothes the account.”

Audio front end

Raw audio arrives as a time series sampled at rates such as 8 kHz (narrowband telephony), 16 kHz (wideband speech), or 44.1 kHz (music). Most ASR systems resample to 16 kHz mono, apply pre-emphasis, and slice the stream into overlapping frames (typically 25 ms windows with a 10 ms hop). Each frame becomes one column of a mel spectrogram or one MFCC vector: a log-scaled summary of energy in frequency bands tuned to human hearing. Mel features discard phase information most models do not need while preserving the cues that distinguish consonants from vowels.
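The framing step can be sketched in a few lines of plain Python. This is a minimal illustration, not a production front end: the pre-emphasis coefficient 0.97 and the HTK-style mel formula are common conventions assumed here, and the function names are hypothetical. A real pipeline would follow this with a windowed FFT and a mel filterbank.

```python
import math

def pre_emphasis(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1] to boost high frequencies.
    The 0.97 coefficient is a common default, assumed here."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[i] - coeff * samples[i - 1] for i in range(1, len(samples))
    ]

def frame_signal(samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice a signal into overlapping frames; trailing samples that do
    not fill a whole window are dropped."""
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    return [
        samples[start:start + win]
        for start in range(0, len(samples) - win + 1, hop)
    ]

def hz_to_mel(hz):
    """HTK-style mel scale, one of several conventions in use."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

# One second of a 440 Hz tone at 16 kHz.
one_second = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = frame_signal(pre_emphasis(one_second))
print(len(frames), len(frames[0]))  # 98 400
```

One second of audio yields 98 frames rather than 100 because the last partial windows are discarded; padding the signal is the usual alternative.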

Acoustic and language models

Classical systems split the problem: a Gaussian-mixture or deep neural acoustic model scores frame-to-phoneme alignments, while a separate n-gram language model scores word sequences. End-to-end neural ASR merges both into one network trained on paired (audio, transcript) data. The network learns to emit characters or subword tokens directly; a lightweight language model or transformer decoder still helps on rare words and proper nouns.
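To make the role of the language model concrete, here is a minimal sketch of rescoring an n-best list by adding a weighted language-model score to each hypothesis score, in the spirit of shallow fusion. Everything numeric is invented for illustration: the acoustic scores, the toy language-model table, and the `lm_weight` value are assumptions, and real systems usually apply the fusion inside beam search rather than on a finished list.

```python
def rescore(nbest, lm_logprob, lm_weight=0.5):
    """Return the (text, acoustic_logprob) pair that maximizes
    acoustic_logprob + lm_weight * lm_logprob(text)."""
    return max(nbest, key=lambda hyp: hyp[1] + lm_weight * lm_logprob(hyp[0]))

# Hypothetical language-model log-probabilities for two phrases.
TOY_LM = {
    "close the account": -4.0,
    "clothes the account": -11.0,
}

def toy_lm(text):
    # Unseen phrases get a low fallback score (an arbitrary choice).
    return TOY_LM.get(text, -20.0)

# Hypothetical acoustic scores: the wrong phrase sounds slightly better.
nbest = [
    ("clothes the account", -6.8),
    ("close the account", -7.1),
]

best_text, _ = rescore(nbest, toy_lm)
print(best_text)  # close the account
```

With `lm_weight=0.0` the acoustically preferred "clothes the account" wins, which is the failure the language model is there to prevent.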

Decoding objectives

CTC (Connectionist Temporal Classification) — allows the network to output blank tokens and collapse repeats; good for offline transcription when alignment is unknown.
RNN-T (transducer) — pairs an encoder with a prediction network and a joint network; supports streaming because it can emit tokens before seeing the full utterance.
Attention-based seq2seq — encoder-decoder with cross-attention; flexible but historically harder to stream; modern variants add chunk-wise attention for low latency.
Whisper-style transformers — encode log-mel spectrogram frames with a transformer encoder and decode text autoregressively; strong out-of-the-box multilingual performance at the cost of compute.
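The CTC collapse rule from the first item above is easy to show on a toy input. This sketch assumes greedy (best-path) decoding where the model has already picked one label per frame; the `_` blank symbol and the function name are illustrative choices.

```python
from itertools import groupby

BLANK = "_"  # illustrative blank symbol

def ctc_collapse(frame_labels, blank=BLANK):
    """Greedy CTC post-processing: merge consecutive repeats first,
    then drop blank tokens."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)

# Twelve frames of per-frame argmax labels.
print(ctc_collapse(list("hh_e_ll_lo__")))  # hello
# Without a blank between the two l's, the double letter is lost.
print(ctc_collapse(list("hhelllo")))       # helo
```

The second call shows why the blank matters: it is the only way for the network to emit the same character twice in a row.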

Evaluating ASR: word error rate and beyond

ASR quality is measured primarily by word error rate (WER):

WER = (S + D + I) / N

where S is substitutions, D deletions, I insertions, and N is the number of words in the reference transcript. A WER of 5% means five errors per hundred reference words. WER is computed after text normalization: lowercasing, removing punctuation, and sometimes expanding numbers (“twenty three” vs “23”) so scoring reflects recognition, not formatting.
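As a rough sketch of the formula, the function below normalizes both strings and computes the word-level edit distance, which equals S + D + I for the best alignment. The normalization here (lowercasing and stripping punctuation except apostrophes) is a simplified assumption; it does not expand numbers, and production scoring would normally use an established normalizer.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces,
    and split into words. Number expansion is deliberately left out."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level Levenshtein distance."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("Please change my shipping address.",
                      "please change the shipping dress"))   # 0.4
print(word_error_rate("recognize speech", "wreck a nice beach"))  # 2.0
```

The second example shows that WER can exceed 100% when the hypothesis contains more wrong words than the reference has words.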

WER alone misleads on short commands. For intent-driven voice UIs, track keyword accuracy or slot-filling F1 on the entities that matter. For accessibility captions, measure real-time factor (RTF) — processing time divided by audio duration — alongside WER so you do not ship accurate but unusably slow transcripts. Report WER slices by accent, noise condition, and domain vocabulary; aggregate numbers hide failure modes on underrepresented speakers.
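A small sketch of sliced reporting, assuming per-utterance error counts have already been computed. The records, field names, and slice labels below are made up for illustration. Note that each slice's WER pools errors and reference words rather than averaging per-utterance WERs, so short utterances do not dominate.

```python
from collections import defaultdict

def wer_by_slice(utterances, key):
    """Pool errors and reference words per slice, then divide."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for utt in utterances:
        errors[utt[key]] += utt["errors"]
        words[utt[key]] += utt["ref_words"]
    return {name: errors[name] / words[name] for name in words}

def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# Hypothetical scored utterances.
utterances = [
    {"channel": "headset", "errors": 2, "ref_words": 100},
    {"channel": "headset", "errors": 4, "ref_words": 100},
    {"channel": "phone", "errors": 9, "ref_words": 50},
    {"channel": "phone", "errors": 3, "ref_words": 10},
]

overall = sum(u["errors"] for u in utterances) / sum(
    u["ref_words"] for u in utterances
)
print(f"overall {overall:.1%}")  # overall 6.9%
for channel, wer in sorted(wer_by_slice(utterances, "channel").items()):
    print(f"{channel} {wer:.1%}")  # headset 3.0%, then phone 20.0%
print(real_time_factor(45.0, 90.0))  # 0.5
```

In this toy data the 6.9% aggregate looks acceptable while the phone slice sits at 20%, which is the kind of gap an aggregate number hides.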

Batch vs streaming ASR

Batch (offline) ASR processes complete files such as voicemails, podcasts, and meeting recordings. It can look ahead across the full utterance, run beam search with a large language model, and apply second-pass rescoring. Latency ranges from seconds to minutes, and accuracy is typically the highest of the deployment modes. Whisper, NVIDIA NeMo offline models, and cloud batch APIs target this mode.

Streaming ASR emits partial hypotheses as audio arrives — voice assistants, live captions, dictation. The model trades lookahead for latency, typically targeting sub-300 ms end-to-end delay. RNN-T and chunked transformer encoders dominate here. Partial results flicker (“set a” then “set a timer”); UI layers stabilize display with endpoint detection and confidence thresholds.
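One simple way to reduce flicker, sketched below, is to display only the words on which the last few partial hypotheses agree. This is a heuristic chosen for illustration, not the stabilization method of any particular product, and it complements rather than replaces endpoint detection and confidence thresholds. The `window` parameter and function names are assumptions.

```python
def common_prefix(a, b):
    """Longest shared word prefix of two word lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def stable_prefix(partials, window=2):
    """Words that agree across the last `window` partial hypotheses.
    Returns an empty list until enough partials have arrived."""
    recent = [p.split() for p in partials[-window:]]
    if len(recent) < window:
        return []
    prefix = recent[0]
    for words in recent[1:]:
        prefix = common_prefix(prefix, words)
    return prefix

partials = ["set a", "set a time", "set a timer for", "set a timer for ten"]
for n in range(1, len(partials) + 1):
    print(" ".join(stable_prefix(partials[:n])))
# prints an empty line, then "set a", "set a", "set a timer for"
```

The unstable word "time" never reaches the display because the next partial revises it to "timer". A larger `window` trades more stability for more delay.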

Hybrid architectures run a fast streaming model for live feedback and a slower batch pass for final transcripts — common in medical dictation where clinicians need immediate visual confirmation but charts require polished text.

Worked example: Harbor Support call transcription

Harbor Support receives 12,000 inbound calls per month. Agents currently replay voicemails manually to fill CRM tickets. The team wants automatic transcripts for triage, not verbatim legal records.

Requirements

Audio: 8 kHz narrowband phone audio, average 90-second messages, background HVAC noise.
Latency: under five minutes post-call is acceptable (batch mode).
Vocabulary: product SKUs, city names, competitor brands — roughly 2,000 domain terms.
Privacy: calls contain PII; processing must stay in-region; retention 30 days.

Pipeline design

Ingest — S3 trigger on uploaded WAV; normalize to 16 kHz mono FLAC.
VAD — voice activity detection strips leading/trailing silence to cut compute.
ASR — start with Whisper medium fine-tuned on 200 hours of anonymized Harbor calls (LoRA adapters on decoder layers only).
Post-processing — inverse text normalization (ITN) for order numbers; custom phrase list boosts SKU recognition via shallow fusion at decode time.
Quality gate — if average token log-probability falls below threshold, route to human review instead of auto-ticketing.
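The quality gate in the last step might look like the sketch below. It assumes the recognizer exposes a log-probability for each decoded token; the threshold of -0.6 and the route names are hypothetical and would need tuning against human-reviewed transcripts.

```python
def average_logprob(token_logprobs):
    """Mean token log-probability; an empty transcript scores -inf so
    that it always fails the gate."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)

def route(token_logprobs, threshold=-0.6):
    """Send confident transcripts to auto-ticketing, the rest to review.
    The default threshold is a placeholder, not a recommended value."""
    if average_logprob(token_logprobs) >= threshold:
        return "auto_ticket"
    return "human_review"

print(route([-0.1, -0.2, -0.05]))  # auto_ticket
print(route([-1.2, -0.9, -0.3]))   # human_review
print(route([]))                   # human_review
```

Averaging over tokens keeps long and short voicemails comparable; a sum would penalize long messages regardless of quality.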

Results

Base Whisper medium scored 14.2% WER on a held-out test set. After LoRA fine-tuning on domain audio and adding the SKU phrase list, WER dropped to 8.7%. Auto-ticketing covered 71% of voicemails; the remainder was flagged for agents. Median processing latency was 22 seconds per minute of audio on a single A10 GPU, which is acceptable for batch. The team rejected a streaming RNN-T deployment because agents did not need live captions and streaming cost about 2 points of WER on noisy narrowband audio.
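The figures above can be turned into a quick back-of-envelope check. The inputs come from the worked example; the monthly GPU estimate additionally assumes that the median latency is representative of the average and that the full 90 seconds of each message is processed, so treat it as a rough upper-level estimate rather than a measurement.

```python
base_wer, tuned_wer = 0.142, 0.087
absolute_drop = base_wer - tuned_wer      # in WER points
relative_drop = absolute_drop / base_wer  # fraction of errors removed

calls_per_month = 12_000
auto_ticketed = round(calls_per_month * 0.71)
flagged = calls_per_month - auto_ticketed

rtf = 22 / 60                             # processing s per audio s
audio_hours = calls_per_month * 90 / 3600 # 90 s average message
gpu_hours = audio_hours * rtf             # rough monthly GPU time

print(f"absolute drop: {absolute_drop * 100:.1f} points")  # 5.5 points
print(f"relative drop: {relative_drop:.1%}")               # 38.7%
print(f"auto-ticketed: {auto_ticketed}, flagged: {flagged}")  # 8520, 3480
print(f"RTF: {rtf:.2f}")                                   # 0.37
print(f"audio: {audio_hours:.0f} h, GPU: {gpu_hours:.0f} h")  # 300 h, 110 h
```

On these assumptions a single GPU running about 110 hours a month covers the whole voicemail volume, which supports the batch-mode decision.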

Approach decision table

Scenario	Recommended approach	Why
Quick multilingual prototype	Whisper API or open-weight `large-v3`	Strong zero-shot WER across 90+ languages without custom training
Low-latency voice commands	Streaming RNN-T on-device	Sub-200 ms partial results; edge inference avoids round trips
Domain-heavy vocabulary (medical, legal)	Fine-tuned CTC or Whisper + phrase boosting	Generic models miss rare terms; 50–500 hours of in-domain audio pays off
Phone / narrowband audio	Train or fine-tune on 8 kHz data; telephony augmentation	Models trained only on clean podcast audio fail on PSTN bandwidth
Speaker diarization (“who said what”)	ASR + separate diarization model (pyannote, NeMo)	ASR alone does not label speakers; combine pipelines
Regulated data, no cloud egress	On-prem Whisper or Conformer with model serving on private GPUs	APIs may violate data-residency requirements

Common pitfalls

Training on clean speech, deploying on noise — augment with background babble, music, and codec compression; measure WER in realistic SNR ranges.
Ignoring text normalization mismatch — if training transcripts use spoken form (“two p m”) but evaluation expects written form (“2pm”), WER inflates artificially; align normalization pipelines.
Leakage through overlapping speakers — diarization errors assign words to the wrong agent; fix speaker segmentation before scoring agent performance.
Underestimating punctuation and casing — downstream NER and search break when transcripts lack sentence boundaries; add a light post-processor or train with punctuation in labels.
Streaming UI without endpoint detection — users see unstable partial text; implement voice activity detection and hold partials until clause boundaries.
No human fallback — ASR at 8% WER still means roughly one error for every twelve reference words; route low-confidence segments to review for high-stakes workflows.
Storing raw audio without consent — transcription pipelines implicate privacy law; document retention, opt-out, and on-device options.
Chasing WER on English-only benchmarks — multilingual products need per-locale test sets; code-switching (Spanglish) breaks single-language models.
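For the first pitfall, noise augmentation at a controlled signal-to-noise ratio can be sketched as follows. The signals here are synthetic stand-ins (a sine tone for speech, Gaussian samples for noise) and the function names are illustrative; real augmentation would draw from recorded babble, music, or codec-degraded audio.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise RMS ratio equals snr_db,
    then add it to the speech sample by sample."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise is silent")
    gain = rms(speech) / (10 ** (snr_db / 20)) / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, mixed):
    """Recover the SNR from the clean and mixed signals."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 20 * math.log10(rms(speech) / rms(residual))

rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * n / 8000) for n in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

mixed = mix_at_snr(speech, noise, snr_db=10)
print(round(measured_snr_db(speech, mixed), 3))  # 10.0
```

Sweeping `snr_db` over the range seen in deployment, for both training augmentation and test sets, gives WER numbers that reflect realistic conditions.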

Practitioner checklist

Define latency mode upfront: batch file transcription vs streaming partial results.
Collect or source in-domain audio matching deployment codec, sample rate, and noise.
Normalize reference and hypothesis text consistently before computing WER.
Slice evaluation by accent, channel type, and utterance length — not one aggregate number.
Augment training with speed perturbation, SpecAugment, and background noise.
Maintain a domain phrase list or shallow fusion for product names and jargon.
Log confidence scores; route bottom decile to human review or re-prompt.
Separate ASR from diarization when multi-speaker attribution matters.
Measure RTF alongside WER so accuracy gains do not violate latency budgets.
Document audio retention policy and offer on-device paths where regulation requires it.

Key takeaways

ASR converts audio frames (mel spectrograms) into text through acoustic modeling and decoding — classical pipelines split components; modern systems learn end-to-end.
WER is the standard metric, but slice it by condition and pair with latency (RTF) for production voice products.
Batch models like Whisper maximize accuracy on full files; streaming RNN-T models optimize partial latency for live interaction.
Domain fine-tuning, phrase boosting, and noise-matched training close most gaps between generic APIs and specialized deployments.
Treat transcripts as probabilistic input to downstream NLP — always design human fallback and privacy controls around the audio pipeline.