Word embeddings explained
Before transformers and chat models, NLP teams faced a blunt problem: computers read text as discrete symbols, not meaning. A word embedding maps each token to a dense vector, typically 100–300 floats, so that words appearing in similar contexts land near each other in space. "King" and "queen" sit closer than "king" and "carburetor." That geometric structure powers search, clustering, recommendation, and the input layers of virtually every modern language model. This guide explains why one-hot encoding fails, how Word2Vec, GloVe, and FastText learn vectors, what the famous analogy arithmetic actually captures, where static embeddings break on polysemy, and how contextual embeddings replaced them. It closes with a worked example on Harbor Supply catalog search, a method decision table, common pitfalls, and a practitioner checklist. For measuring vector closeness, see our cosine similarity guide; for production-scale passage vectors, see LLM embeddings explained.
Why one-hot encoding is not enough
The naive representation is a one-hot vector: vocabulary size V, each word a corner of a V-dimensional hypercube with a single 1 and zeros elsewhere. For a 50,000-word vocabulary, every word is orthogonal to every other — cosine similarity is zero regardless of semantic relatedness. "Dog" and "puppy" are as far apart as "dog" and "quasar."
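The orthogonality claim is easy to check by hand. The following is a minimal sketch using a hypothetical three-word vocabulary (a real one would have tens of thousands of entries); the helper names `one_hot` and `cosine` are illustrative, not from any library.

```python
import math

def one_hot(index, vocab_size):
    """Return a one-hot vector with a single 1.0 at `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vocabulary, assumed for illustration only.
vocab = ["dog", "puppy", "quasar"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

dog_puppy = cosine(vectors["dog"], vectors["puppy"])
dog_quasar = cosine(vectors["dog"], vectors["quasar"])
print(dog_puppy, dog_quasar)  # 0.0 0.0
```

Any two distinct one-hot vectors share no non-zero position, so the result is the same no matter which pair is compared.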
One-hot vectors are also sparse and huge. Storing and multiplying them wastes memory and compute. Neural networks fed one-hot inputs must learn all relationships from scratch on every training run. Embeddings compress the vocabulary into a lower-dimensional, dense space where statistical regularities — synonyms, topical co-occurrence, morphological families — are already encoded.
The guiding hypothesis is distributional semantics: "You shall know a word by the company it keeps" (Firth, 1957). Words that appear in interchangeable contexts should have similar vectors. Embeddings operationalize that idea with linear algebra.
Word2Vec: predictive embeddings from context
Mikolov et al.'s Word2Vec (2013) trains embeddings by predicting context from a word, or a word from context, on large unlabeled corpora. Two architectures dominate:
Skip-gram
Given a center word, predict the surrounding words within a window (e.g. five tokens on each side). Each occurrence of a word yields one training pair per context word, so even a rare word receives several direct updates to its own vector instead of being averaged into a context. Skip-gram tends to work better for smaller datasets and rare tokens.
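To make the training data concrete, here is a small sketch of how (center, context) pairs could be generated from a token list. The function name `skipgram_pairs` and the example sentence are assumptions for illustration; real implementations add subsampling and dynamic window sizes.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With a window of 1, each interior token produces two pairs and each edge token produces one, so this five-token sentence yields eight pairs.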
CBOW (Continuous Bag of Words)
Average the vectors of context words and predict the center word. CBOW trains faster and smooths over context, which helps frequent words but can blur rare or nuanced terms.
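The CBOW input can be sketched the same way: collect the context words around each center word, then average their vectors. The helper names and the two-dimensional toy vectors below are hypothetical.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center_word) training examples."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, center))
    return examples

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

tokens = ["the", "quick", "brown", "fox"]
examples = cbow_examples(tokens, window=1)

# Toy embeddings, assumed for illustration only.
toy = {"quick": [1.0, 0.0], "fox": [0.0, 1.0]}
context, center = examples[2]            # (["quick", "fox"], "brown")
hidden = average([toy[w] for w in context])
print(context, "->", center, hidden)     # hidden is [0.5, 0.5]
```

The averaging step is where the smoothing comes from: the model sees one blended context vector per center word, not each context word separately.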
Both use a shallow neural network with a single hidden layer; the input-to-hidden weight matrix holds the embeddings. Training optimizes cross-entropy over millions of (word, context) pairs. Negative sampling speeds training by updating the true context word plus only a handful of "noise" words per positive example, instead of computing the full softmax over the vocabulary. Hierarchical softmax is an alternative tree-based approximation.
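As a rough illustration of negative sampling, the per-pair loss can be written as -log σ(u_o · v_c) - Σ_k log σ(-u_k · v_c), where v_c is the center vector, u_o the true context vector, and u_k the sampled noise vectors. The sketch below only evaluates that loss on hand-picked toy vectors; it does not implement sampling or gradient updates, and the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_loss(center_vec, context_vec, noise_vecs):
    """Negative-sampling loss for one positive pair and k noise words."""
    # Pull the true context word toward the center word.
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    # Push each sampled noise word away from the center word.
    for noise in noise_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, noise)))
    return loss

center = [1.0, 0.0]
good = sgns_loss(center, [2.0, 0.0], [[-2.0, 0.0]])   # context aligned, noise opposed
bad = sgns_loss(center, [-2.0, 0.0], [[2.0, 0.0]])    # context opposed, noise aligned
print(round(good, 4), round(bad, 4))
```

Only 1 + k output vectors are touched per positive pair, which is why this is far cheaper than normalizing over the whole vocabulary.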
The learned vectors capture syntactic regularities (verb tense, plural forms) and semantic clusters (countries, colors, professions). Dimensionality is a hyperparameter: 100–300 dimensions is typical; more dimensions fit more nuance but risk overfitting on small corpora.
GloVe: global co-occurrence statistics
GloVe (Global Vectors, Pennington et al., 2014) takes a different path. Instead of local prediction windows, it builds a word co-occurrence matrix across the corpus: how often word i appears near word j. The model factorizes this matrix so that the dot product of two word vectors, plus per-word bias terms, approximates the log of their co-occurrence count.
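The first stage, counting co-occurrences, can be sketched in a few lines. This is a simplified illustration on a made-up three-sentence corpus: it uses plain symmetric counts, whereas the published GloVe method also down-weights distant pairs, and it stops before the factorization step.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbor) pair occurs within `window`."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus, assumed for illustration only.
sentences = [
    "ice is cold and solid".split(),
    "steam is hot and gas".split(),
    "ice is solid water".split(),
]
counts = cooccurrence_counts(sentences, window=2)

# The regression target for a pair is the log of its count.
target = math.log(counts[("ice", "is")])
print(counts[("ice", "is")], counts[("ice", "solid")], counts[("steam", "solid")])
```

Training then fits vectors and biases so that w_i · w̃_j + b_i + b̃_j is close to this log count for every pair with a non-zero count.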
Intuitively, Word2Vec learns from streaming context pairs; GloVe aggregates global statistics first, then fits vectors. In practice, GloVe and Word2Vec often perform similarly on analogy and similarity benchmarks, though GloVe can be faster to train when the co-occurrence matrix fits in memory. Pre-trained GloVe files (Wikipedia + Gigaword, Common Crawl) remain popular baselines for academic reproduction.
FastText: subword information for morphology
Word2Vec and GloVe assign one vector per word type. That fails on out-of-vocabulary (OOV) tokens and morphologically rich languages. FastText (Facebook, 2016) represents each word as the sum of its character n-gram vectors. With boundary markers added, "where" decomposes into the trigrams <wh, whe, her, ere, re> plus the whole-word token <where>.
Unseen words like "unhappiness" inherit vectors from shared substrings ("happi", "ness"). FastText improves performance on morphologically complex languages (Finnish, Turkish) and handles typos better than whole-word models. Trade-off: larger model files and slightly slower inference because each word touches many n-gram buckets.
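The n-gram decomposition described above can be reproduced with a short helper. This sketch extracts trigrams only and skips the hashing into buckets that the real library performs; `char_ngrams` is a hypothetical name.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return character n-grams of `word` wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word overlaps heavily with a known relative.
shared = set(char_ngrams("unhappiness")) & set(char_ngrams("happiness"))
print(sorted(shared))
```

Because "unhappiness" and "happiness" share most of their trigrams, summing n-gram vectors gives the unseen word a vector close to its known relative.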
What embeddings capture — and what analogies really mean
The famous example king - man + woman ≈ queen suggests embeddings encode relational structure as vector offsets. Gender, tense, capital-country, and comparative patterns often appear as roughly parallel directions in space. This fueled excitement that embeddings "understand" semantics.
Reality is messier. Analogies work best on frequent, stereotypical pairs in the training corpus. Rare words, domain jargon, and culturally biased associations (gender stereotypes in profession vectors) expose limits. Analogies are a diagnostic, not proof of human-like comprehension. Always audit embedding spaces for bias before deploying in hiring, lending, or moderation systems.
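The arithmetic behind the analogy is simple enough to sketch. The vectors below are hand-constructed two-dimensional toys (one axis for "royalty", one for "gender"), not trained embeddings, so the clean result is built in by design. The sketch also excludes the query words from the candidates, a common convention in analogy evaluation that flatters the results.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(vectors, plus, minus):
    """Return the word nearest to sum(plus) - sum(minus), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in plus:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in minus:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(plus) | set(minus)
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in excluded]
    return max(scored)[1]

# Hand-built toy vectors: [royalty, gender]. Assumed for illustration only.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "carburetor": [-1.0, 0.2],
}
answer = analogy(toy, plus=["king", "woman"], minus=["man"])
print(answer)  # queen
```

On trained vectors the offsets are only approximately parallel, which is why the result degrades for rare or domain-specific words.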
Similarity is measured with cosine similarity or Euclidean distance after optional L2 normalization. Most pre-trained Word2Vec/GloVe vectors are used with cosine by convention.
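The two metrics are closely tied once vectors are L2-normalized: for unit vectors, the squared Euclidean distance equals 2 - 2·cos, so both rank neighbors identically. A quick numerical check on two arbitrary example vectors:

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [4.0, 3.0]

a_hat, b_hat = l2_normalize(a), l2_normalize(b)
cos = dot(a_hat, b_hat)  # on unit vectors, the dot product is the cosine
sq_dist = sum((x - y) ** 2 for x, y in zip(a_hat, b_hat))

print(math.isclose(sq_dist, 2 - 2 * cos))  # True
```

Without normalization the two can disagree, because Euclidean distance is also sensitive to vector length.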
The polysemy problem: static vs contextual embeddings
Classic word embeddings assign one vector per word type regardless of sense. "Bank" (financial institution) and "bank" (river edge) share a single point, usually an average of both contexts. That hurts tasks requiring sense disambiguation: machine translation, question answering, fine-grained search.
Contextual embeddings address this. ELMo (2018) produced different vectors for the same word in different sentences using bidirectional LSTMs. BERT and successors use transformer layers; each token's vector depends on its full sentence context. "Bank" in "river bank" and "bank" in "investment bank" get distinct representations.
Modern production systems rarely train Word2Vec from scratch. They use LLM embedding models (sentence-level) or fine-tuned transformers. Static embeddings remain useful as fast baselines, interpretable features, and teaching tools — and as the input layer concept that every NLP engineer should understand.
Worked example: Harbor Supply catalog search
Harbor Supply runs a 40,000-SKU hardware catalog. Shoppers search with informal language: "thing for hanging pictures" instead of "picture hanging kit." The team needs semantic recall beyond keyword match.
Phase 1 — Word2Vec baseline
They train Skip-gram embeddings on product titles, descriptions, and two years of search logs (window size 5, 200 dimensions, 5 negative samples). For the query "hanging pictures", they average the token vectors (excluding stopwords) and retrieve the nearest product-title vectors by cosine similarity. Recall@10 improves 18% over BM25 alone on a labeled test set, but the system confuses "picture frame" with "picture hook" because static vectors blur fine distinctions.
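A minimal sketch of this averaging-and-ranking step might look like the following. Everything here is hypothetical: the two-dimensional vectors, the stopword list, the product titles, and the function names stand in for Harbor Supply's trained 200-dimensional model and real catalog.

```python
import math

STOPWORDS = {"for", "the", "a", "of"}  # assumed stopword list

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embed_text(text, vectors):
    """Average the vectors of known, non-stopword tokens; None if nothing matches."""
    known = [vectors[t] for t in text.lower().split()
             if t not in STOPWORDS and t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def search(query, titles, vectors, k=10):
    """Rank titles by cosine similarity to the averaged query vector."""
    q = embed_text(query, vectors)
    if q is None:
        return []  # caller should fall back to keyword search
    scored = []
    for title in titles:
        t = embed_text(title, vectors)
        if t is not None:
            scored.append((cosine(q, t), title))
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]

# Toy vectors, assumed for illustration only.
toy = {
    "hanging": [0.9, 0.1], "pictures": [0.8, 0.3], "picture": [0.8, 0.3],
    "kit": [0.7, 0.2], "garden": [0.1, 0.9], "hose": [0.0, 1.0],
}
results = search("thing for hanging pictures", ["garden hose", "picture hanging kit"], toy)
print(results)
```

Note that "thing" is simply dropped because it has no vector, which is one way OOV tokens silently weaken a query.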
Phase 2 — Hybrid retrieval
They keep Word2Vec for cheap candidate generation, then rerank the top 50 with a small sentence embedding model (see hybrid search). Latency stays under 80 ms p95. For categories with heavy polysemy ("spring" as coil vs season), contextual reranking is mandatory.
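The two-stage pattern can be sketched independently of any particular model by passing the scoring functions in as parameters. The scorers below are deliberately crude stand-ins (token overlap for the cheap stage, exact phrase match for the reranker), chosen only so the example runs without a trained model.

```python
def hybrid_search(query, documents, cheap_score, rerank_score, n_candidates=50, k=10):
    """Stage 1: keep the top candidates by a cheap score. Stage 2: rerank them."""
    candidates = sorted(documents, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k]

# Stand-in scorers, assumed for illustration only.
def token_overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

def phrase_match(query, doc):
    return 1.0 if query in doc else 0.0

docs = ["picture frame with hook", "brass picture hook", "garden hose"]
results = hybrid_search("picture hook", docs, token_overlap, phrase_match,
                        n_candidates=2, k=10)
print(results)  # ['brass picture hook', 'picture frame with hook']
```

The expensive scorer only ever sees `n_candidates` documents, which is what keeps latency bounded as the catalog grows.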
Lesson
Word2Vec is a strong, interpretable first step when you have domain text and need a lightweight index. Upgrade when polysemy, cross-lingual queries, or long natural-language questions dominate.
Method decision table
| Approach | Best when | Watch out for |
|---|---|---|
| Word2Vec Skip-gram | Medium corpus, rare words matter, fast training | One vector per word; OOV tokens get random or zero vectors |
| Word2Vec CBOW | Large corpus, frequent words, speed priority | Smoother but less sharp on rare terms |
| GloVe | Global co-occurrence stats available; reproducible baselines | Memory for co-occurrence matrix on huge vocabularies |
| FastText | Morphologically rich language, typos, OOV handling | Larger models; subword collisions on very short words |
| Contextual (BERT, LLM embeddings) | Polysemy, QA, semantic search, production quality | Compute cost; not a single lookup table per word |
| Pre-trained general vectors | Prototyping, transfer to new domain with little data | Domain mismatch — "python" may mean snake or language |
Common pitfalls
- Training on too little data: embeddings need millions of tokens; a 10 MB corpus (very roughly a couple of million English tokens) sits at the low end and often produces noisy, unreliable vectors.
- Skipping preprocessing — inconsistent casing ("Apple" vs "apple") splits vocabulary; decide on lowercasing and stick to it.
- Treating analogies as ground truth — impressive demos hide failure modes on rare words and biased directions.
- Ignoring OOV at inference — plan for UNK tokens, subword models, or fallback to BM25.
- Averaging word vectors for sentences — works as a crude baseline but loses word order; use sentence encoders for real queries.
- Domain mismatch: Wikipedia-trained vectors are weak on legal, medical, or internal enterprise jargon; fine-tune or train in-domain.
- Confusing embedding layer with final representation — in transformers, the input embedding is only the first layer; contextual depth matters.
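One cheap guard against the OOV pitfall is to measure how many query tokens the model has never seen before relying on it. The sketch below assumes whitespace tokenization and lowercasing; the vocabulary and queries are made up.

```python
def oov_rate(queries, vocabulary):
    """Fraction of query tokens that have no entry in the embedding vocabulary."""
    total = 0
    missing = 0
    for query in queries:
        for token in query.lower().split():
            total += 1
            if token not in vocabulary:
                missing += 1
    return missing / total if total else 0.0

# Toy data, assumed for illustration only.
vocabulary = {"picture", "hook", "drywall", "anchor"}
queries = ["Picture hook", "drywall anchorz"]
rate = oov_rate(queries, vocabulary)
print(rate)  # 0.25
```

A high rate on real search logs is a signal to add subword vectors or route those queries to a keyword fallback.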
Practitioner checklist
- Define vocabulary size, casing, and punctuation rules before training.
- Ensure corpus scale is adequate (millions of tokens minimum for from-scratch training).
- Pick Skip-gram vs CBOW based on rare-word importance and corpus size.
- Evaluate on intrinsic tests (word similarity benchmarks) and extrinsic downstream tasks.
- Audit nearest neighbors for bias and offensive associations before launch.
- Handle OOV with FastText subwords, UNK token policy, or hybrid retrieval.
- L2-normalize vectors when you score with a plain dot product or Euclidean distance, so that rankings match cosine similarity.
- Benchmark against BM25 — embeddings should beat keyword baselines on your queries.
- Plan upgrade path to contextual embeddings when polysemy errors appear in production.
- Version and document training corpus — embeddings drift when source text changes.
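For the benchmarking item, a recall@k comparison against the keyword baseline can be sketched as follows. The labeled query, the relevant SKU identifiers, and both result lists are invented for illustration; they are not measurements.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Share of relevant items that appear in the top k retrieved results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall(results_by_query, labels_by_query, k=10):
    """Average recall@k over all labeled queries."""
    scores = [recall_at_k(results_by_query.get(q, []), rel, k)
              for q, rel in labels_by_query.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical labels and system outputs.
labels = {"hanging pictures": {"sku-1", "sku-2"}}
bm25_results = {"hanging pictures": ["sku-9", "sku-1", "sku-7"]}
embedding_results = {"hanging pictures": ["sku-1", "sku-2", "sku-9"]}

bm25_recall = mean_recall(bm25_results, labels, k=3)
embedding_recall = mean_recall(embedding_results, labels, k=3)
print(bm25_recall, embedding_recall)  # 0.5 1.0
```

Running both systems through the same function on the same labeled set keeps the comparison honest.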
Key takeaways
- Word embeddings map tokens to dense vectors where distributional similarity becomes geometric closeness — the foundation of modern NLP.
- Word2Vec learns from local prediction (Skip-gram, CBOW); GloVe factorizes global co-occurrence; FastText adds subword structure for OOV and morphology.
- Static embeddings assign one vector per word type — fast and interpretable but blind to polysemy and context.
- Contextual and LLM embeddings have largely replaced static vectors in production search and understanding tasks, but the concepts remain essential.
- Always evaluate on your domain — pre-trained vectors and analogy demos are starting points, not guarantees.
Related reading
- Cosine similarity explained — comparing vectors for search and recommendations
- LLM embeddings explained — passage-level vectors and model choice for RAG
- Transformer architecture explained — contextual representations and attention
- NLP fundamentals explained — tokenization, parsing, and the NLP pipeline