Transformer architecture explained
The transformer is the neural network architecture behind nearly every modern large language model (LLM), from GPT and Claude to open-weight Llama and Mistral checkpoints. Introduced in the 2017 paper "Attention Is All You Need", it replaced recurrent and convolutional sequence models with a stack of self-attention layers that process all tokens in parallel. That parallelism made training on web-scale text practical, which in turn made it possible to scale up the models that product teams now build on through prompt engineering, RAG, and fine-tuning. This guide explains how transformers work at a level sufficient to reason about latency, context limits, and model choice, without requiring a PhD in linear algebra.
Why transformers replaced RNNs
Before transformers, language models often used recurrent neural networks (RNNs) and LSTMs that read tokens one at a time, carrying a hidden state forward. That design is inherently sequential: token 500 cannot be processed until tokens 1–499 finish. Training on long documents uses GPUs poorly because little of the work can run in parallel, and gradients tend to vanish across distant time steps.
Transformers flip the bottleneck. Each layer lets every token attend to every other token in the sequence (subject to masking rules in decoders). Matrix multiplications across the full sequence map cleanly to GPU tensor cores. The trade-off is quadratic memory and compute in sequence length — attention over n tokens costs O(n²) — which is why context window expansion remains an active engineering problem. For the 4k–128k windows common in production chat, transformers still win decisively on throughput and trainability.
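To make the O(n²) term concrete, the sketch below estimates how much memory one layer's attention score matrices would take if they were stored in full. The head count (32) and 2-byte values (fp16/bf16) are illustrative assumptions, not figures for any particular model, and fused kernels avoid materializing these matrices at all.

```python
def attention_score_bytes(n_tokens: int, n_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one layer's n x n score matrix for every head."""
    return n_heads * n_tokens * n_tokens * bytes_per_value


for n in (4_096, 32_768, 131_072):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>7} tokens: {gib:,.0f} GiB per layer if scores are fully materialized")
```

Doubling the sequence length quadruples the figure, which is the practical meaning of quadratic cost.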
The high-level picture
A transformer block repeats two sub-layers: multi-head self-attention (tokens exchange information) and a position-wise feed-forward network (each token is transformed independently). Residual connections and layer normalization wrap both sub-layers so gradients flow through deep stacks — modern LLMs stack dozens of identical blocks (e.g., 32–80+ layers) with billions of parameters in the FFN and attention projections.
Self-attention in plain language
Self-attention answers: for this token, which other tokens in the sentence matter right now? Each token is embedded as a vector, then projected into three roles:
- Query (Q) — what this token is looking for
- Key (K) — what this token advertises about itself
- Value (V) — the information this token would contribute if selected
Attention scores are computed as dot products between queries and keys, divided by √d_k (the square root of the key dimension) so the softmax does not saturate and gradients stay stable. The scaled scores pass through softmax to produce weights that sum to 1, and those weights form a weighted sum of the value vectors: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Intuitively, the word bank in "river bank" attends strongly to river and weakly to unrelated tokens; disambiguation emerges from learned projections, not hand-written rules.
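The computation just described can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production kernel; the random projection matrices w_q, w_k, and w_v stand in for learned weights, and the sizes are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)  # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))            # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(out.shape, weights.sum(axis=-1))
```

Row i of weights shows how much token i draws from every other token; the output row is the matching blend of value vectors.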
Multi-head attention
Instead of one attention pass, transformers run h parallel heads with smaller dimension per head, then concatenate and project the result. Each head can specialize: one may track syntactic dependencies, another coreference, another local n-gram patterns. Heads are not manually assigned; they emerge from training on billions of tokens. The head count and model dimension (e.g., 96 heads × 128 dims = 12,288-d model in large variants) are hyperparameters fixed at architecture design time.
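The split, attend, concatenate, project sequence can be sketched as follows. It is a minimal NumPy illustration under assumed toy sizes (16-d model, 4 heads); the weight matrices are random stand-ins, and real implementations add masking, batching, and fused kernels.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    n_tokens, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    heads = softmax(scores) @ v                          # (n_heads, n, d_head)
    # Concatenate the heads back to (n_tokens, d_model), then mix with W_O.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ w_o


rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mha_out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(mha_out.shape)
```

Each head sees only its own d_head-sized slice of the projections, which is what lets different heads learn different attention patterns at roughly the cost of one full-width pass.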
Causal masking in decoders
Decoder-only models (GPT family) generate text left-to-right. During training, future tokens must not leak into past positions; otherwise the model could cheat by peeking at the answer. A causal mask sets the attention scores for illegal (future) positions to negative infinity before softmax, so their weights come out as exactly zero. Encoder-only models (BERT) use bidirectional attention because they see the full input at once; encoder-decoder models (original Transformer, T5) bidirectionally encode the source and causally decode the target.
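A common way to implement the causal mask is to overwrite the scores of future positions with negative infinity before softmax, so that their weights become exactly zero. The sketch below assumes a single head and uses all-zero scores so the effect of the mask is easy to read.

```python
import numpy as np


def causal_attention_weights(scores):
    n = scores.shape[-1]
    # True above the diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) == 0 for masked entries
    return e / e.sum(axis=-1, keepdims=True)


causal_weights = causal_attention_weights(np.zeros((4, 4)))
print(np.round(causal_weights, 2))
```

With equal scores, token 0 attends only to itself and token 3 spreads its weight evenly over all four positions; the upper triangle stays at zero.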
Positional encoding: order without recurrence
Self-attention on its own is order-blind: without positional information, shuffling the input tokens simply shuffles the outputs the same way (the operation is permutation-equivariant), so word order would carry no meaning. Transformers inject positional information so order is recoverable. Early designs added fixed sinusoidal encodings to input embeddings, and models such as BERT and GPT-2 used learned absolute positions, which are tied to the maximum length seen in training. Most modern LLMs use rotary positional embeddings (RoPE), which encode relative offsets between tokens; this tends to generalize better when inference sequences differ in length from training, and it is what context-extension techniques like YaRN or NTK-aware scaling build on.
Position encoding interacts directly with context window limits: extrapolating far beyond trained lengths degrades attention quality unless the architecture and fine-tuning explicitly support it, for example through a larger RoPE base or rescaled positions. That is one reason two models with the same parameter count can behave differently at 32k tokens.
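A minimal sketch of rotary embeddings shows why they encode relative position. It assumes the layout that pairs dimension i with dimension i + d/2 (one of several layouts found in practice) and the commonly used base of 10000; the helper name rope and the toy vectors are illustrative.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Rotate each (x[i], x[i + d/2]) pair by an angle proportional to position."""
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)      # one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]   # (n_tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))


def score(q_pos, k_pos):
    rotated_q = rope(q, np.array([q_pos]))
    rotated_k = rope(k, np.array([k_pos]))
    return (rotated_q @ rotated_k.T).item()


# Same offset (2 positions apart) at different absolute positions.
print(score(3, 1), score(10, 8))
```

The two printed scores agree up to floating-point error because the query-key dot product depends only on the offset between positions. Changing base changes how fast the angles grow with position, which is the knob that base-scaling approaches to context extension adjust.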
Encoder, decoder, and the shapes that matter today
The original transformer paired an encoder (bidirectional) with a decoder (causal) for machine translation. Three families dominate production in 2026:
- Encoder-only (BERT, RoBERTa) — excels at classification, embedding extraction, and serving as the bi-encoder in vector search pipelines. Not used for open-ended generation.
- Decoder-only (GPT, Llama, Mistral) — a single causal stack trained on next-token prediction. Chat, agents, and code assistants are almost always decoder-only because one architecture handles pretraining and instruction tuning cleanly.
- Encoder-decoder (T5, BART, some translation models) — still strong for seq2seq tasks (summarization, translation) but less common for general chat LLMs; some multimodal systems use cross-attention from a vision encoder into a text decoder.
When you call a chat API and receive streamed tokens, you are almost always watching a decoder-only transformer sample from a probability distribution over the vocabulary at each step. Temperature and top-p sampling reshape that distribution at inference time; the architecture and weights are unchanged.
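The sampling step can be sketched as below. This is an illustrative implementation of temperature scaling followed by top-p (nucleus) filtering, not any provider's actual code; the function name and the four-entry logits vector are hypothetical.

```python
import numpy as np


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = logits / temperature
    scaled = scaled - scaled.max()
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # Top-p: keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))


logits = np.array([2.0, 1.0, 0.1, -3.0])  # hypothetical scores over a 4-token vocabulary
rng = np.random.default_rng(0)
print([sample_next_token(logits, temperature=0.7, top_p=0.9, rng=rng) for _ in range(10)])
```

With a very small top_p or temperature, the function collapses to always picking the most likely token, which is why low settings make output more repeatable.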
Feed-forward layers and where parameters live
After attention mixes information across tokens, each position passes through the same two-layer MLP (often called the FFN), typically expanding to 4× the model dimension and contracting back, e.g., 12,288 → 49,152 → 12,288 in large models. In this standard layout, roughly two-thirds of a block's weights live in the FFN matrices rather than in attention: about 8d² versus 4d² for model dimension d, ignoring biases. Attention provides flexible routing; FFNs are widely thought to store much of the factual and lexical knowledge learned during pretraining. This split informs parameter-efficient fine-tuning: LoRA adapters often target attention projections because they steer behavior with fewer trainable parameters, while full fine-tunes update FFNs when you need deep domain knowledge injection.
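The parameter split follows from simple arithmetic. The sketch below counts weights in one block under simplifying assumptions: four square attention projections (Q, K, V, output), a plain two-matrix FFN with 4× expansion, and no biases or normalization parameters. Gated FFN variants and shared K/V heads change the exact ratio.

```python
def block_parameter_counts(d_model: int, ffn_multiplier: int = 4) -> dict:
    attention = 4 * d_model * d_model               # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_multiplier * d_model)  # up-projection and down-projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }


counts = block_parameter_counts(12_288)
print(f"attention: {counts['attention']:,}")
print(f"ffn:       {counts['ffn']:,}")
print(f"ffn share: {counts['ffn_share']:.1%}")
```

Under these assumptions the FFN holds twice as many weights as attention in every block, regardless of model dimension.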
Layer normalization and residuals
Each sub-layer uses a residual connection (output = input + sublayer(input)) and layer norm (Pre-LN is standard in modern stacks). Residuals let optimizers train very deep networks; Pre-LN stabilizes early training compared to the Post-LN used in the original paper. These details rarely appear in product docs but explain why distillation and pruning target specific layers: early layers often encode syntax, late layers style and task behavior.
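The Pre-LN wiring can be sketched as a function that takes the two sub-layers as arguments. The layer norm here omits the learnable gain and bias for brevity, and the attention and ffn callables in the demo are hypothetical stand-ins rather than real sub-layers.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_ln_block(x, attention, ffn):
    # Pre-LN: normalize the sub-layer's input, then add its output to the residual stream.
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
w_mix = rng.normal(size=(8, 8)) * 0.1  # stand-in weights, not a real attention layer

block_out = pre_ln_block(
    x,
    attention=lambda h: h @ w_mix,
    ffn=lambda h: np.maximum(h, 0.0) @ w_mix,
)
print(block_out.shape)
```

Because each sub-layer only adds to x, a sub-layer that outputs zeros leaves the input untouched; that identity path is what lets gradients reach early layers in deep stacks.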
Inference: KV cache and why long chats cost more
Naive autoregressive generation would recompute keys and values for all prior tokens at every new step, which is wasteful. Production systems cache the K and V tensors per layer for past tokens (the KV cache) and project Q, K, and V only for the newest token. Cache memory grows linearly with 2 (K and V) × context length × layers × K/V heads × head dimension × batch size. That is the hardware reason long-context sessions increase GPU memory use, and it is one reason providers meter input tokens separately from output tokens: the prompt is processed in a single parallel pass, while each output token requires its own step over the growing cache.
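The caching idea can be checked on a toy example: decoding with a growing cache should give the same attention outputs as recomputing every key and value from scratch at each step. The sketch assumes one layer, one head, and batch size 1, with random stand-in weights.

```python
import numpy as np


def attend(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v


rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))  # stand-in embeddings, fed one at a time

# Cached decoding: project only the newest token and append to the cache.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
cached_rows = []
for x in tokens:
    x = x[None, :]
    k_cache = np.vstack([k_cache, x @ w_k])
    v_cache = np.vstack([v_cache, x @ w_v])
    cached_rows.append(attend(x @ w_q, k_cache, v_cache))
cached_outputs = np.vstack(cached_rows)

# Naive decoding: re-project every earlier token at every step.
recomputed = np.vstack([
    attend(tokens[t:t + 1] @ w_q, tokens[:t + 1] @ w_k, tokens[:t + 1] @ w_v)
    for t in range(len(tokens))
])

print(np.allclose(cached_outputs, recomputed), k_cache.shape)
```

The cache trades memory for compute: each step does one projection per matrix instead of one per past token, but k_cache and v_cache grow by one row per generated token.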
Techniques like grouped-query attention (GQA) and multi-query attention (MQA) share K/V heads across query heads to shrink cache size — common in Llama 2/3 class models serving millions of requests. FlashAttention-style kernels fuse attention math to reduce HBM round-trips. These are implementation optimizations atop the same transformer math described above.
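A back-of-the-envelope estimate shows how sharing K/V heads shrinks the cache. The configuration below (32 layers, head dimension 128, 32 query heads, 2-byte values) is illustrative rather than taken from any specific model card, and the helper name is hypothetical.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, batch_size=1, bytes_per_value=2):
    # Factor of 2 because both keys and values are cached for every token.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * batch_size * bytes_per_value


# Full multi-head attention: one K/V head per query head.
mha = kv_cache_bytes(n_tokens=32_768, n_layers=32, n_kv_heads=32, head_dim=128)
# Grouped-query attention: 8 K/V heads shared across the 32 query heads.
gqa = kv_cache_bytes(n_tokens=32_768, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB per sequence")
```

In this example, cutting the K/V heads from 32 to 8 cuts the per-sequence cache by the same factor of four, which directly raises how many concurrent long sessions fit on one GPU.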
From architecture to application
Understanding transformers clarifies several product decisions:
- Embeddings — the final hidden state or a pooled encoder output becomes a dense vector for semantic search; quality depends on encoder training, not your vector DB brand alone.
- RAG — retrieval supplies tokens the model attends to in-context; attention is the mechanism that integrates them. Chunk size and ordering change which keys receive weight.
- Fine-tuning vs prompting — prompting steers attention patterns transiently; fine-tuning permanently shifts weights in attention and FFN layers.
- Evaluation — benchmarks measure emergent capabilities of scaled decoder stacks; knowing the architecture family helps interpret eval results across model releases.
You do not need to implement attention from scratch to ship LLM features — frameworks like PyTorch, JAX, and vLLM handle that. You do need enough literacy to debug hallucinations (attention over weakly weighted retrieved chunks), latency spikes (KV cache pressure), and model swaps (encoder vs decoder, context length, quantization trade-offs).
Key takeaways
- Transformers process sequences with parallel self-attention instead of step-by-step recurrence — the foundation of modern LLM training scale.
- Queries, keys, and values implement soft lookup between tokens; multi-head attention runs several specialized lookups in parallel.
- Decoder-only GPT-style stacks dominate chat and agents; encoders power embeddings and bi-encoder retrieval in RAG systems.
- Positional encodings (often RoPE) restore token order; they interact with how far beyond training length you can stretch context.
- KV caching makes autoregressive inference feasible but ties cost directly to context length — the bridge between architecture and your API bill.
Related reading
- LLM context windows explained — token budgets, KV cache costs, and RAG patterns shaped by attention's O(n²) cost
- RAG explained — how retrieved chunks become tokens the transformer attends to at inference time
- LLM fine-tuning explained — LoRA, QLoRA, and which transformer layers adapters typically target
- Vector databases explained — storing encoder embeddings and running ANN search that feeds transformer context