Summary: The Transformer architecture
“The Transformer” is not a sprawling design space: modern LLMs share one skeleton and differ in a few well-understood choices. The skeleton is a decoder-only model: token embedding, then n_layers identical blocks (each an attention sublayer and an FFN sublayer, with normalization and a residual connection), then a final norm and an output projection. The residual stream of width d_model flows end to end; each sublayer reads from it and adds back. The field converged on pre-norm, RMSNorm, gated activations (SwiGLU), RoPE positions, no biases, and weight tying. A small set of hyperparameters sizes the model (d_model, n_layers, n_heads, d_ff, vocab, context), and parameters scale as about 12 * n_layers * d_model^2, so width dominates. This is the scan version; the lesson makes every model config legible.
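The skeleton described above can be sketched in a few lines of plain Python. This is an illustrative toy, not a real model: `toy_attention` and `toy_ffn` are hypothetical stand-ins with no learned weights (uniform causal averaging and a per-position ReLU), the learned RMSNorm gain is left out, and the embedding and output projection are omitted. What it does show is the data flow: a residual stream that each sublayer reads through a norm and then adds back to.

```python
import math

def rms_norm(vec, eps=1e-6):
    # RMSNorm without the learned gain: rescale to unit root-mean-square.
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def toy_attention(seq):
    # Stand-in for attention (assumption): each position averages itself and
    # all earlier positions with uniform weights. Real attention learns them.
    out = []
    for t in range(len(seq)):
        window = seq[: t + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def toy_ffn(seq):
    # Stand-in for the FFN (assumption): the same function at every position.
    return [[max(v, 0.0) for v in vec] for vec in seq]

def add(stream, update):
    # Residual connection: the sublayer output is added back to the stream.
    return [[s + u for s, u in zip(sv, uv)] for sv, uv in zip(stream, update)]

def block(stream):
    # Pre-norm: normalize the sublayer input; the stream itself is never normed.
    stream = add(stream, toy_attention([rms_norm(v) for v in stream]))
    stream = add(stream, toy_ffn([rms_norm(v) for v in stream]))
    return stream

def forward(stream, n_layers):
    for _ in range(n_layers):
        stream = block(stream)
    return [rms_norm(v) for v in stream]  # final norm; output projection omitted

# Three positions, d_model = 4 (made-up numbers standing in for embeddings).
embedded = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.0, 1.0, -1.0], [2.0, 2.0, -1.0, 0.0]]
hidden = forward(embedded, n_layers=2)
```

Only the attention stand-in looks at other positions, and only at earlier ones, so changing the last token leaves the outputs for earlier positions unchanged.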
Core ideas
- One skeleton. Decoder-only: embedding to n_layers blocks (attention + FFN sublayers, each normed and residual) to final norm to output projection.
- The residual stream (width d_model) flows end to end; sublayers read and add back. Attention moves information across positions; the FFN processes each position.
- Converged choices: pre-norm (not post-norm) for deep-model stability; RMSNorm (not LayerNorm); gated FFN activations (SwiGLU, smaller hidden ratio); RoPE rotary positions (not learned absolute); typically no bias terms; weight tying.
- Sizing hyperparameters: d_model, n_layers, n_heads (and head_dim), d_ff, vocabulary size, context length.
- Parameters ~ 12 * n_layers * d_model^2 (non-embedding), plus vocab * d_model. d_model dominates because it enters squared.
- Architecture is budget allocation: depth vs width vs data for a fixed compute budget, tying straight back to lesson 2’s cost accounting and forward to scaling laws.
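To see where the 12 in the parameter rule comes from, here is a minimal counting sketch. It rests on assumptions the summary does not spell out: four d_model x d_model attention projections (full multi-head attention, not grouped-query), no biases, norm gains ignored, and an FFN hidden width of 4 * d_model when ungated or about 8/3 * d_model when gated. The function names and the sizes used are hypothetical.

```python
def block_params(d_model, d_ff, gated=True):
    # Attention: Q, K, V and output projections, each d_model x d_model.
    attn = 4 * d_model * d_model
    # FFN: up and down projections, plus a gate projection when gated.
    ffn = (3 if gated else 2) * d_model * d_ff
    return attn + ffn

def total_params(n_layers, d_model, d_ff, vocab, gated=True, tied=True):
    # With weight tying the embedding matrix is reused as the output projection.
    embed = vocab * d_model * (1 if tied else 2)
    return n_layers * block_params(d_model, d_ff, gated) + embed

d = 3072  # made-up width, chosen so that 8/3 * d is a whole number
print(block_params(d, d_ff=4 * d, gated=False) == 12 * d * d)      # True
print(block_params(d, d_ff=8 * d // 3, gated=True) == 12 * d * d)  # True
```

Under these assumptions both FFN styles land on 12 * d_model^2 per block, which is why a gated FFN uses the smaller hidden ratio. Doubling d_model (with d_ff scaled along) quadruples the per-block count, while doubling n_layers only doubles it.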
What changes for you
This lesson makes model configs legible and reframes architecture as a budget problem rather than a bag of tricks. Open any modern LLM’s config and the fields are exactly these: hidden_size (d_model), num_hidden_layers, num_attention_heads, intermediate_size (d_ff), plus the norm, activation, and positional scheme. You now know what each does and why it was chosen. The apparent zoo of “different architectures” collapses into one skeleton with a few switched settings. And because the hyperparameters set the parameter count, which sets the cost from lesson 2, choosing a design is choosing how to spend compute across width, depth, and data, which is the question scaling laws answer with evidence. With the model defined, the next lesson looks at the main variations on its sublayers (attention alternatives and mixture of experts) before Phase 2 turns to making it all run fast.
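As a small illustration of reading a config, the sketch below maps the field names mentioned above onto the lesson's dials. The config values are made up rather than taken from any real model, `Dials` and `read_dials` are hypothetical helpers, and deriving head_dim as d_model / n_heads is an assumption: some configs set head_dim explicitly.

```python
from dataclasses import dataclass

@dataclass
class Dials:
    d_model: int
    n_layers: int
    n_heads: int
    d_ff: int

    @property
    def head_dim(self):
        # Assumes heads split d_model evenly (checked in read_dials).
        return self.d_model // self.n_heads

    @property
    def ffn_ratio(self):
        # Around 4 suggests an ungated FFN; around 8/3 suggests a gated one.
        return self.d_ff / self.d_model

def read_dials(config):
    dials = Dials(
        d_model=config["hidden_size"],
        n_layers=config["num_hidden_layers"],
        n_heads=config["num_attention_heads"],
        d_ff=config["intermediate_size"],
    )
    if dials.d_model % dials.n_heads != 0:
        raise ValueError("d_model is not divisible by n_heads")
    return dials

example_config = {  # illustrative values only
    "hidden_size": 3072,
    "num_hidden_layers": 28,
    "num_attention_heads": 24,
    "intermediate_size": 8192,
}
dials = read_dials(example_config)
print(dials.head_dim, round(dials.ffn_ratio, 2))
```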
A modern LLM is one skeleton plus a handful of converged choices and a few sizing dials. Learn those, and every model config you open is suddenly legible.