
Cheatsheet: The Transformer architecture

token IDs
-> token embedding (-> vectors of width d_model)
-> n_layers x block:
     residual + attention(norm(x))   # moves info across positions
     residual + FFN(norm(x))         # processes each position
-> final norm
-> output projection -> vocab logits   # often weight-tied to embedding

The residual stream (width d_model) flows end to end; each sublayer reads from it and adds back.
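
The wiring above can be sketched in a few lines of plain Python. This is a minimal illustration of the pre-norm residual pattern only: `toy_attention` and `toy_ffn` are hypothetical stand-ins (a causal running average and a per-position ReLU), not real attention or FFN layers, and the norm has no learned gain.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain here)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def toy_attention(xs):
    """Stand-in for attention: each position averages itself and earlier positions."""
    out = []
    for t in range(len(xs)):
        visible = xs[: t + 1]  # causal: later positions are not visible
        out.append([sum(col) / len(visible) for col in zip(*visible)])
    return out

def toy_ffn(xs):
    """Stand-in for the FFN: the same function applied to every position independently."""
    return [[max(0.0, v) for v in vec] for vec in xs]

def add(xs, ys):
    """Elementwise sum of two sequences of vectors (the residual add)."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def block(xs):
    """One pre-norm block: each sublayer reads norm(x) and adds its output back."""
    xs = add(xs, toy_attention([rms_norm(x) for x in xs]))
    xs = add(xs, toy_ffn([rms_norm(x) for x in xs]))
    return xs

def forward(xs, n_layers):
    """Run the residual stream through n_layers blocks, then the final norm."""
    for _ in range(n_layers):
        xs = block(xs)
    return [rms_norm(x) for x in xs]  # output projection to logits omitted

# Three positions, d_model = 4 (arbitrary illustrative values).
stream = [[1.0, -2.0, 0.5, 0.0], [0.3, 0.1, -0.7, 2.0], [0.0, 1.0, 1.0, -1.0]]
out = forward(stream, n_layers=2)
```

The stream keeps its shape (positions x d_model) from start to finish; only the attention stand-in mixes information across positions.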

Converged design choices (modern vs original 2017)

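| Choice | Modern default | Why |
| --- | --- | --- |
| Norm placement | Pre-norm | Stable training at depth |
| Norm type | RMSNorm | Cheaper; no mean subtraction or bias |
| FFN activation | Gated (SwiGLU) | Better quality; hidden width ~8/3 * d_model keeps FFN params at ~8 * d_model^2 |
| Position | RoPE (rotary) | Encodes relative position; helps length generalization |
| Biases | Dropped | Fewer parameters, more stable training |
| Embedding/output | Often weight-tied | One matrix shared by both |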
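
As a rough illustration of the gated FFN row, here is a bias-free SwiGLU feed-forward for a single position in plain Python. The weight names (`W_gate`, `W_up`, `W_down`) and the random toy weights are assumptions made for this sketch, not names taken from the lesson.

```python
import math
import random

def silu(z):
    """SiLU / swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN without biases: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(W_down, hidden)

# Toy sizes: d_model = 3, d_ff = 8, i.e. d_ff = 8/3 * d_model.
d_model, d_ff = 3, 8
rng = random.Random(0)
W_gate = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_down = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]

y = swiglu_ffn([0.5, -1.0, 2.0], W_gate, W_up, W_down)

# Three matrices of d_model * d_ff weights each.
ffn_params = 3 * d_model * d_ff
```

With d_ff = 8/3 * d_model the three matrices total 8 * d_model^2 weights, the same as an ungated FFN with two matrices at d_ff = 4 * d_model.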

| Name | Meaning |
| --- | --- |
| d_model | Residual-stream width (the dominant dial) |
| n_layers | Depth (number of blocks) |
| n_heads | Attention heads; head_dim = d_model / n_heads |
| d_ff | FFN hidden width (~4 * d_model, or ~8/3 * d_model when gated) |
| vocab size | Tokenizer vocabulary size |
| context length | Maximum sequence length handled |
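
The two derived quantities in the table can be computed directly. In this sketch, rounding d_ff up to a multiple of 64 is an assumption (a common convenience, not something the lesson specifies), and `derived_sizes` is a hypothetical helper name.

```python
import math

def derived_sizes(d_model, n_heads, gated=True, multiple_of=64):
    """Return (head_dim, d_ff) from the rules of thumb in the table.

    multiple_of is an assumed rounding granularity for d_ff.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    head_dim = d_model // n_heads
    target = 8 * d_model / 3 if gated else 4 * d_model
    d_ff = multiple_of * math.ceil(target / multiple_of)
    return head_dim, d_ff

# Illustrative values only.
print(derived_sizes(4096, 32))               # gated FFN
print(derived_sizes(4096, 32, gated=False))  # ungated FFN
```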

N (non-embedding) ~= 12 * n_layers * d_model^2
  per block: ~4 * d_model^2 (attention Q, K, V, O) + ~8 * d_model^2 (FFN)
embedding ~= vocab_size * d_model

d_model enters squared, so widening grows parameters faster than deepening: doubling d_model roughly quadruples N, while doubling n_layers only doubles it. N then feeds straight into lesson 2’s 6ND (compute) and 16N (memory) estimates.
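
A quick check of the width-versus-depth claim, using the approximations above. The sizes are illustrative and not taken from any particular model. D is read here as the number of training tokens and 16N as bytes; both readings are assumptions, since lesson 2 is not reproduced in this cheatsheet.

```python
def non_embedding_params(n_layers, d_model):
    """Cheatsheet estimate: ~4*d_model^2 (attention) + ~8*d_model^2 (FFN) per block."""
    per_block = 4 * d_model**2 + 8 * d_model**2
    return n_layers * per_block

def embedding_params(vocab_size, d_model):
    """One row of width d_model per vocabulary entry."""
    return vocab_size * d_model

# Illustrative sizes.
base = non_embedding_params(n_layers=32, d_model=4096)
wider = non_embedding_params(n_layers=32, d_model=8192)   # 2x width
deeper = non_embedding_params(n_layers=64, d_model=4096)  # 2x depth

print(f"base N    ~ {base / 1e9:.2f}B parameters")
print(f"2x width -> {wider / base:.0f}x parameters")
print(f"2x depth -> {deeper / base:.0f}x parameters")
print(f"embedding ~ {embedding_params(32_000, 4096) / 1e6:.0f}M parameters")

# Lesson 2 rules of thumb (interpretation assumed, see the note above).
D = 100_000_000_000            # hypothetical training-token count
train_flops = 6 * base * D     # 6ND
memory_bytes = 16 * base       # 16N
```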

Reading a real config (example field names)

| Config field | This lesson |
| --- | --- |
| hidden_size | d_model |
| num_hidden_layers | n_layers |
| num_attention_heads | n_heads |
| intermediate_size | d_ff |
| rms_norm_eps | uses RMSNorm |
| rope_theta | uses RoPE |

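
The mapping in the table could be applied to a config dictionary along these lines. The config values below are made up for illustration, `read_config` is a hypothetical helper, and treating the mere presence of `rms_norm_eps` or `rope_theta` as evidence of RMSNorm or RoPE is a heuristic rather than a guarantee.

```python
# Field names from the table above, mapped to this lesson's names.
FIELD_TO_LESSON = {
    "hidden_size": "d_model",
    "num_hidden_layers": "n_layers",
    "num_attention_heads": "n_heads",
    "intermediate_size": "d_ff",
}

def read_config(config):
    """Translate example config fields into the lesson's vocabulary."""
    shape = {
        lesson: config[field]
        for field, lesson in FIELD_TO_LESSON.items()
        if field in config
    }
    # Heuristic: these fields only make sense if the feature is in use.
    shape["norm"] = "RMSNorm" if "rms_norm_eps" in config else "unknown"
    shape["position"] = "RoPE" if "rope_theta" in config else "unknown"
    return shape

# Hypothetical config, not copied from any real model.
example_config = {
    "hidden_size": 2048,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "intermediate_size": 5504,
    "rms_norm_eps": 1e-5,
    "rope_theta": 10000.0,
}

shape = read_config(example_config)
print(shape)
```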
  • Decoder-only: generates left-to-right; the modern LLM shape.
  • Residual stream: the width-d_model vector each sublayer reads and writes.
  • Pre-norm / RMSNorm / SwiGLU / RoPE: the converged norm-placement / norm-type / FFN-activation / positional choices.
  • Aspect ratio: the balance of width (d_model) to depth (n_layers).
  • Stanford CS336, Lecture 3 (Architectures, hyperparameters), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.