-> token embedding (-> vectors of width d_model)
residual + attention(norm(x)) # moves info across positions
residual + FFN(norm(x)) # processes each position
-> output projection -> vocab logits # often weight-tied to embedding
The residual stream (width d_model) flows end to end; each sublayer reads from it and adds back.
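The read-and-add-back pattern above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `rms_norm` omits the learned gain, and `toy_attention` and `toy_ffn` are hypothetical stand-ins (a causal mean and an elementwise ReLU) rather than real sublayers. The point is the shape of `block`, where each sublayer reads a normalized copy of the stream and its output is added back.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def toy_attention(seq):
    """Hypothetical stand-in for attention: causal mean over positions <= t."""
    out = []
    for t in range(len(seq)):
        prefix = seq[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def toy_ffn(vec):
    """Hypothetical stand-in for the FFN: acts on one position at a time."""
    return [max(0.0, v) for v in vec]

def block(seq):
    """One pre-norm block: x = x + attention(norm(x)); x = x + FFN(norm(x))."""
    mixed = toy_attention([rms_norm(x) for x in seq])   # moves info across positions
    seq = [[a + b for a, b in zip(x, m)] for x, m in zip(seq, mixed)]
    # FFN processes each position independently, then adds back to the stream
    return [[a + b for a, b in zip(x, toy_ffn(rms_norm(x)))] for x in seq]

stream = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.0, -1.0, 2.0]]  # 2 positions, d_model = 4
for _ in range(3):  # n_layers = 3
    stream = block(stream)
print(len(stream), len(stream[0]))  # width d_model is preserved end to end
```

Because every sublayer only adds to the stream, the width stays at d_model from the embedding through to the output projection.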
| Choice | Modern default | Why |
| --- | --- | --- |
| Norm placement | Pre-norm | Stable training at depth |
| Norm type | RMSNorm | Cheaper; no mean subtraction or bias |
| FFN activation | Gated (SwiGLU) | Better quality; hidden width ~8/3 * d_model |
| Position | RoPE (rotary) | Relative position, length generalization |
| Biases | Dropped | Fewer params, more stable |
| Embedding/output | Weight-tied | Share one matrix |
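The ~8/3 figure for gated FFNs follows from parameter matching: a gated FFN has three weight matrices instead of two, so shrinking the hidden width from 4 * d_model to 8/3 * d_model keeps the weight count the same. A small check, assuming biases are dropped and using an illustrative d_model of 768 (the helper name `ffn_params` is made up for this sketch):

```python
def ffn_params(d_model, d_ff, gated):
    """Weight count of one FFN with biases dropped."""
    n_matrices = 3 if gated else 2  # gated variants add a gate projection
    return n_matrices * d_model * d_ff

d_model = 768  # illustrative width, divisible by 3 so 8/3 * d_model is exact
plain = ffn_params(d_model, 4 * d_model, gated=False)
gated = ffn_params(d_model, 8 * d_model // 3, gated=True)
print(plain, gated)  # 4718592 4718592
```

Both come to 8 * d_model^2. In practice the gated width is often rounded to a hardware-friendly multiple, so the match is approximate.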
| Name | Meaning |
| --- | --- |
| d_model | Residual-stream width (the dominant dial) |
| n_layers | Depth (number of blocks) |
| n_heads | Attention heads; head_dim = d_model / n_heads |
| d_ff | FFN hidden width (~4 * d_model, or ~8/3 * d_model gated) |
| vocab size | Tokenizer vocabulary size |
| context length | Max sequence length handled |
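These dials can be gathered into a single record with the derived quantities computed from them. The sketch below is one possible layout, not a standard API: the class name `ModelShape` and the example values are illustrative assumptions, and `aspect_ratio` is taken here as d_model / n_layers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the shape hyperparameters named above."""
    d_model: int
    n_layers: int
    n_heads: int
    d_ff: int
    vocab_size: int
    context_length: int

    def __post_init__(self):
        # head_dim = d_model / n_heads must come out as a whole number
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def head_dim(self):
        return self.d_model // self.n_heads

    @property
    def aspect_ratio(self):
        # assumed definition: width divided by depth
        return self.d_model / self.n_layers

# Illustrative values only
shape = ModelShape(d_model=768, n_layers=12, n_heads=12,
                   d_ff=3072, vocab_size=50257, context_length=1024)
print(shape.head_dim, shape.aspect_ratio)  # 64 64.0
```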
N (non-embedding) ~= 12 * n_layers * d_model^2
per block: ~4*d_model^2 (attention Q,K,V,O) + ~8*d_model^2 (FFN)
embedding ~= vocab_size * d_model
d_model enters squared, so widening grows parameters faster than deepening: doubling d_model quadruples N, while doubling n_layers only doubles it. N feeds straight into lesson 2’s 6ND (compute) and 16N (memory) estimates.
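The count formulas are easy to check numerically. The following sketch uses illustrative values (12 layers, width 768, vocabulary 50257) and a hypothetical token count D of one billion. It treats 16N as bytes, which is an assumption about lesson 2's units.

```python
def non_embedding_params(n_layers, d_model):
    """N ~= 12 * n_layers * d_model^2, split per block as in the text."""
    attention = 4 * d_model ** 2   # Q, K, V, O projections
    ffn = 8 * d_model ** 2         # FFN weights
    return n_layers * (attention + ffn)

def embedding_params(vocab_size, d_model):
    return vocab_size * d_model

base = non_embedding_params(n_layers=12, d_model=768)
wider = non_embedding_params(n_layers=12, d_model=1536)   # double the width
deeper = non_embedding_params(n_layers=24, d_model=768)   # double the depth

print(base)                          # 84934656
print(wider // base, deeper // base) # 4 2
print(embedding_params(50257, 768))  # 38597376

tokens = 10 ** 9                 # hypothetical D
train_flops = 6 * base * tokens  # 6ND compute estimate
train_memory = 16 * base         # 16N memory estimate (assumed to be bytes)
print(train_flops, train_memory)
```

The estimate ignores norm gains and any remaining biases, which are small next to the d_model^2 terms.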
| Config field | This lesson |
| --- | --- |
| hidden_size | d_model |
| num_hidden_layers | n_layers |
| num_attention_heads | n_heads |
| intermediate_size | d_ff |
| rms_norm_eps | uses RMSNorm |
| rope_theta | uses RoPE |
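The mapping above can be applied mechanically to a config loaded as a dictionary. This is a minimal sketch: `to_lesson_names` is a hypothetical helper, the example values are illustrative, and the presence of `rms_norm_eps` or `rope_theta` is treated as a hint about the architecture rather than a guarantee.

```python
# Config field -> name used in this lesson
FIELD_MAP = {
    "hidden_size": "d_model",
    "num_hidden_layers": "n_layers",
    "num_attention_heads": "n_heads",
    "intermediate_size": "d_ff",
}

def to_lesson_names(config):
    """Translate a config dict into this lesson's vocabulary."""
    shape = {lesson: config[field]
             for field, lesson in FIELD_MAP.items() if field in config}
    # Presence of these fields suggests the corresponding design choice
    shape["uses_rmsnorm"] = "rms_norm_eps" in config
    shape["uses_rope"] = "rope_theta" in config
    return shape

# Illustrative values, as they might appear in a config.json
example_config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "intermediate_size": 11008,
    "rms_norm_eps": 1e-5,
    "rope_theta": 10000.0,
}
shape = to_lesson_names(example_config)
print(shape["d_model"], shape["n_layers"], shape["d_ff"])  # 4096 32 11008
```

In this example d_ff / d_model is about 2.69, close to the ~8/3 ratio expected for a gated FFN.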
Decoder-only: generates left-to-right; the modern LLM shape.
Residual stream: the width-d_model vector each sublayer reads and writes.
Pre-norm / RMSNorm / SwiGLU / RoPE: the converged norm-placement / norm-type / FFN-activation / positional choices.
Aspect ratio: the balance of width (d_model) to depth (n_layers).
Stanford CS336, Lecture 3 (Architectures, hyperparameters), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.