A transformer block = multi-head attention
+ feed-forward network (per token)
+ position encoding (added once before block 1)
+ residual connections (×2)
+ layer normalization (×2)
Stacked attention alone does not work. Each wrapping piece fixes a specific structural gap.
| Piece | What it adds | Gap it fixes |
|---|---|---|
| Position encoding | Position-dependent vector added to each embedding | Self-attention is order-blind by default (permutation-equivariant: shuffling the input tokens only shuffles the outputs) |
| Feed-forward network | Per-token MLP with nonlinearity | Attention only takes weighted averages of value vectors; without a per-token nonlinearity, extra depth adds little expressive power |
| Residual connections | output = input + sub-layer(input) around each sub-layer | Vanishing gradients in deep stacks; sub-layers overwriting their input instead of refining it |
| Layer normalization | Rescale activations across features per token | Activation scale drift across deep stacks |
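
The position-encoding row can be checked directly. The sketch below is a minimal single-head self-attention in plain Python; the identity Q/K/V projections and the toy token values are assumptions made for brevity, not part of the lesson. It shows that reversing the input order only reverses the output rows, so attention on its own carries no notion of position.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections (illustrative only)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # made-up embeddings
forward = self_attention(tokens)
backward = self_attention(tokens[::-1])

# Un-reversing the second result should recover the first, row for row
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_f, row_b in zip(forward, backward[::-1])
    for a, b in zip(row_f, row_b)
)
print(same)  # True
```

Adding a position-dependent vector to each embedding before attention breaks this symmetry, which is the gap the first table row describes.
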
| Stage | Vector |
|---|---|
| Input | x_0 (shape d_model) |
| After attention sub-layer (Add+Norm) | x_1 = LayerNorm(x_0 + Attention(x_0)) |
| After FFN sub-layer (Add+Norm) | x_2 = LayerNorm(x_1 + FFN(x_1)) |
Shape in equals shape out. That is what lets the next block stack on top.
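
The two Add+Norm steps above can be sketched end to end. This is a simplified illustration, not a reference implementation: it assumes a single attention head, no biases, no output projection, no learned LayerNorm scale or shift, ReLU in the FFN, and random weights. The helper names (`matmul`, `rand_matrix`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    # Normalize each token (row) across its features; learned scale/shift omitted
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv):
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def ffn(x, w1, w2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, d_ff = 3, 4, 16
x0 = rand_matrix(seq_len, d_model, rng)
wq, wk, wv = (rand_matrix(d_model, d_model, rng) for _ in range(3))
w1, w2 = rand_matrix(d_model, d_ff, rng), rand_matrix(d_ff, d_model, rng)

x1 = layer_norm(add(x0, attention(x0, wq, wk, wv)))  # Add+Norm after attention
x2 = layer_norm(add(x1, ffn(x1, w1, w2)))            # Add+Norm after FFN

print(len(x0), len(x0[0]), "->", len(x2), len(x2[0]))  # 3 4 -> 3 4
```

Because `x2` has the same shape as `x0`, it can be fed straight into another block with its own weights.
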
| Knob | What it controls | Typical range |
|---|---|---|
| Layer count (N) | Depth of the model | 6 to 100+ |
| Head count per layer (h) | Parallel attention computations per block | 8 to 32 |
| d_model | Main embedding dimension | A few hundred to several thousand |
| d_ff | FFN hidden dimension | Typically 4 × d_model |
| Axis | Original (2017) | Modern variants |
|---|---|---|
| Position encoding | Sinusoidal | Learned, RoPE, ALiBi |
| Normalization | LayerNorm | RMSNorm, DeepNorm |
| Norm placement | Post-LN (after sub-layer) | Pre-LN (before sub-layer); standard in modern open-source models |
| FFN activation | ReLU | GELU, SwiGLU |
| Attention head sharing | Full multi-head | MQA, GQA |
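
The norm-placement row is easiest to see side by side. The sketch below operates on a single token vector and uses a hypothetical `toy_sublayer` as a stand-in for attention or the FFN; LayerNorm is shown without its learned scale and shift.

```python
import math

def layer_norm(vec, eps=1e-5):
    # Normalize one token across its features (no learned scale/shift)
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def post_ln(x, sublayer):
    # Original (2017) ordering: add the residual, then normalize the sum
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    # Pre-LN ordering: normalize only the sub-layer input; the residual path stays untouched
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(vec):
    # Hypothetical stand-in for attention or FFN
    return [0.5 * v for v in vec]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln(x, toy_sublayer)
pre = pre_ln(x, toy_sublayer)
print([round(v, 3) for v in post])
print([round(v, 3) for v in pre])
```

In the Post-LN output the token is re-normalized to zero mean, while in the Pre-LN output the original `x` passes through the residual path unscaled. That untouched residual path is a commonly cited reason Pre-LN stacks are easier to train at depth.
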
| Field in a model card | What it controls | Lesson term |
|---|---|---|
| num_hidden_layers | Number of stacked blocks | N |
| hidden_size | Main embedding dimension | d_model |
| intermediate_size | FFN hidden dimension | d_ff |
| num_attention_heads | Heads per block | h |
| hidden_act | FFN activation function | activation (ReLU, GELU, SwiGLU) |
| rms_norm_eps / layer_norm_eps | Normalization epsilon | LayerNorm or RMSNorm parameter |
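
As a small illustration of reading these fields, the sketch below parses a config snippet and renames each field to its lesson term. The config values are hypothetical and chosen only for the example; `FIELD_TO_TERM` and `knobs` are names introduced here.

```python
import json

# Hypothetical config values for illustration; not taken from any real model card
config_text = """
{
  "num_hidden_layers": 12,
  "hidden_size": 768,
  "intermediate_size": 3072,
  "num_attention_heads": 12,
  "hidden_act": "gelu",
  "layer_norm_eps": 1e-05
}
"""

FIELD_TO_TERM = {
    "num_hidden_layers": "N",
    "hidden_size": "d_model",
    "intermediate_size": "d_ff",
    "num_attention_heads": "h",
    "hidden_act": "activation",
}

config = json.loads(config_text)
knobs = {term: config[field] for field, term in FIELD_TO_TERM.items() if field in config}
# Only one of the two epsilon fields is expected, depending on the normalization used
knobs["norm_eps"] = config.get("rms_norm_eps", config.get("layer_norm_eps"))

print(knobs)
print("d_ff / d_model =", knobs["d_ff"] / knobs["d_model"])  # 4.0
```
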
| Pitfall | Reality |
|---|---|
| Block versus layer | "Layer" is overloaded; usually means block, sometimes means sub-layer. Read carefully. |
| Position encoding added per block | In the original design, added once, before block 1, and carried through every block via residuals. (RoPE and ALiBi instead act inside each attention layer.) |
| FFN does cross-token mixing | No. FFN is per-token; attention is the only cross-token operation. |
| Residuals are decorative | Load-bearing. A transformer without residuals would not train past a few layers. |
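| LayerNorm equals BatchNorm | Different ops. LayerNorm normalizes each token across its features, so it works for variable-length sequences; BatchNorm normalizes each feature across the batch, so its statistics depend on the other sequences in the batch. |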
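
The FFN pitfall can be verified with a toy example. The sketch below applies a two-layer ReLU FFN to each token independently, then changes only the middle token; the weights and token values are made up for the demonstration.

```python
def ffn_token(vec, w1, w2):
    # Two-layer MLP applied to ONE token: linear -> ReLU -> linear
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]

def ffn(x, w1, w2):
    # The same weights are applied to every token separately; tokens never interact
    return [ffn_token(tok, w1, w2) for tok in x]

w1 = [[1.0, -1.0, 0.5], [0.5, 2.0, -1.0]]   # d_model=2 -> d_ff=3
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d_ff=3 -> d_model=2

seq = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
changed = [[1.0, 2.0], [9.0, 9.0], [0.5, 0.5]]  # only the middle token differs

out_a, out_b = ffn(seq, w1, w2), ffn(changed, w1, w2)
print(out_a[0] == out_b[0], out_a[2] == out_b[2], out_a[1] == out_b[1])  # True True False
```

Only the output for the edited token changes. Running the same experiment through attention would change every row, since attention is the one operation that mixes information across tokens.
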
- Block: the unit that wraps multi-head attention with FFN, residuals, and LayerNorm. A transformer is N blocks stacked.
- N: the number of blocks (model depth).
- d_model: the main embedding dimension carried into and out of every block.
- d_ff: the FFN hidden dimension; typically 4 × d_model.
- h: the number of attention heads per block.
- Position encoding: a vector added to each embedding before block 1 to give attention a sense of order.
- Feed-forward network (FFN): per-token two-layer MLP that adds nonlinearity.
- Residual connection: output = input + sub-layer(input). Enables gradient flow and preserves information.
- LayerNorm: per-token activation normalization across the feature dimension; learned scale and shift afterward.
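
To ground the position-encoding entry, here is a sketch of the sinusoidal scheme used in the original 2017 design, added once to the embeddings before block 1. It assumes an even `d_model`; the embedding values are made up, and the function name is introduced here for illustration.

```python
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones (d_model assumed even)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

embeddings = [[0.1, 0.2, 0.3, 0.4]] * 3  # three identical token embeddings (made up)
pe = sinusoidal_position_encoding(seq_len=3, d_model=4)

# Added once, before block 1
block_input = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

for row in block_input:
    print([round(v, 3) for v in row])
```

The three embeddings start out identical, yet the three rows of `block_input` differ, which is what gives attention a way to tell positions apart.
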
Attention is the engine.
The block is the machine.