Cheatsheet: The transformer block: where everything comes together

A transformer block = multi-head attention
+ position encoding (added once before block 1)
+ feed-forward network
+ residual connections (×2)
+ layer normalization (×2)

Stacked attention alone does not work. Each wrapping piece fixes a specific structural gap.

| Piece | What it adds | Gap it fixes |
| --- | --- | --- |
| Position encoding | Position-dependent vector added to each embedding | Self-attention is permutation-equivariant: it has no built-in sense of token order |
| Feed-forward network | Per-token MLP with nonlinearity | Attention output is a weighted sum, linear in the values; without a per-token nonlinearity, extra depth adds little |
| Residual connections | output = input + sub-layer(input) around each sub-layer | Vanishing gradients in deep stacks; sub-layers overwriting their input instead of refining it |
| Layer normalization | Rescale activations across features per token | Activation scale drift across deep stacks |
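
The position-encoding gap can be checked directly. The sketch below is a minimal, pure-Python illustration, assuming single-head attention with identity projections (Q = K = V = input); the function names and toy vectors are hypothetical. Shuffling the input tokens only shuffles the output rows the same way, so attention by itself cannot tell one ordering from another.

```python
import math

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity projections (assumption: Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row i of the shuffled output should match row perm[i] of the original output.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```

Adding a position-dependent vector to each embedding breaks this symmetry, because the same token then looks different at different positions.
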

| Stage | Vector |
| --- | --- |
| Input | x_0 (shape d_model) |
| After attention sub-layer (Add+Norm) | x_1 = LayerNorm(x_0 + Attention(x_0)) |
| After FFN sub-layer (Add+Norm) | x_2 = LayerNorm(x_1 + FFN(x_1)) |

Shape in equals shape out. That is what lets the next block stack on top.
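
The two Add+Norm stages can be sketched end to end. This is a simplified illustration, not a reference implementation: it assumes a single attention head, no biases, ReLU in the FFN, and LayerNorm without learned scale and shift. All helper names and the random toy weights are hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each token (row) across its features; learned scale/shift omitted."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv, wo):
    """Scaled dot-product self-attention, one head for brevity."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(matmul(weights, v), wo)

def ffn(x, w1, w2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)

def block(x0, p):
    x1 = layer_norm(add(x0, attention(x0, p["wq"], p["wk"], p["wv"], p["wo"])))
    x2 = layer_norm(add(x1, ffn(x1, p["w1"], p["w2"])))
    return x2

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, d_ff = 3, 4, 16
params = {name: rand_matrix(d_model, d_model, rng) for name in ("wq", "wk", "wv", "wo")}
params["w1"] = rand_matrix(d_model, d_ff, rng)
params["w2"] = rand_matrix(d_ff, d_model, rng)

x0 = rand_matrix(seq_len, d_model, rng)
# Real models use separate weights per block; they are reused here only for brevity.
x2 = block(block(x0, params), params)
print(len(x2), len(x2[0]))  # 3 4
```

The second call to `block` works without any reshaping because the output has the same (seq_len, d_model) shape as the input.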

| Knob | What it controls | Typical range |
| --- | --- | --- |
| Layer count (N) | Depth of the model | 6 to 100+ |
| Head count per layer (h) | Parallel attention computations per block | 8 to 32 |
| d_model | Main embedding dimension | A few hundred to several thousand |
| d_ff | FFN hidden dimension | Typically 4 × d_model |
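
To see how these knobs drive model size, here is a rough weight count. It is a back-of-the-envelope sketch under stated assumptions: standard multi-head attention with full Q, K, V, and output projections, a two-matrix FFN (not a gated variant), and no biases, norm parameters, or embedding tables. The function names are hypothetical.

```python
def block_params(d_model, d_ff):
    """Approximate weights in one block, ignoring biases and norm parameters."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff            # up-projection and down-projection
    return attention + ffn

def stack_params(n_layers, d_model, d_ff=None):
    """Weights in N stacked blocks; d_ff defaults to the common 4 x d_model."""
    if d_ff is None:
        d_ff = 4 * d_model
    return n_layers * block_params(d_model, d_ff)

print(stack_params(6, 512))    # 18874368
print(stack_params(12, 768))   # 84934656
```

Under these assumptions the head count h does not appear: each head works on d_model / h dimensions, so splitting into more heads leaves the projection sizes unchanged. With d_ff = 4 × d_model the count is 12 × d_model² per block, so size grows linearly with N and quadratically with d_model.
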

| Axis | Original (2017) | Modern variants |
| --- | --- | --- |
| Position encoding | Sinusoidal | Learned, RoPE, ALiBi |
| Normalization | LayerNorm | RMSNorm, DeepNorm |
| Norm placement | Post-LN (after sub-layer) | Pre-LN (before sub-layer); standard in modern open-source models |
| FFN activation | ReLU | GELU, SwiGLU |
| Attention head sharing | Full multi-head | MQA, GQA |
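
The norm-placement row is easiest to see in code. The sketch below contrasts the two orderings on a single token vector; it assumes LayerNorm without learned scale and shift, and uses a hypothetical `silent` sub-layer that outputs zeros so that only the residual path is visible.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln(x, sublayer):
    """Original 2017 ordering: normalize after the residual add."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    """Pre-LN ordering: normalize the sub-layer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def silent(x):
    # Hypothetical sub-layer that outputs zeros, to isolate the residual path.
    return [0.0] * len(x)

token = [1.0, 2.0, 3.0, 6.0]
print(pre_ln(token, silent))    # [1.0, 2.0, 3.0, 6.0]
print(post_ln(token, silent))   # rescaled: the input does not pass through unchanged
```

In Pre-LN the input reaches the output through a pure addition, while in Post-LN every block renormalizes the running representation.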

Why this matters in production (model-card decoder)

| Field in a model card | What it controls | Lesson term |
| --- | --- | --- |
| num_hidden_layers | Number of stacked blocks | N |
| hidden_size | Main embedding dimension | d_model |
| intermediate_size | FFN hidden dimension | d_ff |
| num_attention_heads | Heads per block | h |
| hidden_act | FFN activation function | activation (ReLU, GELU, SwiGLU) |
| rms_norm_eps / layer_norm_eps | Normalization epsilon | LayerNorm or RMSNorm parameter |
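
As a rough illustration of reading these fields, the sketch below maps a config dictionary onto the lesson terms. The values are illustrative only and not taken from any specific model card. Two things are assumptions rather than rules: the per-head dimension is computed as hidden_size / num_attention_heads (some configs state it explicitly instead), and the norm type is guessed from which epsilon key is present.

```python
# Illustrative values only; not taken from any specific model card.
config = {
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "hidden_act": "gelu",
    "layer_norm_eps": 1e-12,
}

def decode(cfg):
    """Translate model-card fields into the lesson's terms (hypothetical helper)."""
    d_model = cfg["hidden_size"]
    h = cfg["num_attention_heads"]
    return {
        "N": cfg["num_hidden_layers"],
        "d_model": d_model,
        "d_ff": cfg["intermediate_size"],
        "h": h,
        "head_dim": d_model // h,  # assumption: heads split d_model evenly
        "d_ff / d_model": cfg["intermediate_size"] / d_model,
        "activation": cfg["hidden_act"],
        # Heuristic: infer the norm type from the epsilon key that is present.
        "norm": "RMSNorm" if "rms_norm_eps" in cfg else "LayerNorm",
    }

summary = decode(config)
for key, value in summary.items():
    print(f"{key}: {value}")
```
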

| Pitfall | Reality |
| --- | --- |
| Block versus layer | "Layer" is overloaded; it usually means block, sometimes sub-layer. Read carefully. |
| Position encoding added per block | Added once, before block 1. Carried through every block via residuals. |
| FFN does cross-token mixing | No. FFN is per-token; attention is the only cross-token operation. |
| Residuals are decorative | Load-bearing. A transformer without residuals would not train past a few layers. |
| LayerNorm equals BatchNorm | Different ops. LayerNorm normalizes each token across its features, so it works for variable-length sequences; BatchNorm normalizes each feature across the batch, so it does not. |

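
The cross-token pitfall can be checked with two short sequences that share a first token. This is a toy sketch with hand-picked weights, ReLU, no biases, and identity attention projections; all names and numbers are hypothetical.

```python
import math

def ffn(token, w1, w2):
    """Two-layer MLP applied to one token vector (ReLU, no biases)."""
    hidden = [max(0.0, sum(t * w for t, w in zip(token, col))) for col in zip(*w1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w2)]

def attend(tokens):
    """Self-attention with identity projections, to keep the sketch short."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

w1 = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]]   # d_model = 2 -> d_ff = 3
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d_ff = 3 -> d_model = 2

seq_a = [[1.0, 2.0], [0.5, -1.0]]
seq_b = [[1.0, 2.0], [9.0, 9.0]]   # same first token, different second token

# The FFN sees one token at a time, so the first token's output cannot change.
ffn_same = ffn(seq_a[0], w1, w2) == ffn(seq_b[0], w1, w2)
# Attention mixes tokens, so the first token's output depends on its neighbor.
attn_same = attend(seq_a)[0] == attend(seq_b)[0]
print(ffn_same, attn_same)  # True False
```
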
  • Block: the unit that wraps multi-head attention with FFN, residuals, and LayerNorm. A transformer is N blocks stacked.
  • N: the number of blocks (model depth).
  • d_model: the main embedding dimension carried into and out of every block.
  • d_ff: the FFN hidden dimension; typically 4 × d_model.
  • h: the number of attention heads per block.
  • Position encoding: a vector added to each embedding before block 1 to give attention a sense of order.
  • Feed-forward network (FFN): per-token two-layer MLP that adds nonlinearity.
  • Residual connection: output = input + sub-layer(input). Enables gradient flow and preserves information.
  • LayerNorm: per-token activation normalization across the feature dimension; learned scale and shift afterward.
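
For the position-encoding entry, here is a minimal sketch of the sinusoidal scheme from the original 2017 design, where even feature indices use a sine and odd indices a cosine of pos / 10000^(2i / d_model). It assumes an even d_model; the function names are hypothetical.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Sinusoidal position vectors; assumes d_model is even."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the encoding once, before the first block."""
    pe = sinusoidal_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

pe = sinusoidal_encoding(seq_len=4, d_model=6)
print(pe[0])   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Each position gets a distinct vector, so identical tokens at different positions enter block 1 with different inputs.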

Attention is the engine.
The block is the machine.