The Transformer architecture and its hyperparameters

This lesson covers the model itself. Its reassuring message is that “the Transformer architecture” is one skeleton with a handful of settings to switch, not a sprawling design space. The source curriculum is Stanford CS336, Lecture 3, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube and the course site is cs336.stanford.edu.

You will learn the decoder-only skeleton and the residual stream that organizes it; the design choices modern LLMs converged on (pre-norm, RMSNorm, gated SwiGLU activations, RoPE positions, no biases, weight tying) and why; the hyperparameters that size a model; how to estimate parameter count from d_model and n_layers; and how to read a real model config in terms of all of this.
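
As a preview of the residual stream and the pre-norm choice listed above, here is a minimal sketch in plain Python. It assumes a single token vector and takes the two sublayers as placeholder callables; the names rms_norm and pre_norm_block are hypothetical, and real attention mixes information across positions, which this toy version does not model.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by its root-mean-square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(x, attn, ffn, gain_attn, gain_ffn):
    """One pre-norm block: each sublayer reads a normalized copy of the
    residual stream and adds its output back onto the unnormalized stream."""
    x = [xi + ai for xi, ai in zip(x, attn(rms_norm(x, gain_attn)))]
    x = [xi + fi for xi, fi in zip(x, ffn(rms_norm(x, gain_ffn)))]
    return x

# Placeholder sublayers (assumptions for illustration only).
toy_attn = lambda h: [0.0] * len(h)      # contributes nothing
toy_ffn = lambda h: [2.0 * v for v in h] # arbitrary elementwise map

x = [3.0, 4.0]
gain = [1.0, 1.0]
print(pre_norm_block(x, toy_attn, toy_ffn, gain, gain))
```

The point of the sketch is the shape of the update: the stream x is only ever added to, so a sublayer that outputs zeros leaves it unchanged.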

This is lesson 3 of 14, the third lesson of Phase 1 (the model). It builds the architecture that lesson 2’s cost accounting applies to, and the 12 * n_layers * d_model^2 parameter formula here connects the design directly to FLOPs and memory. The next lesson varies this skeleton’s sublayers (attention alternatives and mixture of experts), closing Phase 1; the scaling-laws lesson later decides how to set these hyperparameters.

Prerequisites: lesson 2 (the FLOP and memory accounting this lesson ties parameter count to). You should already know what attention and a feed-forward layer are (from Track 5, Track 13, or equivalent); this lesson assumes those and focuses on how a real LLM is assembled and why. Comfort with the cost vocabulary from lesson 2 (d_model, parameters, FLOPs) helps.

The math is light. The architecture is described structurally, and the only calculation is the parameter-count estimate (12 * n_layers * d_model^2), which is plain multiplication. There are no derivations of attention or backpropagation; those are assumed background.
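
The estimate can be sketched in a few lines of Python. The function name estimate_params and the example sizes are assumptions chosen for illustration. The comments use one common accounting for the factor 12 (four d_model x d_model attention matrices plus a feed-forward sublayer worth 8 * d_model^2); the estimate ignores embeddings, norms, and biases.

```python
def estimate_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count: 12 * n_layers * d_model^2.

    Per layer, under the usual accounting:
      attention: Q, K, V, and output projections -> 4 * d_model^2
      feed-forward sublayer                      -> 8 * d_model^2
    """
    return 12 * n_layers * d_model ** 2

# Illustrative sizes (assumed, not taken from the lesson).
n_layers, d_model = 12, 768
total = estimate_params(n_layers, d_model)
print(f"{total:,} parameters (about {total / 1e6:.0f}M)")
```

With these sizes the estimate comes to 84,934,656, or roughly 85 million parameters before embeddings are counted.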

The single capability this lesson builds: describe the architectural choices and hyperparameters that define a Transformer, and how they trade off. Concretely, you will be able to:

  • Describe the decoder-only Transformer skeleton and the residual stream
  • Name the design choices modern LLMs converged on and why
  • List the hyperparameters that size a Transformer
  • Estimate parameter count from d_model and n_layers
  • Read a real model config in terms of these choices and hyperparameters

  • Read time: about 14 minutes
  • Practice time: about 12 minutes (estimate a model’s parameters and read a config, plus flashcards)
  • Difficulty: deep (Stage C; assumes attention/FFN background, focuses on assembly and sizing)
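
Reading a config in these terms can be sketched as follows. The config below is hypothetical: the field names and values are made up for illustration and do not come from any real model, and real configs name these fields differently. The sketch derives the per-head dimension, the feed-forward width ratio, the 12 * n_layers * d_model^2 block estimate, and the embedding parameters (counted once if weights are tied, twice otherwise).

```python
# Hypothetical config; names and values are illustrative assumptions.
config = {
    "d_model": 2048,
    "n_layers": 16,
    "n_heads": 16,
    "d_ff": 5504,
    "vocab_size": 32000,
    "tie_embeddings": True,
}

def summarize(cfg):
    """Translate raw config fields into the quantities this lesson uses."""
    d = cfg["d_model"]
    # Weight tying shares one matrix between input embedding and output head.
    embed_copies = 1 if cfg["tie_embeddings"] else 2
    return {
        "head_dim": d // cfg["n_heads"],
        "ffn_ratio": cfg["d_ff"] / d,
        "block_params": 12 * cfg["n_layers"] * d ** 2,
        "embed_params": embed_copies * cfg["vocab_size"] * d,
    }

for name, value in summarize(config).items():
    print(f"{name}: {value}")
```

In this made-up example the feed-forward ratio is about 2.69, close to the 8/3 that gated feed-forward layers often use, and the block estimate (about 805M) dominates the tied embedding matrix (about 66M).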