The Transformer architecture: brief

What you’ll learn

This lesson covers the model itself. Its reassuring message is that “the Transformer architecture” is one skeleton with a few switched settings, not a sprawling design space. The source curriculum is Stanford CS336, Lecture 3, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube, and the course site is cs336.stanford.edu.

You will learn the decoder-only skeleton and the residual stream that organizes it; the design choices modern LLMs converged on (pre-norm, RMSNorm, gated SwiGLU activations, RoPE positions, no biases, weight tying) and why; the hyperparameters that size a model; how to estimate parameter count from d_model and n_layers; and how to read a real model config in terms of all of this.
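To make the residual stream and pre-norm ordering concrete, here is a minimal sketch in plain Python. It is not the course's implementation: the helper names (rms_norm, pre_norm_block) and the toy sublayers standing in for attention and the feed-forward network are assumptions made for illustration only.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then scale by a learned gain.
    Unlike LayerNorm, there is no mean subtraction and no bias term."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(x, sublayers, gains):
    """Pre-norm residual block: for each sublayer f, x = x + f(norm(x)).
    The residual stream x is only ever added to, never normalized in place."""
    for f, gain in zip(sublayers, gains):
        update = f(rms_norm(x, gain))
        x = [a + b for a, b in zip(x, update)]
    return x

# Hypothetical toy stand-ins for the attention and feed-forward sublayers.
def toy_attention(h):
    return [0.5 * v for v in h]

def toy_ffn(h):
    return [max(0.0, v) for v in h]

x = [1.0, -2.0, 3.0, 0.5]           # one token's residual-stream vector
ones = [1.0] * len(x)               # gains initialised to 1
y = pre_norm_block(x, [toy_attention, toy_ffn], [ones, ones])
```

Each sublayer reads a normalized copy of the stream and writes its output back by addition, which is what "pre-norm" means in this sketch.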

Where this fits

This is lesson 3 of 14, the third lesson of Phase 1 (the model). It builds the architecture that lesson 2’s cost accounting applies to, and the 12 * n_layers * d_model^2 parameter estimate here connects the design directly to FLOPs and memory. The next lesson varies this skeleton’s sublayers (attention alternatives and mixture of experts), closing Phase 1; the later scaling-laws lesson covers how to set these hyperparameters.

Before you start

Prerequisites: lesson 2 (the FLOP and memory accounting this lesson ties parameter count to). You should already know what attention and a feed-forward layer are (from Track 5, Track 13, or equivalent); this lesson assumes those and focuses on how a real LLM is assembled and why. Comfort with the cost vocabulary from lesson 2 (d_model, parameters, FLOPs) helps.

About the math

Light. The architecture is described structurally, and the only calculation is the parameter-count estimate (12 * n_layers * d_model^2), which is plain multiplication. There are no derivations of attention or backpropagation; those are assumed background.
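As a rough illustration of that one calculation, the sketch below evaluates the estimate for a hypothetical model size. The split of 12 into 4 (attention) plus 8 (feed-forward) is a common accounting convention assumed here, not something this brief spells out, and the estimate leaves out embeddings and normalization parameters.

```python
def estimate_params(n_layers, d_model):
    """Approximate per-layer weight count, summed over layers.
    Assumes standard multi-head attention and a feed-forward width chosen so the
    FFN holds about 8 * d_model^2 weights (for example d_ff = 4 * d_model with two
    matrices, or a gated FFN with d_ff = 8/3 * d_model and three matrices)."""
    attn = 4 * d_model ** 2   # query, key, value, and output projections
    ffn = 8 * d_model ** 2    # feed-forward matrices under the assumption above
    return n_layers * (attn + ffn)   # equals 12 * n_layers * d_model^2

# Hypothetical sizing, chosen only to show the arithmetic.
n = estimate_params(n_layers=32, d_model=4096)
print(f"{n:,}")  # 6,442,450,944
```

Because the estimate is quadratic in d_model and linear in n_layers, doubling the width quadruples the count while doubling the depth only doubles it.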

By the end, you’ll be able to

The single capability this lesson builds: describe the architectural choices and hyperparameters that define a Transformer, and how they trade off. Concretely, you will be able to:

Describe the decoder-only Transformer skeleton and the residual stream
Name the design choices modern LLMs converged on and why
List the hyperparameters that size a Transformer
Estimate parameter count from d_model and n_layers
Read a real model config in terms of these choices and hyperparameters
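The last two capabilities can be sketched together. The config below is hypothetical: its field names and values are invented for illustration and do not follow any particular library's schema. The describe helper is likewise an assumption; it simply reads the sizing hyperparameters and design switches back out.

```python
# Hypothetical config; field names and values are illustrative only.
config = {
    "d_model": 4096,        # width of the residual stream
    "n_layers": 32,         # number of Transformer blocks
    "n_heads": 32,          # attention heads per block
    "d_ff": 11008,          # feed-forward hidden width
    "vocab_size": 32000,
    "norm": "rmsnorm",
    "norm_position": "pre",
    "activation": "swiglu",
    "positions": "rope",
    "bias": False,
    "tie_embeddings": False,
}

def describe(cfg):
    """Summarise a config in terms of sizing and design choices."""
    return {
        "head_dim": cfg["d_model"] // cfg["n_heads"],
        # A gated FFN is often sized near 8/3 of d_model rather than 4x.
        "ffn_ratio": cfg["d_ff"] / cfg["d_model"],
        # Estimate from the lesson's formula; excludes embeddings.
        "approx_params": 12 * cfg["n_layers"] * cfg["d_model"] ** 2,
        "choices": [cfg["norm_position"] + "-norm", cfg["norm"],
                    cfg["activation"], cfg["positions"]],
    }

summary = describe(config)
print(summary["head_dim"], summary["ffn_ratio"], f"{summary['approx_params']:,}")
```

Reading a config this way separates the handful of numbers that size the model from the switches that select among the converged design choices.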

Time and difficulty

Read time: about 14 minutes
Practice time: about 12 minutes (estimate a model’s parameters and read a config, plus flashcards)
Difficulty: deep (Stage C; assumes attention/FFN background, focuses on assembly and sizing)