
Summary: assembling and training the full GPT

TL;DR. One self-attention computation is not a transformer. This lesson assembles the full GPT: run several attention heads in parallel (multi-head attention), pair them with a per-token feed-forward layer to make a transformer block (“talk, then think”), make a deep stack of blocks trainable with residual connections and layer normalization, give the order-blind model a sense of sequence with positional embeddings, and finish with a linear head followed by a softmax. Train it with next-token cross-entropy and generate autoregressively. This is the same basic architecture behind modern GPT-style large language models.

  • Multi-head attention. Several attention heads run in parallel, each with its own query, key, and value projections, so each can track a different relationship; their outputs are concatenated and passed through a final linear projection. With width 64 and 8 heads, each head works in 64 / 8 = 8 dimensions, and the 8 head outputs concatenate back to 64. Heads are parallel; depth comes from stacking blocks.

  • The transformer block: communication then computation. Multi-head self-attention lets tokens gather context from each other; a per-token feed-forward MLP lets each token process what it gathered. This pair is the unit repeated to make the model deep.

  • Residuals and layer norm make the depth trainable. A residual connection (x + sublayer(x)) gives the gradient a highway back to early blocks, because addition passes gradients through unchanged, which counters vanishing gradients; layer normalization keeps each token’s representation in a trainable range. In practice, a deep transformer does not train reliably without both.

  • Positional embeddings give a sense of order. Attention is order-blind (a weighted sum sees a set), so a learned position embedding is added to each token embedding, encoding “this token, at this position.”

  • The full GPT, trained on next-token prediction. The pipeline is: token + position embeddings, a stack of blocks, a final layer norm, a linear head to vocabulary logits, and a softmax. The shape stays (T, width) through every block (which is why blocks stack) and only changes at the final projection to (T, vocab). The model is trained with cross-entropy loss, whose gradient with respect to the logits is p - y, and it generates text autoregressively, one token at a time.
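The pieces in the list above can be put together in a short NumPy sketch of the forward pass. This is an illustration under stated assumptions, not the lesson's own code: the weights are random rather than trained, the layer norm omits its learned scale and shift, and the causal mask, the pre-norm placement of layer norm, the ReLU feed-forward layer with a 4x hidden size, greedy decoding, and the sizes vocab = 100 and context = 8 are all choices made for this example. Every function and parameter name here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_heads, n_blocks = 64, 8, 2  # width 64 and 8 heads, as in the summary
vocab, context = 100, 8              # hypothetical sizes chosen for this sketch
head_dim = width // n_heads          # each head works in 64 / 8 = 8 dimensions

def init(*shape):
    # Small random weights; a stand-in for trained parameters.
    return rng.normal(0.0, 0.02, size=shape)

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector (learned scale and shift omitted here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, p):
    t = x.shape[0]
    def split(m):  # (t, width) -> (n_heads, t, head_dim)
        return m.reshape(t, n_heads, head_dim).transpose(1, 0, 2)
    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (n_heads, t, t)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)     # assumed causal mask
    weights = softmax(np.where(future, -np.inf, scores))
    heads = weights @ v                                    # (n_heads, t, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(t, width)    # back to (t, width)
    return concat @ p["Wo"]

def feed_forward(x, p):
    # Per-token MLP with a ReLU; the 4x hidden size is an assumption.
    return np.maximum(0.0, x @ p["W1"]) @ p["W2"]

def block(x, p):
    x = x + multi_head_attention(layer_norm(x), p)  # communicate ("talk")
    x = x + feed_forward(layer_norm(x), p)          # compute ("think")
    return x

def make_block():
    p = {name: init(width, width) for name in ("Wq", "Wk", "Wv", "Wo")}
    p["W1"], p["W2"] = init(width, 4 * width), init(4 * width, width)
    return p

params = {
    "tok": init(vocab, width),    # token embedding table
    "pos": init(context, width),  # learned position embedding table
    "blocks": [make_block() for _ in range(n_blocks)],
    "head": init(width, vocab),   # linear head to vocabulary logits
}

def gpt_forward(tokens, params):
    t = len(tokens)
    x = params["tok"][tokens] + params["pos"][:t]  # token + position, (t, width)
    for p in params["blocks"]:
        x = block(x, p)
        assert x.shape == (t, width)  # shape is preserved, so blocks stack
    return layer_norm(x) @ params["head"]  # (t, vocab) logits

def generate(tokens, params, n_new):
    # Greedy autoregressive decoding: append the most likely next token.
    tokens = list(tokens)
    for _ in range(n_new):
        logits = gpt_forward(np.array(tokens[-context:]), params)
        tokens.append(int(np.argmax(logits[-1])))
    return tokens

tokens = np.array([3, 14, 15, 92, 65])
targets = np.array([14, 15, 92, 65, 35])  # each position predicts the next token
logits = gpt_forward(tokens, params)
probs = softmax(logits)
rows = np.arange(len(tokens))
loss = -np.log(probs[rows, targets]).mean()  # next-token cross-entropy
grad_logits = probs.copy()
grad_logits[rows, targets] -= 1.0            # p - y
grad_logits /= len(tokens)                   # the loss is a mean over positions
print(logits.shape, round(float(loss), 3))
```

With untrained weights the loss should sit near log(vocab), the value for a uniform guess over the vocabulary. Because of the assumed causal mask, changing a later token leaves the logits at earlier positions unchanged, which is what lets one forward pass supply a training signal at every position.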

A large language model is no longer a black box. It is the synthesis of the whole track: an autograd engine computing gradients, a training loop nudging parameters downhill, embeddings turning symbols into vectors, attention routing information between them, residuals and normalization keeping a deep stack alive, and a softmax predicting the next token. You have now built every one of these pieces from nothing. The model behind a commercial chatbot is essentially this architecture, only larger (more blocks and heads), with a bigger token vocabulary, trained on far more text, and finished with a fine-tuning stage. The skeleton is the same. One piece has been assumed throughout: that text arrives already split into tokens. The final lesson builds the tokenizer, the component that turns raw text into the token IDs a GPT consumes and back again.