References: The transformer block: where everything comes together

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025, Lecture 1
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  YouTube: https://www.youtube.com/watch?v=Ub3GoFaUcds
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  License (lecture video): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
Clawdemy provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lecture remain with Stanford and
the instructors.

Going deeper

A short list, chosen for durability. Each link is a specific next step, not a generic “learn more.”

The Illustrated Transformer by Jay Alammar. The canonical visual treatment of the full transformer block on the public web. Where this lesson kept the math minimal, Alammar draws every matrix shape and walks the residual paths in color. Read this if you want to see every box of the block drawn out rather than described.
The Annotated Transformer from Harvard NLP. Reproduces the original Vaswani paper’s architecture in PyTorch alongside the paper’s text. The fastest path from this lesson to runnable code. Especially useful for the Add+Norm and FFN sub-layers, where reading the implementation makes the order-of-operations obvious.
3Blue1Brown’s transformer series. Grant Sanderson covers the full block as part of his neural-networks playlist; the transformer-architecture episode and the multi-head attention episode are both worth watching after this lesson. Episode numbering in the playlist has shifted over time; sort by date or scan the titles to find the transformer entries.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The transformer-architecture page is the densest one-page reference for the full block; useful as a single-sheet recall aid during the practice section.
A Survey of Transformers by Lin et al., 2021. A taxonomy of transformer variants up through 2021, organized by which sub-layer is varied. Useful for the structural map even though specific variants have moved on; read it after this lesson for a tour of how researchers have substituted on each of the boxes covered here.
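The Add+Norm order of operations mentioned in the Annotated Transformer entry above can be sketched in a few lines. This is a minimal illustration, not the Annotated Transformer's code: it follows the post-norm form stated in the Vaswani paper, LayerNorm(x + Sublayer(x)), on a single token vector, omits LayerNorm's learned gain and bias, and uses a hypothetical toy_ffn as a stand-in for a real sub-layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance.
    Learned gain and bias are omitted to keep the sketch short."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Post-norm sub-layer wrapper: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)                               # 1. run the sub-layer
    residual = [a + b for a, b in zip(x, y)]      # 2. add the residual input
    return layer_norm(residual)                   # 3. normalize the sum

def toy_ffn(x):
    # Hypothetical stand-in for attention or the FFN: elementwise ReLU(2v - 1).
    return [max(0.0, 2.0 * v - 1.0) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, toy_ffn)
print(out)
```

The sub-layer runs first, the input is added back, and normalization is applied last. Some implementations move the normalization before the sub-layer instead; the wrapper is the only place that choice shows up.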

Adjacent topics

Topics that build on or sit beside this one.

What is NOT in a vanilla transformer block. Modern open-source models add several pieces this lesson does not cover: KV caching for inference, sliding-window or sparse attention for long contexts, FlashAttention for memory efficiency, and mixture-of-experts (MoE) feed-forward variants. Each is its own optimization on top of the basic block; understanding this lesson is the prerequisite for understanding any of them.
Training, prompting, RLHF. This lesson covers the architecture; turning a trained transformer into a useful assistant involves separate machinery (instruction tuning, reinforcement learning from human feedback, prompt design, system prompts). Those are tracks of their own, outside the scope of the Stanford POC track.
Encoder versus decoder versus encoder-decoder. The original transformer paper had both an encoder and a decoder; modern decoder-only models (GPT-style) drop the encoder. The block-level mechanics this lesson covers are the same; the difference is in how blocks are stacked and which attention masks are applied. Worth a future lesson on its own.
Where to go next. This lesson closes the Stanford POC track. You now have the architecture in your head; the natural follow-on tracks are the ones that operate on top of it: how to actually use AI in real workflows (Track 1 on Clawdemy), how to build agents (Track 3, planned), and how to use AI without trading away your data (Track 6: Privacy & Local-First AI, planned). The architecture stops being the bottleneck once you understand it; the next bottleneck is workflow.

Original sources

The primary sources for the architecture this lesson covers.

“Attention Is All You Need”, Vaswani et al., NeurIPS 2017. The transformer paper. Section 3 lays out the full block (multi-head attention, feed-forward, residuals, LayerNorm, sinusoidal position encoding); Figure 1 is the architecture diagram this lesson’s practice section asked you to annotate. One of the most-cited papers in modern AI; read it after this lesson and almost every diagram will be familiar.
“Layer Normalization”, Ba, Kiros, Hinton, 2016. The LayerNorm paper. Predates the transformer by a year and was originally proposed for recurrent networks; the transformer adopted it because it works for variable-length sequences. Read for the comparison with BatchNorm.
“Deep Residual Learning for Image Recognition”, He et al., 2016. The ResNet paper. Introduced residual connections for very deep image networks; the transformer borrowed the idea wholesale. Arguably the most important “make deep networks trainable” insight of the deep learning era.
“Root Mean Square Layer Normalization”, Zhang and Sennrich, 2019. The RMSNorm paper. Argues that the mean-centering step in LayerNorm is dispensable and that rescaling by the root mean square alone is enough; RMSNorm is now standard in many large open-source models because it is computationally cheaper.
“RoFormer: Enhanced Transformer with Rotary Position Embedding”, Su et al., 2021. The RoPE paper. Proposes rotary position embeddings as a replacement for sinusoidal position encoding; widely adopted by recent open-source transformers because it encodes relative position directly in the query-key dot product and tends to adapt better to sequence lengths beyond what was seen during training.
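The difference between LayerNorm and RMSNorm described in the list above comes down to one step, and a small sketch makes it visible. This is an illustration under simplifying assumptions, not either paper's reference code: it works on a single token vector, leaves out the learned gain (and LayerNorm's bias), and places eps inside the square root, which is a common implementation choice rather than something the papers prescribe.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned parameters: subtract the mean, then rescale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without learned gain: rescale by the root mean square only.
    The mean is never computed or subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)   # centered: entries sum to about zero
rn = rms_norm(x)     # not centered: entries keep their original sign

# On an input that already has zero mean, the two functions agree.
x0 = [-1.5, -0.5, 0.5, 1.5]
ln0 = layer_norm(x0)
rn0 = rms_norm(x0)

print(ln, rn)
```

RMSNorm skips the mean and the subtraction, which is where the lower cost comes from; on zero-mean inputs the two normalizations produce the same vector.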

Community discussion

None selected for this lesson. The public discussion of the full transformer block has consolidated into the Alammar visual post, the Annotated Transformer, and the academic papers above. If a canonical thread surfaces, it will be added at the next quarterly review.