How transformers turn input into output: encoder-decoder and T5's span corruption
What you’ll learn
This is lesson 7 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The previous lessons covered architectural updates to the 2017 transformer: position embeddings (RoPE), normalization (pre-norm and RMSNorm), and attention efficiency (sliding windows, GQA). Course materials are at cme295.stanford.edu.
This lesson opens a new arc on the kinds of transformer-based architectures the field has built. We start with encoder-decoder transformers, of which the original 2017 architecture is one, and then walk through the T5 family: T5, mT5 (multilingual), and byT5 (byte-level UTF-8 input, where each character takes 1-4 bytes). Next we build the span corruption pretraining objective through the lecturer’s teddy-bear example: chunks of the input are masked with sentinel tokens, and the decoder reconstructs them. Finally, we cover the lecturer’s framing of why the field eventually moved from encoder-decoder toward decoder-only architectures, which most modern LLMs use. The lecturer is brief on T5; we stay within that brevity.
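To make the span corruption idea concrete before the full walk-through, here is a minimal sketch in Python. It rests on several assumptions: the sentence is a made-up stand-in rather than the lecturer’s exact teddy-bear example, the function name `span_corrupt` is hypothetical, the spans are picked by hand (real pretraining samples them randomly), and splitting on whitespace stands in for a real subword tokenizer. The sentinel names follow the `<extra_id_N>` pattern used in released T5 vocabularies.

```python
def span_corrupt(tokens, spans):
    """Mask each span with a sentinel and build the matching decoder target.

    tokens: list of string tokens.
    spans:  sorted, non-overlapping (start, end) pairs, end exclusive.
            Hand-picked here; this is an illustrative assumption.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        # Encoder input: keep the text before the span, then one sentinel.
        corrupted.extend(tokens[cursor:start])
        corrupted.append(sentinel)
        # Decoder target: the same sentinel, then the tokens it replaced.
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])
    # A final sentinel marks the end of the target sequence.
    target.append(f"<extra_id_{len(spans)}>")
    return corrupted, target


# Whitespace split stands in for a real tokenizer (assumption).
tokens = "my teddy bear is cute and fluffy".split()
corrupted, target = span_corrupt(tokens, [(1, 3), (5, 6)])

print(" ".join(corrupted))  # my <extra_id_0> is cute <extra_id_1> fluffy
print(" ".join(target))     # <extra_id_0> teddy bear <extra_id_1> and <extra_id_2>
```

Note the contrast with next-token prediction: the decoder here only has to produce the masked chunks, each introduced by its sentinel, rather than every token of the sequence.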
Where this fits
This is lesson 7 of Phase 2, How models think: the transformer architecture, and the opener of the architectural-variants arc. The previous lesson covered attention efficiency tricks (sliding windows, MQA, GQA). The next two lessons split BERT across two passes: “BERT, part one: architecture” and “BERT, part two: pretraining and fine-tuning”. The phase closes with “BERT derivatives: DistilBERT and RoBERTa”. Together those four lessons cover the encoder-only branch of the architectural tree, just as this lesson covers the encoder-decoder branch.
Before you start
Prerequisites: the transformer block lesson is required. We assume you understand what an encoder is, what a decoder is, what cross-attention does, and what next-token prediction looks like. The decoding lesson in Phase 5 is useful additional context for the contrast with span corruption, but is not required here.
By the end, you’ll be able to
- Identify the encoder-decoder architecture from the original 2017 transformer and explain what each half does, including the cross-attention link from decoder to encoder
- Name the three members of the T5 family (T5, mT5, byT5) and what makes each distinct, including the tradeoff behind byT5’s UTF-8 byte-level tokenization (1-4 bytes per character)
- Walk through the span corruption pretraining objective on a worked example and explain how it differs from next-token prediction
- Explain the lecturer’s framing of why the field eventually moved to decoder-only architectures despite T5’s strengths
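As a quick preview of the byte-level point in the objectives above, this small Python sketch shows how UTF-8 turns characters into 1 to 4 bytes each. The helper name `utf8_byte_values` is hypothetical, and the raw byte values shown are not necessarily the token ids a byT5 vocabulary assigns; the sketch only illustrates the encoding and the resulting sequence length.

```python
def utf8_byte_values(text):
    """Return the UTF-8 bytes of `text` as a list of integers (0-255)."""
    return list(text.encode("utf-8"))


# One character from each UTF-8 length class: 1, 2, 3, and 4 bytes.
for ch in ["a", "é", "中", "🧸"]:
    values = utf8_byte_values(ch)
    print(repr(ch), len(values), values)

# The cost of byte-level input: sequences get longer.
phrase = "teddy bear"
print(len(phrase.split()), "words ->", len(utf8_byte_values(phrase)), "bytes")
```

A byte-level model needs no learned subword vocabulary and can read any text, but it has to process more positions for the same input, which is the tradeoff the objective refers to.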
Time and difficulty
- Read time: about 18 minutes
- Practice time: about 12 minutes (a span-corruption walk-through on a small example, plus a comparison of T5’s training objective with next-token prediction)
- Difficulty: standard