Encoder-decoder, T5, span corruption: brief

What you’ll learn

This is lesson 7 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The previous lessons covered architectural updates to the 2017 transformer: position embeddings (RoPE), normalization (pre-norm and RMSNorm), and attention efficiency (sliding windows, GQA). Course materials are at cme295.stanford.edu.

This lesson opens a new arc on the kinds of transformer-based architectures the field has built. We start with encoder-decoder transformers, of which the original 2017 architecture is one. We then walk through the T5 family: T5, mT5 (multilingual), and ByT5 (byte-level input, where UTF-8 encodes each character as 1-4 bytes). Next we build the span corruption pretraining objective through the lecturer’s teddy-bear example: chunks of the input are replaced with sentinel tokens, and the decoder reconstructs the missing chunks. Finally, we cover the lecturer’s framing of why the field eventually moved from encoder-decoder toward decoder-only architectures, which most modern LLMs use. The lecturer is brief on T5; we stay within that brevity.
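As a rough preview of span corruption, here is a minimal sketch in Python. It rests on several assumptions: the sentence is made up for illustration and is not necessarily the lecturer's teddy-bear example, the spans are chosen by hand rather than sampled at random, tokens are whitespace-split words rather than subword pieces, and the helper name span_corrupt is hypothetical. The sentinel names follow the <extra_id_N> convention used by common T5 tokenizers.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns (encoder_input, decoder_target). Assumes spans do not overlap.
    """
    encoder_input, decoder_target = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        # Encoder side: keep the text before the span, then one sentinel.
        encoder_input.extend(tokens[cursor:start])
        encoder_input.append(sentinel)
        # Decoder side: the same sentinel, followed by the hidden tokens.
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[start:end])
        cursor = end
    encoder_input.extend(tokens[cursor:])
    # A final sentinel marks the end of the target sequence.
    decoder_target.append(f"<extra_id_{len(spans)}>")
    return encoder_input, decoder_target


# Illustrative sentence (an assumption, not taken from the lecture).
tokens = "my teddy bear is cute and reads books".split()
enc, dec = span_corrupt(tokens, [(1, 3), (5, 6)])
print(" ".join(enc))  # my <extra_id_0> is cute <extra_id_1> reads books
print(" ".join(dec))  # <extra_id_0> teddy bear <extra_id_1> and <extra_id_2>
```

Note the contrast with next-token prediction: the decoder target contains only the masked spans, not the whole sequence, and the encoder sees the unmasked context on both sides of each gap.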

Where this fits

This is lesson 7 of Phase 2, How models think: the transformer architecture, and the opener of the architectural-variants arc. The previous lesson covered attention efficiency tricks (sliding windows, MQA, GQA). The next two lessons split BERT across two passes: “BERT, part one: architecture” and “BERT, part two: pretraining and fine-tuning”. The phase closes with “BERT derivatives: DistilBERT and RoBERTa”. Together, those lessons cover the encoder-only branch of the architectural tree, just as this lesson covers the encoder-decoder branch.

Before you start

Prerequisites: the transformer block lesson is required. We assume you understand what an encoder is, what a decoder is, what cross-attention does, and what next-token prediction looks like. The decoding lesson in Phase 5 is useful additional context for the contrast with span corruption, but is not required here.

By the end, you’ll be able to

Identify the encoder-decoder architecture from the original 2017 transformer and explain what each half does, including the cross-attention link from decoder to encoder
Name the three members of the T5 family (T5, mT5, ByT5) and what makes each distinct, including the tradeoff of ByT5’s UTF-8 byte-level tokenizer (1-4 bytes per character)
Walk through the span corruption pretraining objective on a worked example and explain how it differs from next-token prediction
Explain the lecturer’s framing of why the field eventually moved to decoder-only architectures despite T5’s strengths
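The "1-4 bytes per character" point in the second objective can be checked directly with Python's built-in UTF-8 encoder. This is only a sketch of the underlying encoding, not of any particular tokenizer: a real byte-level tokenizer may shift or reserve some IDs for special tokens, and the helper names below are hypothetical.

```python
def bytes_per_char(text):
    """Return (character, UTF-8 byte count) pairs for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


def byte_ids(text):
    """Return the raw UTF-8 byte values (0-255) for the whole string."""
    return list(text.encode("utf-8"))


# Four characters written as escapes so the code points are unambiguous:
# 'a' (U+0061), 'e' with acute accent (U+00E9),
# a CJK character (U+718A), and the teddy bear emoji (U+1F9F8).
sample = "a\u00e9\u718a\U0001F9F8"

print([n for _, n in bytes_per_char(sample)])  # [1, 2, 3, 4]
print(len(sample), len(byte_ids(sample)))      # 4 characters, 10 bytes
```

The tradeoff shows up in the last line: a byte-level model needs no learned vocabulary and can represent any text, but it sees a longer sequence than a character count suggests, and longer still compared with subword tokens.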

Time and difficulty

Read time: about 18 minutes
Practice time: about 12 minutes (a span-corruption walk-through on a small example, plus a comparison of T5’s training objective with next-token prediction)
Difficulty: standard