References: How transformers turn input into output: encoder-decoder and T5's span corruption

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks):
    https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the encoder-decoder + T5 section of Stanford CME 295
Lecture 2 (~3760s-4090s). The lecturer is brief on T5; we stay within
that brevity. The next two lessons in this lecture (BERT, BERT
derivatives) cover the encoder-only branch of the architectural tree.
Clawdemy provides original notes, summaries, and quizzes derived from
this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

Going deeper

A short list, chosen for durability.

“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, Raffel et al., 2020. The T5 paper. The text-to-text framing, the architectural choices, the C4 pretraining corpus, and the empirical comparisons across many NLP benchmarks. Long but readable; sections 1, 2, and 3 are the conceptual core.
“mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer”, Xue et al., 2021. The mT5 paper. Extends T5 to 101 languages. The vocabulary and data-collection sections are the meaningful changes from vanilla T5; the architecture is otherwise identical.
“ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models”, Xue et al., 2022. The ByT5 paper. Argues for byte-level modeling and shows the trade-offs (longer sequences, but no tokenizer-related artifacts).
“Attention Is All You Need”, Vaswani et al., 2017. The original transformer paper, which introduced the encoder-decoder shape that this lesson recaps. Section 3 covers the architecture; section 5 covers the training setup for the machine-translation experiments.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The transformer-based models section gives a one-page reference for the same material in their dense visual style.
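
The ByT5 trade-off noted above (longer sequences, but no tokenizer) can be seen with plain UTF-8 encoding. The sketch below is illustrative only: the helper names are hypothetical, the whitespace split is a crude stand-in for a subword tokenizer, and it ignores any special-token handling a real byte-level model would add.

```python
# Illustrative sketch: byte-level input versus a naive word split.
# Assumption: whitespace splitting stands in for a real subword
# tokenizer, whose token counts would differ.

def byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values of text (each in 0..255)."""
    return list(text.encode("utf-8"))

def whitespace_tokens(text: str) -> list[str]:
    """Hypothetical tokenizer stand-in: split on whitespace."""
    return text.split()

sample = "Übersetzung ist schwer"
ids = byte_ids(sample)
words = whitespace_tokens(sample)

print(len(words))  # 3 word-level tokens
print(len(ids))    # 23 byte-level positions ("Ü" takes two bytes)

# The byte sequence round-trips without any vocabulary file.
assert bytes(ids).decode("utf-8") == sample
```

The byte sequence is several times longer than the word-level one, which is the cost the paper weighs against never needing a vocabulary or an unknown-token fallback.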

Adjacent topics

Topics that build on or sit beside this one.

The C4 corpus and pretraining data quality. T5 was trained on the Colossal Clean Crawled Corpus (C4), which became one of the more heavily cited pretraining datasets and seeded a body of work on data curation. Search terms: “C4 dataset,” “pretraining data quality,” “data deduplication for LLMs.”
Encoder-decoder for translation specifically. Translation is the task the encoder-decoder transformer was originally designed for. Modern translation systems often still use encoder-decoder transformers (or at least encoder-decoder fine-tunings of decoder-only models). Search terms: “M2M100,” “NLLB (No Language Left Behind),” “machine translation transformer.”
Decoder-only generative models and their family tree. This lesson sets up the contrast with the decoder-only architectures that the field has consolidated on. GPT-style models (GPT-2, GPT-3, GPT-Neo, LLaMA, Mistral, and others) are all decoder-only. The text generation lesson (in our Lecture 3 adaptation) covers what decoder-only looks like at runtime.
Where to go next. The next lesson in this lecture covers BERT: the encoder-only branch of the architectural tree. The lesson after that covers BERT’s derivatives: DistilBERT (distillation as a compression technique) and RoBERTa (training improvements over the original BERT recipe).

Original sources

The primary papers, in chronological order.

“Attention Is All You Need”, Vaswani et al., 2017. The original encoder-decoder transformer.
“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, Raffel et al., 2020. T5.
“mT5”, Xue et al., 2021. Multilingual T5.
“ByT5”, Xue et al., 2022. Byte-level T5.

Community discussion

None selected for this lesson. The encoder-decoder vs decoder-only architectural choice has been extensively discussed in practitioner blogs and academic literature, but individual threads tend to go stale quickly, so none is pinned here. Durable references will be added at a future quarterly review if any emerge.