
References: How transformers turn input into output: encoder-decoder and T5's span corruption

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 2, Transformer-based models & tricks):
https://www.youtube.com/watch?v=yT84Y5zCnaA
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the encoder-decoder + T5 section of Stanford CME 295
Lecture 2 (~3760s-4090s). The lecturer is brief on T5; we stay within
that brevity. The next two lessons in this lecture (BERT, BERT
derivatives) cover the encoder-only branch of the architectural tree.
Clawdemy provides original notes, summaries, and quizzes derived from
this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

A short list, chosen for durability.

Topics that build on or sit beside this one.

  • The C4 corpus and pretraining data quality. T5 was trained on the Colossal Clean Crawled Corpus (C4), which became one of the more heavily cited pretraining datasets and seeded a body of work on data curation. Search terms: “C4 dataset,” “pretraining data quality,” “data deduplication for LLMs.”

  • Encoder-decoder for translation specifically. Translation is the application the encoder-decoder architecture was originally designed for. Modern translation systems often still use encoder-decoder transformers (or at least encoder-decoder fine-tunings of decoder-only models). Search terms: “M2M100,” “NLLB (No Language Left Behind),” “machine translation transformer.”

  • Decoder-only generative models, family tree. This lesson sets up the contrast with the decoder-only architectures that the field has consolidated on. GPT-style models (GPT-2, GPT-3, GPT-Neo, LLaMA, Mistral, and others) are all decoder-only. The text generation lesson (in our Lecture 3 adaptation) covers what decoder-only looks like at runtime.

  • Where to go next. The next lesson in this lecture covers BERT: the encoder-only branch of the architectural tree. The lesson after that covers BERT’s derivatives: DistilBERT (distillation as a compression technique) and RoBERTa (training improvements over the original BERT recipe).

The primary papers, in chronological order.

None selected for this lesson. The encoder-decoder vs decoder-only architectural choice has been extensively discussed in practitioner blogs and academic literature, but it is hard to pick specific pieces that will not go out of date quickly. Durable references will be added at a future quarterly review if any consolidate.