Summary: How transformers turn input into output: encoder-decoder and T5's span corruption
The original 2017 transformer was an encoder-decoder: two stacks, with cross-attention from the decoder back to the encoder. The most influential family that kept this architecture is T5 (Text-to-Text Transfer Transformer). Its claim to fame is not the architecture (essentially the original transformer) but the pretraining objective: instead of next-token prediction, T5 was trained on span corruption, where chunks of the input are masked and the decoder reconstructs them. Modern LLMs went a different direction (decoder-only) because next-token prediction is a simpler objective that scaled better. There is also a third branch, encoder-only, which the next lesson covers through BERT.
This summary is the scan-it-in-five-minutes version. The full lesson walks through the T5 family (T5, mT5, ByT5), the span corruption mechanism, and the lecturer’s framing of why decoder-only architectures came to dominate.
Core ideas
- Three architectural branches. Encoder-decoder (original transformer + T5 family), encoder-only (BERT family, next lesson), decoder-only (modern LLMs). Each shape determines what the model is naturally good at.
- Encoder-decoder, recap. Two stacks. The encoder reads the input through self-attention and feed-forward blocks. The decoder writes the output through masked self-attention plus cross-attention back to the encoder. Designed for translation; the input and output are different things.
- T5 stands for Text-to-Text Transfer Transformer. Every NLP task gets framed as text-in, text-out. The same architecture handles translation, summarization, classification, and question answering by changing the task prefix in the input prompt.
- Three members of the T5 family. Vanilla T5; mT5 (multilingual, broader data and vocabulary); ByT5 (byte-level, no learned tokenizer, with a vocabulary that is essentially the 256 possible byte values plus a few special tokens).
- Span corruption is T5’s pretraining objective. Mask spans of the input behind sentinel tokens; the decoder reconstructs each masked span in series, each preceded by the sentinel that marked its position in the input. Walking example: “my [SENTINEL_1] is [SENTINEL_2] reading” in the encoder produces “[SENTINEL_1] teddy bear [SENTINEL_2] cute and [SENTINEL_3]” in the decoder, where the final [SENTINEL_3] hides nothing and only marks the end of the target.
- Training uses teacher forcing. The decoder is fed the correct target sequence (shifted right by one position) at training time, so the loss at every position is computed in a single parallel pass instead of generating the output token by token. The causal mask still prevents each position from seeing later tokens.
- Decoder-only eventually dominated. Per the lecturer: “your compute budget could be best invested in the decoder only.” Two strands: next-token prediction is the simplest possible training task and scales better; modern LLMs are mostly used as chat assistants, which is exactly what next-token prediction trains. Span corruption is more bespoke (corruption process, sentinel tokens, special decoder behavior), and the field traded that complexity for the simplicity of pure next-token prediction at scale.
- T5 still ships, especially mT5 for multilingual tasks, where the encoder-decoder shape with span-corruption pretraining still earns its place. Outside niches like that, encoder-decoder is mostly historical for new builds.
- Pitfall: conflating encoder-decoder with seq2seq. Encoder-decoder is one architecture; seq2seq is a broader category that includes encoder-decoder transformers but also pre-2017 LSTM-based seq2seq models.
- Pitfall: thinking T5 is obsolete. Less central than it was, still ships. mT5 in particular is a workhorse for multilingual tasks.
- Pitfall: treating span corruption and next-token prediction as interchangeable. Different shapes of pretraining produce different downstream behaviors.
- Pitfall: assuming all encoder-decoder models use span corruption. The original 2017 transformer was an encoder-decoder trained on next-token prediction. Span corruption is T5’s distinctive choice, not an encoder-decoder requirement.
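The span-corruption and teacher-forcing bullets above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not T5's actual preprocessing: the function names `span_corrupt` and `shift_right` are hypothetical, the spans are chosen by hand (real pretraining samples them at random), tokens are whitespace-split words rather than subword pieces, and `[START]` is a placeholder for whatever start token the decoder uses. The sentinel names follow this summary's notation; T5's released vocabularies name them differently. The input sentence is the one implied by the walking example.

```python
from typing import List, Tuple


def span_corrupt(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Mask each span behind a sentinel and build the matching decoder target.

    `spans` holds half-open (start, end) index ranges, sorted and
    non-overlapping. They are hand-picked here for illustration.
    """
    encoder_input: List[str] = []
    target: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans, start=1):
        sentinel = f"[SENTINEL_{i}]"
        encoder_input.extend(tokens[cursor:start])  # keep the uncorrupted tokens
        encoder_input.append(sentinel)              # one sentinel replaces the whole span
        target.append(sentinel)                     # the same sentinel announces the span...
        target.extend(tokens[start:end])            # ...followed by the tokens it hid
        cursor = end
    encoder_input.extend(tokens[cursor:])           # trailing uncorrupted tokens
    target.append(f"[SENTINEL_{len(spans) + 1}]")   # final sentinel closes the target
    return encoder_input, target


def shift_right(target: List[str], start_token: str = "[START]") -> List[str]:
    """Teacher forcing: the decoder input is the target shifted right by one.

    At position t the decoder sees the correct tokens before t and is
    trained to predict target[t]. `start_token` is a placeholder name.
    """
    return [start_token] + target[:-1]


tokens = "my teddy bear is cute and reading".split()
encoder_input, target = span_corrupt(tokens, [(1, 3), (4, 6)])
decoder_input = shift_right(target)

print(" ".join(encoder_input))  # my [SENTINEL_1] is [SENTINEL_2] reading
print(" ".join(target))         # [SENTINEL_1] teddy bear [SENTINEL_2] cute and [SENTINEL_3]
print(" ".join(decoder_input))  # [START] [SENTINEL_1] teddy bear [SENTINEL_2] cute and
```

Note that `decoder_input` and `target` have the same length: every decoder position has a known correct next token, which is what lets the loss be computed for all positions in one pass.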
What changes for you
When you read about a model and see “T5” or “mT5,” you now know what kind of architecture is doing the work and what kind of pretraining shaped its weights. When you read about a “decoder-only” LLM, you know what was dropped (the encoder, cross-attention) and why (per the lecturer’s argument). The encoder-decoder vs decoder-only split is one of the more durable taxonomy lines in the modern transformer landscape; this lesson maps where each kind of model sits.
Encoder-decoder has two stacks.
T5 added span corruption.
Decoder-only won on simplicity.