
Cheatsheet: How transformers turn input into output: encoder-decoder and T5's span corruption

Three architectural branches, three pretraining objectives:
  • Encoder-decoder + span corruption → T5 family
  • Encoder-only + masked language model → BERT family (next lesson)
  • Decoder-only + next-token prediction → Modern LLMs
Decoder-only won at scale because next-token prediction is the simplest training task and scales the best.
| | Encoder-decoder | Encoder-only | Decoder-only |
|---|---|---|---|
| Stacks | Two | One (encoder) | One (decoder) |
| Cross-attention? | Yes (decoder back to encoder) | No (no decoder) | No (no encoder) |
| Original example | 2017 transformer (machine translation) | BERT (next lesson) | GPT-style models |
| Modern examples | T5 family | BERT, RoBERTa, DistilBERT | Most modern LLMs |
| Naturally good at | Translation, span-fill tasks | Classification, embeddings | Generation, chat, completion |
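The cross-attention row is the one structural difference that only the encoder-decoder column has. As a rough sketch of what "decoder back to encoder" means, the following pure-Python example computes single-head scaled dot-product attention where the queries come from decoder positions and the keys and values come from encoder positions. It is a simplification under stated assumptions: there are no learned projection matrices, no multiple heads, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Single-head scaled dot-product attention without learned projections.

    Queries come from the decoder; keys and values both come from the encoder.
    """
    d = len(encoder_states[0])
    outputs = []
    for q in decoder_states:
        # Similarity of this decoder position to every encoder position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in encoder_states]
        weights = softmax(scores)
        # Weighted average of encoder vectors for this decoder position.
        out = [sum(w * v[j] for w, v in zip(weights, encoder_states))
               for j in range(d)]
        outputs.append(out)
    return outputs

# Toy vectors (made up for illustration).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[1.0, 0.0], [0.0, 2.0]]              # 2 target positions
attended = cross_attention(decoder_states, encoder_states)
print(len(attended), len(attended[0]))  # 2 2
```

The output has one vector per decoder position, each a mixture of encoder vectors. Encoder-only and decoder-only models have no second stack to attend to, so this layer does not appear in them.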
| Member | What it adds |
|---|---|
| T5 (vanilla) | Original Text-to-Text Transfer Transformer; encoder-decoder; span corruption pretraining |
| mT5 (multilingual T5) | Broader training data + broader vocabulary; same architecture |
| byT5 (byte-level T5) | No learned tokenizer; operates directly on UTF-8 bytes, so the base vocabulary is fixed at 256 entries (one per byte value, plus a few special tokens); every Unicode character is representable in one to four bytes |
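To make the byte-level idea concrete, here is a minimal sketch that uses only Python's built-in UTF-8 encoding. It does not call the byT5 tokenizer itself: the helper name `to_byte_ids` is hypothetical, and the sketch ignores any special-token IDs a real checkpoint reserves.

```python
def to_byte_ids(text: str) -> list[int]:
    """Map text to its raw UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))

# ASCII characters take one byte; other characters take two to four.
# Escapes used: e-acute, euro sign, and an emoji.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F642"]:
    ids = to_byte_ids(ch)
    print(repr(ch), len(ids), ids)
```

Whatever the script or symbol, every ID falls in the same 256-value range, which is why no tokenizer training is needed. The cost is longer sequences, since one character can expand to several positions.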
Original sentence:
"my teddy bear is cute and reading"
Encoder input (with sentinels marking masked spans):
my [SENTINEL_1] is [SENTINEL_2] reading
Decoder output (each span preceded by its sentinel):
[SENTINEL_1] teddy bear [SENTINEL_2] cute and [SENTINEL_3]
The trailing [SENTINEL_3] marks end-of-sequence.
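The transformation in the example above can be sketched in a few lines of Python. This is an illustration, not T5's actual preprocessing code: the function name `span_corrupt` is hypothetical, the spans are passed in by hand instead of being sampled at random, and the sentinel strings simply follow the naming used in the example.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the decoder target.

    `spans` is a sorted list of non-overlapping (start, end) index pairs,
    with `end` exclusive.
    """
    encoder_input, decoder_target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans, start=1):
        sentinel = f"[SENTINEL_{i}]"
        encoder_input.extend(tokens[cursor:start])  # keep unmasked tokens
        encoder_input.append(sentinel)              # mark where the span was
        decoder_target.append(sentinel)             # announce the span
        decoder_target.extend(tokens[start:end])    # then its original contents
        cursor = end
    encoder_input.extend(tokens[cursor:])           # unmasked tail
    decoder_target.append(f"[SENTINEL_{len(spans) + 1}]")  # closing sentinel
    return encoder_input, decoder_target

tokens = "my teddy bear is cute and reading".split()
enc, dec = span_corrupt(tokens, spans=[(1, 3), (4, 6)])
print(" ".join(enc))  # my [SENTINEL_1] is [SENTINEL_2] reading
print(" ".join(dec))  # [SENTINEL_1] teddy bear [SENTINEL_2] cute and [SENTINEL_3]
```

Note that the unmasked tokens appear only on the encoder side; the decoder target contains just the sentinels and the hidden content, which keeps targets short.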
| Element | Purpose |
|---|---|
| Span | One or more consecutive tokens to be masked. Multiple spans per training example, up to a configured maximum. |
| Sentinel token | Special token that marks the position of a masked span (in the encoder input) and delimits the reconstructed content (in the decoder output). |
| Teacher forcing | At training time the decoder is fed the full correct output (shifted one position) instead of its own predictions, so the loss at every position is computed in a single parallel pass instead of by generating token by token. |
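Teacher forcing is easiest to see as a data-layout trick. The sketch below shows the two pieces: building the decoder input from the correct target, and scoring every position in one go. It is a toy under stated assumptions: `shift_right`, `mean_cross_entropy`, the `<start>` token, and the made-up probability tables are all hypothetical stand-ins for a real model's forward pass.

```python
import math

def shift_right(target, start_token="<start>"):
    """Decoder input under teacher forcing: the gold target shifted one step right."""
    return [start_token] + target[:-1]

def mean_cross_entropy(predicted, target):
    """Average negative log-likelihood of the gold token at every position.

    `predicted[t]` is a dict mapping token -> probability for position t.
    """
    losses = [-math.log(dist[gold]) for dist, gold in zip(predicted, target)]
    return sum(losses) / len(losses)

target = ["[SENTINEL_1]", "teddy", "bear", "[SENTINEL_2]"]
decoder_input = shift_right(target)
# decoder_input[t] holds the gold token from step t-1, never a model prediction.

# Stand-in for one forward pass: a made-up distribution per position.
predicted = [{tok: 0.5, "<other>": 0.5} for tok in target]
loss = mean_cross_entropy(predicted, target)
print(decoder_input)
print(round(loss, 4))
```

Because the inputs for all positions are known up front, nothing has to be generated step by step during training. At inference time there is no gold target, so the decoder has to feed its own outputs back in one token at a time.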
| | Span corruption (T5) | Next-token prediction (most modern LLMs) |
|---|---|---|
| Training signal | Reconstruct masked spans | Predict next token at each position |
| Setup complexity | Higher (corruption process, sentinels, special decoder behavior) | Minimal (just feed raw text) |
| Decoder behavior | Outputs sentinels and masked content | Extends sequence one token at a time |
| Aligns naturally with | Fill-in-the-blanks tasks | Chat assistants, generation, completion |
| Scales how? | Worked at the time; less central now | Won at scale; the lecturer’s framing: “next word prediction is the simplest thing you can do” |
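For contrast with span corruption, the "just feed raw text" setup for next-token prediction can be sketched as follows. The function name `next_token_examples` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def next_token_examples(tokens):
    """Return (context, next_token) pairs: one training signal per position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "my teddy bear is cute".split()
examples = next_token_examples(tokens)
for context, nxt in examples:
    print(" ".join(context), "->", nxt)
```

There is no corruption step and there are no sentinels: the text is its own label, shifted by one position. In practice all of these pairs are scored in a single pass over the sequence instead of being materialized as separate examples.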
| Reason | Detail |
|---|---|
| Compute budget pays off in the decoder | “Your compute budget could be best invested in the decoder only.” Investment goes into the part that does generation. |
| Next-token prediction is simpler | Simplest possible training task. No corruption process, no sentinels, no special decoder behavior. |
| Aligns with the chat-assistant task | Modern LLMs are mostly used as chat assistants. Text in, text response out is exactly what next-token prediction trains. |
| Phrase | What it means |
|---|---|
| T5 | Encoder-decoder architecture with span corruption pretraining; still ships in practice, especially for multilingual NLP via mT5 |
| mT5 | Multilingual T5; workhorse for cross-lingual tasks |
| byT5 | Byte-level T5; no tokenizer; rare in practice but useful when tokenizers fight you |
| Encoder-decoder | Two-stack architecture with cross-attention; mostly historical for new builds |
| Decoder-only | One-stack architecture, no cross-attention; what most modern LLMs are |
| Pitfall | Reality |
|---|---|
| Encoder-decoder = seq2seq | No. Encoder-decoder is one architecture (two transformer stacks with cross-attention). Seq2seq is a broader category that includes encoder-decoder transformers but also pre-transformer LSTM-based seq2seq models. |
| T5 is obsolete | Less central than it was, still ships. mT5 is a workhorse for multilingual tasks. |
| Span corruption and next-token prediction are interchangeable | Different shapes of pretraining produce different downstream behaviors. They train the model on different patterns. |
| All encoder-decoder models use span corruption | No. The original 2017 transformer was an encoder-decoder trained on next-token prediction (machine translation as sequence completion). Span corruption is T5’s distinctive choice, not an encoder-decoder requirement. |
  • Encoder-decoder transformer: architecture with two stacks of transformer blocks; the decoder uses cross-attention to look back at the encoder’s representations.
  • Encoder-only transformer: architecture with just the encoder stack; the BERT family. Next lesson.
  • Decoder-only transformer: architecture with just the decoder stack; what most modern LLMs are.
  • Cross-attention: in the decoder, the attention layer that lets output tokens attend back to the encoder’s representations.
  • T5 (Text-to-Text Transfer Transformer): encoder-decoder transformer family with span corruption pretraining and a text-in / text-out framing for all tasks.
  • mT5: multilingual T5; broader data and vocabulary.
  • byT5: byte-level T5; vocabulary of 256, no tokenizer.
  • Span corruption: T5’s pretraining objective. Mask spans of input behind sentinel tokens; the decoder reconstructs each masked span in series.
  • Sentinel token: special token that marks the position of a masked span in the input and delimits the reconstructed content in the output.
  • Teacher forcing: training mechanism where the decoder is fed the full correct output (shifted one position) instead of its own predictions, so the loss at every position is computed in a single parallel pass instead of by generating token by token.

Encoder-decoder has two stacks.
T5 added span corruption.
Decoder-only won on simplicity.