Cheatsheet: What "from scratch" means, and the tokenizer

| Layer | You build |
| --- | --- |
| Tokenizer | Text to integer tokens (this lesson) |
| Architecture | The Transformer + its hyperparameters |
| Loss + optimizer | Cross-entropy + AdamW |
| Training loop | Ties it together into a model that learns |
| Systems | Fast kernels, parallelism across devices |
| Scale + good | Scaling laws, data pipelines, evaluation, post-training |

Through-line: efficiency. Always ask: how many FLOPs, how much memory, is the hardware busy?

| Unit | Vocabulary | Sequence length | Problem |
| --- | --- | --- | --- |
| Characters/bytes | Tiny | Huge | Expensive; cost grows with length |
| Words | Exploding | Short | Out-of-vocabulary on unseen words |
| Subword (BPE) | Bounded | Reasonable | The middle ground |

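The trade-off in the table above can be checked on a toy string. This is only a sketch: the sample sentences and the whitespace word split are assumptions made for illustration, not part of the lecture.

```python
# Illustrative comparison of unit choices on a toy sentence (hypothetical data).
text = "the cat sat on the mat"
unseen = "the dog sat"

# Byte-level units: tiny vocabulary (at most 256 values), long sequence.
byte_tokens = list(text.encode("utf-8"))

# Word-level units: short sequence, but the vocabulary only covers seen words.
# Splitting on whitespace is a simplifying assumption.
word_tokens = text.split()
word_vocab = set(word_tokens)

# Any word outside the training vocabulary is out-of-vocabulary.
oov_words = [w for w in unseen.split() if w not in word_vocab]

print(len(byte_tokens))  # 22
print(len(word_tokens))  # 6
print(oov_words)         # ['dog']
```

Subword tokenization aims for a sequence length between these two extremes while keeping the vocabulary bounded.
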
1. Start: corpus as bytes (256 base tokens; any string representable)
2. Count every adjacent token pair across the corpus
3. Merge the most frequent pair -> a new token; record the merge rule
4. Repeat until target vocabulary size
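The four steps above can be sketched as a toy trainer. This is a minimal, unoptimized illustration, not the course's reference implementation: the function name `train_bpe`, the `vocab_size` parameter, and the tie-breaking rule (prefer the numerically smallest pair) are assumptions chosen here so the result is deterministic.

```python
from collections import Counter

def train_bpe(corpus: str, vocab_size: int):
    """Toy byte-level BPE trainer (hypothetical helper). Returns (vocab, merges)."""
    ids = list(corpus.encode("utf-8"))            # step 1: corpus as bytes
    vocab = {i: bytes([i]) for i in range(256)}   # 256 base tokens
    merges = []                                   # ordered list of merge rules

    while len(vocab) < vocab_size:                # step 4: repeat until target size
        pairs = Counter(zip(ids, ids[1:]))        # step 2: count adjacent pairs
        if not pairs:
            break                                 # nothing left to merge
        # step 3: most frequent pair; ties go to the smallest pair (an assumption)
        best = max(pairs, key=lambda p: (pairs[p], -p[0], -p[1]))
        new_id = len(vocab)
        merges.append((best, new_id))             # record the merge rule
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]

        # Replace every occurrence of the pair, scanning left to right.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out

    return vocab, merges

vocab, merges = train_bpe("aaabdaaabac", vocab_size=259)
print(merges)      # [((97, 97), 256), ((97, 98), 257), ((256, 257), 258)]
print(vocab[258])  # b'aaab'
```

Recounting all pairs on every iteration is quadratic in the worst case; it keeps the sketch short, and a practical trainer would update counts incrementally.
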
  • Output: a vocabulary + an ordered list of merge rules.
  • Kind of process: deterministic statistics, NOT gradient descent. Same data -> same tokenizer (given a fixed tie-breaking rule), no GPU.
  • Byte-level start = no out-of-vocabulary case (worst case: individual byte tokens).
  • encode: text -> bytes -> apply learned merges in order -> token IDs
  • decode: token IDs -> look up -> concatenate bytes -> text
  • Byte-level means encode-then-decode reproduces the original exactly.
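The encode and decode pipelines above can be sketched as follows. The merge list here is hand-written and hypothetical, standing in for whatever a trainer would produce; the function names `encode` and `decode` and the choice of UTF-8 are assumptions for illustration.

```python
# Hypothetical merge rules in learned order: ((left_id, right_id), new_id).
merges = [((97, 97), 256), ((97, 98), 257), ((256, 257), 258)]

# Vocabulary: 256 base bytes plus one entry per merge rule.
vocab = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in merges:
    vocab[new_id] = vocab[left] + vocab[right]

def encode(text: str) -> list[int]:
    """text -> bytes -> apply merges in order -> token IDs."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids: list[int]) -> str:
    """token IDs -> look up -> concatenate bytes -> text."""
    return b"".join(vocab[i] for i in ids).decode("utf-8")

print(encode("aaab"))  # [258]
print(encode("xyz"))   # [120, 121, 122]  (no merge applies; falls back to bytes)

sample = "aaab naïve ☃"
print(decode(encode(sample)) == sample)  # True
```

The round trip holds for any input string. Decoding an arbitrary ID sequence that did not come from `encode` may yield invalid UTF-8; passing `errors="replace"` to `bytes.decode` is one way to handle that case.
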
| Choice | Trade-off |
| --- | --- |
| Vocabulary size | Larger = shorter sequences (cheaper, more per context) but bigger embedding table + rarer tokens. Typically tens of thousands. |
| Special tokens | Reserved vocab entries the text lacks (e.g. end-of-document) so the model learns boundaries |

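Both rows of the table above can be made concrete with a little arithmetic. This is a sketch under stated assumptions: the model width `D_MODEL = 768`, the learned vocabulary size of 50,000, and the helper names are hypothetical values picked for illustration, not figures from the lecture.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """An embedding table stores one d_model-sized vector per token."""
    return vocab_size * d_model

D_MODEL = 768  # hypothetical model width
for v in (10_000, 50_000, 100_000):
    print(v, embedding_params(v, D_MODEL))
# 10000 7680000
# 50000 38400000
# 100000 76800000

# Special token: reserve one ID beyond the learned vocabulary (an assumption
# about ID layout; real tokenizers may place special tokens elsewhere).
LEARNED_VOCAB_SIZE = 50_000
END_OF_DOC = LEARNED_VOCAB_SIZE

def join_documents(docs: list[list[int]]) -> list[int]:
    """Concatenate tokenized documents, marking each boundary explicitly."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(END_OF_DOC)
    return stream

print(join_documents([[1, 2], [3]]))  # [1, 2, 50000, 3, 50000]
```

The embedding table grows linearly with vocabulary size, which is the cost side of buying shorter sequences.
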
  • Token: the integer unit a model processes; the tokenizer’s output.
  • BPE (byte-pair encoding): subword tokenization by iteratively merging frequent pairs.
  • Byte-level: BPE whose base alphabet is the 256 bytes, so any string is representable.
  • Merge rule: a learned “combine pair X into new token Y” instruction; applied in order at encode time.
  • Out-of-vocabulary (OOV): an input the tokenizer cannot represent; byte-level BPE has none.
  • Stanford CS336, “Language Modeling from Scratch,” Lecture 1 (Overview, tokenization), by Tatsunori Hashimoto and Percy Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.