Cheatsheet: What "from scratch" means, and the tokenizer
What “from scratch” builds
| Layer | You build |
|---|---|
| Tokenizer | Text to integer tokens (this lesson) |
| Architecture | The Transformer + its hyperparameters |
| Loss + optimizer | Cross-entropy + AdamW |
| Training loop | Ties it together into a model that learns |
| Systems | Fast kernels, parallelism across devices |
| Scale + good | Scaling laws, data pipelines, evaluation, post-training |
Through-line: efficiency. Always ask: how many FLOPs, how much memory, is the hardware busy?
Why subword (not characters, not words)
| Unit | Vocabulary | Sequence length | Problem |
|---|---|---|---|
| Characters/bytes | Tiny | Huge | Expensive; cost grows with length |
| Words | Exploding | Short | Out-of-vocabulary on unseen words |
| Subword (BPE) | Bounded | Reasonable | The middle ground |
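
To make the two extremes in the table concrete, here is a small sketch that splits one sentence into bytes and into whitespace-separated words. The sentence and the three-word vocabulary are made up for illustration; they are not from the lecture.

```python
# Illustrative sentence and toy word vocabulary (both hypothetical).
text = "the tokenizer tokenizes unseen words"
word_vocab = {"the", "tokenizer", "words"}

byte_units = list(text.encode("utf-8"))  # one unit per byte
word_units = text.split()                # one unit per whitespace-separated word

# A word-level tokenizer has no way to represent words outside its vocabulary.
oov_words = [w for w in word_units if w not in word_vocab]

print(len(byte_units), "byte units")
print(len(word_units), "word units")
print("out-of-vocabulary words:", oov_words)
```

The byte sequence is several times longer than the word sequence, while the word split fails on anything it has not seen. Subword tokens sit between the two.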
Byte-level BPE: training
1. Start: corpus as bytes (256 base tokens; any string representable)
2. Count every adjacent token pair across the corpus
3. Merge the most frequent pair -> a new token; record the merge rule
4. Repeat until target vocabulary size

- Output: a vocabulary + an ordered list of merge rules.
- Kind of process: deterministic statistics, NOT gradient descent. Same data -> same tokenizer, no GPU.
- Byte-level start = no out-of-vocabulary case (worst case: individual byte tokens).
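
The training steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the lecture's reference implementation: the names `train_bpe` and `merge` are hypothetical, ties between equally frequent pairs are broken by picking the smallest pair (an assumption made here so the result is deterministic), and the pre-tokenization step that production tokenizers usually apply is skipped.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus, vocab_size):
    """Learn byte-level BPE merges. Returns (vocab, merges)."""
    ids = list(corpus.encode("utf-8"))            # step 1: corpus as bytes
    vocab = {i: bytes([i]) for i in range(256)}   # 256 base tokens
    merges = []                                   # ordered ((left, right), new_id)
    while len(vocab) < vocab_size:                # step 4: repeat until target size
        pairs = Counter(zip(ids, ids[1:]))        # step 2: count adjacent pairs
        if not pairs:
            break                                 # fewer than two tokens left
        # Step 3: most frequent pair; ties go to the smallest pair (assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        new_id = len(vocab)
        ids = merge(ids, best, new_id)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append((best, new_id))
    return vocab, merges


vocab, merges = train_bpe("aaabdaaabac", vocab_size=259)
print(merges)
```

Running it twice on the same corpus gives the same merges, which is the "deterministic statistics" point: there are no gradients and no random initialization.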
Byte-level BPE: using it
- encode: text -> bytes -> apply learned merges in order -> token IDs
- decode: token IDs -> look up -> concatenate bytes -> text
- Byte-level means encode-then-decode reproduces the original exactly.
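
A minimal sketch of encode and decode, assuming merge rules stored as an ordered list of `((left, right), new_id)` entries. The three rules in `MERGES` are hand-written for illustration, and the function names are hypothetical rather than taken from any library.

```python
# Hand-written merge rules for illustration: "aa" -> 256, "ab" -> 257, then 256+257 -> 258.
MERGES = [((97, 97), 256), ((97, 98), 257), ((256, 257), 258)]


def build_vocab(merges):
    """Map each token ID to its bytes: 256 base bytes plus one entry per merge."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def encode(text, merges):
    """text -> bytes -> apply learned merges in order -> token IDs."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, vocab):
    """token IDs -> look up -> concatenate bytes -> text."""
    return b"".join(vocab[i] for i in ids).decode("utf-8")


VOCAB = build_vocab(MERGES)
sample = "héllo aaab"
token_ids = encode(sample, MERGES)
print(token_ids)
print(decode(token_ids, VOCAB) == sample)
```

Characters that no merge rule covers, such as the `é` here, fall back to their individual UTF-8 bytes, so nothing is lost on the round trip. Decoding an arbitrary ID sequence that did not come from `encode` can yield invalid UTF-8, in which case this strict `decode` raises `UnicodeDecodeError`.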
Design choices you own
| Choice | Trade-off |
|---|---|
| Vocabulary size | Larger = shorter sequences (cheaper, more text per context) but a bigger embedding table and rarer tokens, each seen less often in training. Typically tens of thousands to a few hundred thousand. |
| Special tokens | Reserved vocab entries the text lacks (e.g. end-of-document) so the model learns boundaries |
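
Both choices can be illustrated with a short sketch. The embedding table holds one row of width `d_model` per vocabulary entry, so its size grows linearly with vocabulary size. The value `D_MODEL = 768` is an assumed example width, and the names `embedding_params`, `END_OF_DOCUMENT`, and `join_documents` are hypothetical. To keep the sketch short, the special token is given the first ID after the 256 base bytes, with no merges in between.

```python
def embedding_params(vocab_size, d_model):
    """Entries in a token embedding table: one d_model-wide row per token."""
    return vocab_size * d_model


D_MODEL = 768  # assumed model width, for illustration only
for vocab_size in (10_000, 50_000, 100_000):
    print(vocab_size, "tokens ->", embedding_params(vocab_size, D_MODEL), "embedding entries")

# A special token is an ID that no byte sequence in the text can produce.
BASE_VOCAB_SIZE = 256              # base bytes only; a real vocabulary also has merged tokens
END_OF_DOCUMENT = BASE_VOCAB_SIZE  # first unused ID, reserved as the boundary marker


def join_documents(docs):
    """Concatenate documents into one token stream, marking each boundary."""
    stream = []
    for doc in docs:
        stream.extend(doc.encode("utf-8"))
        stream.append(END_OF_DOCUMENT)
    return stream


print(join_documents(["ab", "c"]))
```

Because the reserved ID lies outside the byte range, it can never collide with ordinary text, which is what lets the model treat it as an unambiguous boundary.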
Words to use precisely
- Token: the integer unit a model processes; the tokenizer’s output.
- BPE (byte-pair encoding): subword tokenization by iteratively merging frequent pairs.
- Byte-level: BPE whose base alphabet is the 256 bytes, so any string is representable.
- Merge rule: a learned “combine pair X into new token Y” instruction; applied in order at encode time.
- Out-of-vocabulary (OOV): an input the tokenizer cannot represent; byte-level BPE has none.
Source
- Stanford CS336, “Language Modeling from Scratch,” Lecture 1 (Overview, tokenization), by Tatsunori Hashimoto and Percy Liang.
cs336.stanford.edu. Independent structural mirror in original prose; see references.