From scratch and the tokenizer: cheatsheet

What “from scratch” builds

Layer	You build
Tokenizer	Text to integer tokens (this lesson)
Architecture	The Transformer + its hyperparameters
Loss + optimizer	Cross-entropy + AdamW
Training loop	Ties it together into a model that learns
Systems	Fast kernels, parallelism across devices
Scale + quality	Scaling laws, data pipelines, evaluation, post-training

Through-line: efficiency. Always ask: how many FLOPs, how much memory, is the hardware busy?

Why subword (not characters, not words)

Unit	Vocabulary	Sequence length	Problem
Characters/bytes	Tiny	Huge	Expensive: attention cost grows quadratically with sequence length
Words	Exploding	Short	Out-of-vocabulary on unseen words
Subword (BPE)	Bounded	Reasonable	The middle ground
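
A minimal sketch of the two extremes in the table, assuming whitespace splitting stands in for "words" and a made-up three-word vocabulary (`word_vocab` is hypothetical, chosen only to trigger the out-of-vocabulary case):

```python
text = "Tokenizers turn text into integers."

# Byte-level units: tiny alphabet (256 values), but one token per byte.
byte_tokens = list(text.encode("utf-8"))

# Word-level units: short sequence, but the vocabulary must list every word.
word_tokens = text.split()

# Hypothetical tiny word vocabulary; anything outside it cannot be represented.
word_vocab = {"turn", "text", "into"}
oov = [w for w in word_tokens if w not in word_vocab]

print(len(byte_tokens))  # 35 byte tokens
print(len(word_tokens))  # 5 word tokens
print(oov)               # ['Tokenizers', 'integers.']
```

The byte sequence is 7x longer than the word sequence, and the word vocabulary already fails on two of five words. Subword tokenization sits between these two outcomes.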

Byte-level BPE: training

1. Start: corpus as bytes (256 base tokens; any string representable)
2. Count every adjacent token pair across the corpus
3. Merge the most frequent pair -> a new token; record the merge rule
4. Repeat until target vocabulary size

Output: a vocabulary + an ordered list of merge rules.
Kind of process: deterministic counting, NOT gradient descent. Same data + same tie-breaking rule -> same tokenizer; no GPU needed.
Byte-level start = no out-of-vocabulary case (worst case: individual byte tokens).
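
The four training steps can be sketched in a few lines of Python. This is an illustrative, unoptimized version, not the course's reference implementation: the names `train_bpe` and `merge_pair` are made up here, the tie-breaking rule (highest count, then smallest pair) is an assumption, and real tokenizers usually pre-split the text before counting pairs.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(corpus, vocab_size):
    """Return (vocab, merges): vocab maps id -> bytes, merges is an ordered rule list."""
    ids = list(corpus.encode("utf-8"))           # step 1: corpus as bytes
    vocab = {i: bytes([i]) for i in range(256)}  # 256 base tokens
    merges = []
    while len(vocab) < vocab_size:               # step 4: repeat until target size
        pairs = Counter(zip(ids, ids[1:]))       # step 2: count adjacent pairs
        if not pairs:
            break                                # nothing left to merge
        # Assumed tie-break: highest count first, then the smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        new_id = len(vocab)
        ids = merge_pair(ids, best, new_id)      # step 3: merge the top pair
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append((best, new_id))            # record the merge rule
    return vocab, merges

vocab, merges = train_bpe("aaabdaaabac", vocab_size=259)
print(merges)      # [((97, 97), 256), ((97, 98), 257), ((256, 257), 258)]
print(vocab[258])  # b'aaab'
```

Running it twice on the same corpus gives the same merge list, which is the "same data -> same tokenizer" property in practice.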

Byte-level BPE: using it

encode: text -> bytes -> apply learned merges in order -> token IDs
decode: token IDs -> look up each token's bytes -> concatenate bytes -> text
Byte-level means encode-then-decode reproduces the original exactly.
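
A minimal sketch of both directions, assuming a hand-written merge list (`MERGES` below is hypothetical, shaped like the output of training on a toy corpus) rather than rules learned from real data:

```python
# Hypothetical merge rules in learned order: ((left_id, right_id), new_id).
MERGES = [((97, 97), 256), ((97, 98), 257), ((256, 257), 258)]

# Vocabulary: 256 base bytes, plus the byte string each merge produces.
VOCAB = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in MERGES:
    VOCAB[new_id] = VOCAB[left] + VOCAB[right]

def encode(text):
    """text -> bytes -> apply each merge rule in learned order -> token IDs."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in MERGES:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def decode(ids):
    """token IDs -> look up each token's bytes -> concatenate -> text."""
    return b"".join(VOCAB[i] for i in ids).decode("utf-8")

ids = encode("aaabdaaabac")
print(ids)          # [258, 100, 258, 97, 99]
print(decode(ids))  # aaabdaaabac

# Text that matches no merge rule falls back to plain byte tokens.
unseen = "héllo"
print(decode(encode(unseen)) == unseen)  # True
```

The loop over every merge rule is the simplest way to respect merge order; it is slow for large rule lists, and faster schemes exist.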

Design choices you own

Choice	Trade-off
Vocabulary size	Larger = shorter sequences (cheaper, more text per context window) but a bigger embedding table and rarer tokens (fewer training examples each). Typically tens of thousands.
Special tokens	Reserved vocabulary entries that never occur in ordinary text (e.g. end-of-document), so the model can learn boundaries
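
Both choices can be made concrete with a short sketch. Everything named here is illustrative: `embedding_params`, `add_special_tokens`, and `encode_documents` are hypothetical helpers, the token name `<|endoftext|>` and the width `d_model=768` are example values, and a bytes-only encoder stands in for a trained tokenizer.

```python
def embedding_params(vocab_size, d_model):
    """Embedding table size: one d_model-wide vector per vocabulary entry."""
    return vocab_size * d_model

def add_special_tokens(vocab, specials):
    """Assign each special token the next unused ID; returns name -> id."""
    next_id = max(vocab) + 1
    special_ids = {}
    for name in specials:
        special_ids[name] = next_id
        next_id += 1
    return special_ids

def encode_documents(docs, encode_fn, eod_id):
    """Concatenate encoded documents, marking each boundary with eod_id."""
    ids = []
    for doc in docs:
        ids.extend(encode_fn(doc))
        ids.append(eod_id)
    return ids

# Vocabulary size vs. embedding table size (example width of 768).
for vocab_size in (10_000, 50_000, 100_000):
    print(vocab_size, embedding_params(vocab_size, 768))

# Special token: a reserved ID that no byte sequence in the text maps to.
base_vocab = {i: bytes([i]) for i in range(256)}
special_ids = add_special_tokens(base_vocab, ["<|endoftext|>"])

def byte_encode(text):
    # Stand-in encoder with no merges: one token per byte.
    return list(text.encode("utf-8"))

stream = encode_documents(["ab", "c"], byte_encode, special_ids["<|endoftext|>"])
print(stream)  # [97, 98, 256, 99, 256]
```

Because ID 256 lies outside the 256 byte values, ordinary text can never produce it, so the model sees it only at document boundaries.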

Words to use precisely

Token: the integer unit a model processes; the tokenizer’s output.
BPE (byte-pair encoding): subword tokenization by iteratively merging frequent pairs.
Byte-level: BPE whose base alphabet is the 256 bytes, so any string is representable.
Merge rule: a learned “combine pair X into new token Y” instruction; applied in order at encode time.
Out-of-vocabulary (OOV): an input the tokenizer cannot represent; byte-level BPE has none.

Source

Stanford CS336, “Language Modeling from Scratch,” Lecture 1 (Overview, tokenization), by Tatsunori Hashimoto and Percy Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.