Summary: What "from scratch" means, and the tokenizer

This track builds an LLM from scratch, the real thing. “From scratch” means building the whole stack yourself: the tokenizer, the Transformer architecture, the loss and optimizer, the training loop, and then the systems (kernels, parallelism), scaling laws, data pipelines, evaluation, and post-training that make a model efficient and good. The through-line is efficiency: at every step you account for compute (FLOPs), memory, and whether the hardware is busy. The lesson then starts where a model starts, the tokenizer, which converts text into the integer tokens everything downstream operates on. Characters make sequences too long and words make the vocabulary explode, so the standard is byte-level BPE: start from bytes (256 tokens, any string representable, no out-of-vocabulary problem), then learn merges of the most frequent adjacent pairs up to a chosen vocabulary size. Learning the merges is deterministic counting, not gradient-based training. This is the scan version; the lesson lays out the road and builds the first component.
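To make the starting point concrete, here is a minimal sketch of the three views of the same string: characters, whitespace-separated words, and UTF-8 bytes. The sample text and variable names are illustrative assumptions, not part of the lesson, and splitting on whitespace is only a stand-in for a real word tokenizer.

```python
# Illustrative sample text (an assumption); it includes non-ASCII characters on purpose.
text = "Tokenizers turn text into integers. Naïve café ☕"

# Byte-level view: any string becomes a sequence of integers in range(256).
byte_ids = list(text.encode("utf-8"))

# The mapping is lossless: the bytes decode back to the original string.
decoded = bytes(byte_ids).decode("utf-8")

# Sequence length under each scheme.
char_len = len(text)          # one token per character
word_len = len(text.split())  # one token per whitespace-separated word
byte_len = len(byte_ids)      # one token per UTF-8 byte

print(f"chars={char_len} words={word_len} bytes={byte_len}")
```

Bytes never need an unknown token, but they produce the longest sequence of the three. Merges exist to shorten it.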

Core ideas

“From scratch” is the whole stack. Tokenizer, architecture, loss (cross-entropy), optimizer (AdamW), and training loop, plus the systems, scaling, data, evaluation, and post-training that the rest of the track covers. Not reinventing math; building the language-model-specific layers.
Efficiency is the spine. Every step is a decision about FLOPs, memory, and hardware utilization. Building an LLM is mostly precise decisions about compute and data.
The tokenizer is the first component. It turns text into integer tokens; everything downstream, including cost and context limits, is measured in those tokens.
Characters and words both fail. Characters give tiny vocabularies but enormous sequences (expensive); words give short sequences but an exploding vocabulary and the out-of-vocabulary problem. Subword tokens are the middle ground.
Byte-level BPE is the standard. Start from bytes (256, so any string is representable, no unknown tokens), then learn merges: count adjacent pairs, merge the most frequent into a new token, repeat to the target vocab size. Output: a vocabulary plus ordered merge rules. Deterministic, no GPU.
You own the trade-offs. You choose the vocabulary size (a larger vocabulary shortens sequences but grows the embedding table) and the special tokens. These are deliberate choices with no single right answer.
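The merge-learning loop described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the lesson's reference implementation: the function names (train_bpe, merge, encode, decode) are hypothetical, ties between equally frequent pairs are broken by the smallest pair so the result is deterministic, and there is no pre-tokenization or special-token handling.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn byte-level BPE merges until the vocabulary reaches `vocab_size`."""
    ids = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}  # base vocabulary: all 256 bytes
    merges = []                                  # ordered rules: ((left, right), new_id)
    while len(vocab) < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))       # count adjacent pairs
        if not pairs:
            break                                # nothing left to merge
        # Most frequent pair; ties go to the smallest pair (an assumed tie-break rule).
        best = max(pairs, key=lambda p: (pairs[p], -p[0], -p[1]))
        new_id = len(vocab)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append((best, new_id))
        ids = merge(ids, best, new_id)
    return vocab, merges


def encode(text, merges):
    """Start from bytes, then apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, vocab):
    """Concatenate each token's bytes and decode as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


vocab, merges = train_bpe("low lower lowest low low", vocab_size=260)
ids = encode("lowest low", merges)
print(ids, decode(ids, vocab))
```

The output is exactly the two artifacts named above: a vocabulary and an ordered list of merge rules. Text with bytes never seen during merge learning still encodes, because every byte is already in the base vocabulary.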

What changes for you

This lesson sets the posture for the whole track and makes the first concept concrete. The posture: a language model is not magic that consumes “language”; it is a system that processes integer tokens produced by a procedure you can write yourself, and you understand each layer by building it. The concept: the tokenizer is the foundation the model’s economics rest on, because the token is the unit your compute budget, context length, and data costs are all denominated in. Building it yourself turns “the model understands text” into “the model processes tokens I defined,” which is the mental shift the from-scratch project is built on. The next lesson stays at this foundational level but turns to the thing the whole course revolves around: counting the cost of a model in FLOPs and memory, the accounting that makes every later design choice quantifiable.

A language model does not read text; it processes a sequence of integers produced by a tokenizer you can build yourself. That is the first thing “from scratch” demystifies, and the rest of the track is built on those tokens.