Pretraining: how a model learns language by predicting the next word
What you’ll learn
This is the opening lesson of Phase 3 (How models are trained at scale) in Track 5 (AI Foundations). You’ve seen the gear: tokens, embeddings, attention, the transformer block, the BERT family. What you haven’t seen is what actually makes a model good at language.
For the decoder-only architecture that dominates generative AI today (more than 90 percent of modern LLMs by the lecturer’s count), the answer is almost embarrassingly simple. You take a vast amount of text (most of the open internet, code repositories, books, articles), you show the model a piece of text, and you ask it to predict the next word (more precisely, the next token). Then the next. Then the next, trillions of times. That single objective, repeated at scale, is what produces almost every language ability you experience when you talk to a modern AI. The lesson walks through one training step concretely (predict the next token from a prefix, compute the cross-entropy loss against the token that actually came next, adjust the weights very slightly), names Common Crawl as the dominant data source, and grounds the scale (Llama 4 Scout was trained on roughly 40 trillion tokens; frontier scale has grown roughly two- to threefold since Llama 3’s 15 trillion). Pretraining is the most expensive single stage in modern AI (millions of dollars per run, months of GPU time on large clusters), and almost everything else later in this curriculum is a smaller, cheaper stage built on top of it.
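The training step described above can be sketched in a few lines. This is a minimal illustration, assuming a toy four-token vocabulary and treating the model’s raw scores (logits) as the only adjustable weights. The vocabulary, the prefix, the learning rate, and the function names are all hypothetical and not taken from the lesson; a real model adjusts billions of network weights through backpropagation rather than nudging scores directly.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log of the probability assigned to the token that actually came next."""
    return -math.log(probs[target_index])

def training_step(logits, target_index, learning_rate=0.1):
    """One toy update: predict, measure the loss, nudge the scores slightly."""
    probs = softmax(logits)
    loss = cross_entropy(probs, target_index)
    # For softmax followed by cross-entropy, the gradient for each logit is
    # (probability - 1) for the correct token and (probability) for the others.
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [z - learning_rate * g for z, g in zip(logits, grads)]
    return loss, new_logits

# Hypothetical setup: the prefix is "the cat sat on the" and there are
# four candidate next tokens. Both are assumptions for illustration.
vocab = ["mat", "dog", "moon", "sat"]
logits = [0.0, 0.0, 0.0, 0.0]   # an untrained model: every token equally likely
target = vocab.index("mat")      # the token that actually came next in the text

loss_before, logits = training_step(logits, target)
loss_after, _ = training_step(logits, target)
print(f"loss before: {loss_before:.4f}, loss after one update: {loss_after:.4f}")
```

With four equally likely tokens, the first loss is ln 4, about 1.386. The second loss is slightly lower because the update raised the probability of the correct token, which is the "adjust the weights very slightly" part of the step.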
Where this fits
This is the opening lesson of Phase 3, How models are trained at scale. Phase 3 builds the capability to describe what it takes to build a frontier model, and why most organizations cannot. This lesson focuses on the central training objective itself: next-token prediction at internet scale, the simple idea that produces almost all of modern language ability. The next three Phase 3 lessons cover Why scale matters (Kaplan, Chinchilla, and inference-time scaling), Parallelism and Flash Attention (how the training run is actually distributed across GPUs), and Quantization and mixed precision (the precision-vs-throughput tradeoffs that shape modern training and inference). As the Phase 3 opener, it builds directly on Phase 2’s architecture work; the previous lesson in the curriculum is the Phase 2 closer on BERT derivatives.
Before you start
Prerequisites: the BERT derivatives lesson (the Phase 2 closer, which in turn assumes the rest of the architecture phase). You should be comfortable with what attention is, what the transformer block does, and the encoder/decoder distinction. No math or coding background is required.
By the end, you’ll be able to
- Describe next-token prediction as the dominant pretraining objective for decoder-only models, and explain why decoder-only architectures dominate generative AI today (the lecturer’s “more than 90 percent” framing)
- Explain why training on the internet at scale produces general-purpose language ability without any task-specific labeling
- Recognize Common Crawl as the dominant pretraining data source, alongside code repositories (GitHub) and curated text (books, articles)
- Walk through one pretraining step concretely (predict the next token from a prefix; the training signal is whatever token actually came next; the cross-entropy loss is the negative log of the probability the model assigned to that token)
- Distinguish the pretraining stage from the later tuning stages, and ground the scale (Llama 4 Scout’s ~40 trillion tokens; frontier scale has grown roughly two- to threefold since Llama 3)
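
The loss named in the pretraining-step objective above can be written in one line. This is the standard definition of cross-entropy for next-token prediction; the notation is introduced here for illustration and is not the lesson’s own. It assumes $x_t$ is the token at position $t$, $p_\theta$ is the model’s predicted distribution under weights $\theta$, and $T$ is the length of the training sequence.

```latex
\ell_t = -\log p_\theta\left(x_{t+1} \mid x_1, \dots, x_t\right),
\qquad
\mathcal{L} = \frac{1}{T-1} \sum_{t=1}^{T-1} \ell_t
```

Using the natural logarithm, a model that gives the correct next token a probability of 0.5 incurs a loss of about 0.693 at that position, while a probability of 0.9 gives about 0.105. The loss reaches zero only when the model assigns probability 1 to the token that actually came next.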
Time and difficulty
- Read time: about 22 minutes
- Practice time: about 15 minutes (a short worked example tracing one pretraining step, plus flashcards)
- Difficulty: standard