References: Pretraining: how a model learns language by predicting the next word
Source material
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 4, LLM training): https://www.youtube.com/watch?v=VlA_jt_3Qc4
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson adapts the transfer-learning, pretraining-objective, and Common Crawl section of Stanford CME 295 Lecture 4 (~7m44s to ~13m23s, the opening pedagogical arc of the lecture). The lecture continues into scaling laws (covered in Phase 3, lesson 2), parallelism and FlashAttention (Phase 3, lesson 3), and quantization (Phase 3, lesson 4).

Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lectures remain with Stanford and the instructors.
Going deeper
A short list, chosen for durability.
-
“Language Models are Few-Shot Learners”, Brown et al., 2020. The GPT-3 paper. Section 2 (Approach) covers the next-token pretraining objective at the scale that made decoder-only LLMs the dominant paradigm. Its 300-billion-token training figure matches the one the Stanford lecturer cites.
-
“Improving Language Understanding by Generative Pre-Training”, Radford et al., 2018. The original GPT paper. Earlier and shorter than the GPT-3 paper, but the one that established the template: a decoder-only model, next-token pretraining, then fine-tuning. Worth reading after the GPT-3 paper for historical context.
-
Common Crawl. The open project the Stanford lecturer points to as the dominant pretraining data source. The site has the raw archives, statistics, and a useful “Get Started” page if you want to see the data shape directly.
-
“The RefinedWeb Dataset for Falcon LLM”, Penedo et al., 2023. A worked example of how a modern frontier-class pretraining set is built on top of Common Crawl: the deduplication, filtering, and quality steps. Useful for understanding what “filtered Common Crawl snapshot” actually means as an engineering effort.
-
Andrej Karpathy’s “Let’s build GPT: from scratch”. Two-hour video walking through next-token-prediction pretraining on a tiny dataset (Tiny Shakespeare). Pairs well with this lesson if you want to see the worked trace in code rather than on paper.
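For a quick feel of the objective these references describe, here is a minimal sketch in plain Python of the next-token loss: the average cross-entropy between the model's prediction at position t and the actual token at position t+1. The function names, the three-token vocabulary, and the hand-written logits are illustrative assumptions, not taken from the lecture or from any of the papers above; a real model would produce the logits itself.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position holds one list of vocabulary scores for each
    position except the last (the last token has nothing to predict).
    """
    targets = token_ids[1:]  # each position's target is the next token
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy sequence over a hypothetical 3-token vocabulary.
token_ids = [0, 2, 1]

# A model that knows nothing: uniform scores, so the loss is ln(3).
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_uniform = next_token_loss(uniform_logits, token_ids)

# A model that strongly favors the correct next token at each position.
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
loss_confident = next_token_loss(confident_logits, token_ids)

print(loss_uniform, loss_confident)
```

Pretraining amounts to adjusting the model's parameters so that this number falls across a very large corpus; the sketch only computes the loss and does no training.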
Adjacent topics
Topics that build on or sit beside this one.
-
Pretraining objectives in encoder-only and encoder-decoder models. This lesson covers the decoder-only recipe. Encoder-only models (the BERT family) use masked language modeling, covered in Phase 2, lesson 7. Encoder-decoder models (the T5 family) use span corruption, covered in Phase 2, lesson 6. The same word “pretraining” hides three different recipes, so always check which architecture family a source is describing.
-
Why scale specifically works. This lesson asserts that next-token prediction works because of scale. Phase 3, lesson 2, on scaling laws and Chinchilla, covers the empirical story behind that assertion: how loss falls predictably with more parameters and more data, and how to balance the two.
-
The cost of pretraining, in dollars and FLOPs. The lecturer’s “by far the most expensive” framing maps onto FLOPs (floating-point operations) as the unit of compute. The next lesson introduces FLOPs as a measure and quantifies what “training costs millions of dollars” means concretely.
-
Tokenization as a design choice. This lesson treats tokens as opaque inputs. The Phase 1 lesson on tokenization covers what tokens actually are, why some words split into pieces, and how vocabulary size affects everything downstream including pretraining cost. Worth a re-read after this lesson if the worked trace feels abstract.
-
Pretraining data quality and filtering. Beyond the named sources (Common Crawl, GitHub, Stack Overflow), modern pretraining sets do extensive deduplication, language identification, quality filtering, and toxicity screening. The RefinedWeb paper above is a good starting point. The choices made here have outsized effects on what the model learns.
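To make the data-cleaning item above concrete, here is a deliberately small sketch of two of the steps it names: exact deduplication and a crude length-based quality filter. The function names, the `min_words` threshold, and the sample documents are hypothetical, chosen only for illustration. Real pipelines typically go much further (near-duplicate detection, language identification, learned quality classifiers), and this is not the method of any specific paper.

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash the same.
    return " ".join(text.lower().split())

def clean_corpus(docs, min_words=5):
    """Drop very short documents and exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        # Crude quality filter: skip documents with too few words.
        if len(norm.split()) < min_words:
            continue
        # Exact deduplication via a hash of the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

# Hypothetical sample documents.
docs = [
    "The cat sat on the mat today.",
    "the  cat sat on the mat today.",   # duplicate after normalization
    "Click here",                        # too short to keep
    "A different sentence with enough words.",
]
cleaned = clean_corpus(docs)
print(cleaned)
```

Hashing the normalized text rather than the raw text is a design choice: it catches copies that differ only in case or spacing, at the cost of missing anything that differs by even one word.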
Original sources
The primary papers, in chronological order.
-
“Improving Language Understanding by Generative Pre-Training”, Radford et al., 2018. Original GPT. The decoder-only-with-next-token-pretraining template.
-
“Language Models are Unsupervised Multitask Learners”, Radford et al., 2019. GPT-2. First evidence at scale that next-token pretraining alone produces broad capability.
-
“Language Models are Few-Shot Learners”, Brown et al., 2020. GPT-3. The scale-up that established next-token pretraining as the dominant LLM recipe.
-
“The Llama 3 Herd of Models”, Grattafiori et al., 2024. The Llama 3 paper. Its 15-trillion-token pretraining figure matches the one the Stanford lecturer cites.
Community discussion
None selected for this lesson. The pretraining-recipe space at the level of this lesson is consolidated in the academic literature and the major-lab technical reports. Durable community references will be added at a future quarterly review if any emerge.