References: Pretraining: how a model learns language by predicting the next word

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 4, LLM training):
    https://www.youtube.com/watch?v=VlA_jt_3Qc4
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson adapts the transfer learning, pretraining objective, and Common
Crawl section of Stanford CME 295 Lecture 4 (~7m44s to ~13m23s, the
opening pedagogical arc of the lecture). The lecture continues into
scaling laws (covered in Phase 3, lesson 2), parallelism and
FlashAttention (Phase 3, lesson 3), and quantization (Phase 3, lesson 4).
Clawdemy provides original notes, summaries, and quizzes derived from
this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

Going deeper

A short list, chosen for durability.

“Language Models are Few-Shot Learners”, Brown et al., 2020. The GPT-3 paper. Section 2 (Approach) covers the next-token pretraining objective at the scale that made decoder-only LLMs the dominant paradigm. The paper's 300-billion-token training figure is the one the Stanford lecturer cites.
“Improving Language Understanding by Generative Pre-Training”, Radford et al., 2018. The original GPT paper. Earlier and shorter than GPT-3, but the paper that established the template: a decoder-only model, next-token pretraining, then fine-tuning. Worth reading after the GPT-3 paper for historical context.
Common Crawl. The open project the Stanford lecturer points to as the dominant pretraining data source. The site has the raw archives, statistics, and a useful “Get Started” page if you want to see the data shape directly.
“The RefinedWeb Dataset for Falcon LLM”, Penedo et al., 2023. A worked example of how a modern frontier-class pretraining set is built on top of Common Crawl: the deduplication, filtering, and quality steps. Useful for understanding what “filtered Common Crawl snapshot” actually means as an engineering effort.
Andrej Karpathy’s “Let’s build GPT: from scratch”. Two-hour video walking through next-token-prediction pretraining on a tiny dataset (Tiny Shakespeare). Pairs well with this lesson if you want to see the worked trace in code rather than on paper.
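
For readers who want a smaller first step than the video above, here is a minimal sketch of the next-token objective itself. It is not the lecture's code or Karpathy's: it assumes whitespace-split words as tokens and uses a count-based bigram table as a stand-in for a neural network, and the function names are made up for illustration. What it does share with real pretraining is the shape of the task: every position in the text becomes a training example whose label is simply the next token, and the loss is the average negative log-probability the model assigned to that token.

```python
import math
from collections import Counter, defaultdict

def next_token_pairs(tokens):
    """Each prefix of the text is a context; the token after it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def train_bigram(tokens):
    """Toy 'model': estimate P(next | previous token) by counting.
    (Assumption: a real LLM conditions on the whole prefix, not one token.)"""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def average_loss(counts, tokens):
    """Mean negative log-likelihood of each actual next token (cross-entropy).
    Evaluated on the training text, so every probability is non-zero."""
    total, n = 0.0, 0
    for prev, nxt in zip(tokens, tokens[1:]):
        p = counts[prev][nxt] / sum(counts[prev].values())
        total += -math.log(p)
        n += 1
    return total / n

tokens = "the cat sat on the mat".split()  # assumption: words as tokens
pairs = next_token_pairs(tokens)
counts = train_bigram(tokens)
loss = average_loss(counts, tokens)

for context, target in pairs:
    print(" ".join(context), "->", target)
print("average loss:", round(loss, 4))
```

The only uncertainty this toy model has is after “the”, which is followed once by “cat” and once by “mat”, so those two positions are the only ones that contribute to the loss. Pretraining a transformer replaces the counting table with a network and the six words with a web-scale corpus, but minimizes the same quantity.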

Adjacent topics

Topics that build on or sit beside this one.

Pretraining objectives in encoder-only and encoder-decoder models. This lesson covers the decoder-only recipe. Encoder-only (BERT-family) uses masked language modeling, covered in Phase 2 lesson 7. Encoder-decoder (T5-family) uses span corruption, covered in Phase 2 lesson 6. The same word “pretraining” hides three different recipes; always check which architecture family is meant.
Why scale specifically works. This lesson asserts that next-token prediction works because of scale. Phase 3, lesson 2, on scaling laws and Chinchilla, covers the empirical story behind that assertion: how loss falls predictably with more parameters and more data, and how to balance the two.
The cost of pretraining, in dollars and FLOPs. The lecturer’s “by far the most expensive” framing maps onto FLOPs (floating-point operations) as the unit of compute. The next lesson introduces FLOPs as a measure and quantifies what “training costs millions of dollars” means concretely.
Tokenization as a design choice. This lesson treats tokens as opaque inputs. The Phase 1 lesson on tokenization covers what tokens actually are, why some words split into pieces, and how vocabulary size affects everything downstream including pretraining cost. Worth a re-read after this lesson if the worked trace feels abstract.
Pretraining data quality and filtering. Beyond the named sources (Common Crawl, GitHub, Stack Overflow), modern pretraining sets do extensive deduplication, language identification, quality filtering, and toxicity screening. The RefinedWeb paper above is a good starting point. The choices made here have outsized effects on what the model learns.
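
To make the data-quality item above concrete, here is a rough sketch of two of the steps it names, exact deduplication and a quality filter. It is an illustration under stated assumptions, not the RefinedWeb pipeline or any lab's recipe: the function names, the five-word minimum, and the 30% symbol threshold are all invented for this example, and production pipelines also handle near-duplicates, language identification, and toxicity screening, which are omitted here.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def looks_reasonable(text, min_words=5, max_symbol_ratio=0.3):
    """Toy quality heuristic. Both thresholds are illustrative assumptions."""
    if len(text.split()) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio

def clean_corpus(docs):
    """Drop low-quality documents, then keep the first copy of each exact duplicate."""
    seen = set()
    kept = []
    for doc in docs:
        if not looks_reasonable(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalizing
    "Buy now!!! $$$ >>> click <<< ###",               # mostly symbols
    "Too short.",                                     # under the word minimum
    "Pretraining data is filtered before a model ever sees it.",
]
cleaned = clean_corpus(docs)
print(len(docs), "documents in,", len(cleaned), "documents kept")
```

Hashing a normalized copy rather than the raw text is a deliberate choice: it catches duplicates that differ only in case or spacing while still storing the original document. Anything fuzzier than that needs near-duplicate techniques that are out of scope for this sketch.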

Original sources

The primary papers, in chronological order.

“Improving Language Understanding by Generative Pre-Training”, Radford et al., 2018. Original GPT. The decoder-only-with-next-token-pretraining template.
“Language Models are Unsupervised Multitask Learners”, Radford et al., 2019. GPT-2. First evidence at scale that next-token pretraining alone produces broad capability.
“Language Models are Few-Shot Learners”, Brown et al., 2020. GPT-3. The scale-up that established next-token pretraining as the dominant LLM recipe.
“The Llama 3 Herd of Models”, Grattafiori et al., 2024. The Llama 3 paper. The 15-trillion-token figure matches the figure the Stanford lecturer cites.

Community discussion

None selected for this lesson. The pretraining-recipe space at the level of this lesson is consolidated in the academic literature and the major-lab technical reports. Durable community references will be added at a future quarterly review if any emerge.