Summary: Pretraining: how a model learns language by predicting the next word
Pretraining is one objective, repeated at scale. For decoder-only models (the dominant family for generative AI today, more than 90 percent of modern LLMs by the Stanford lecturer’s count), that objective is to predict the next token. There is no labeled data and no human annotation. The model is shown a piece of text, asked to assign a probability to every possible next token, and rewarded for putting probability mass on whatever token actually came next in the source. Repeated billions of times across the open internet, this single objective produces almost every language ability you experience when you talk to a modern AI.
This summary is the scan-it-in-five-minutes version. The full lesson covers the transfer-learning paradigm shift, the next-token objective, the data, a worked trace of one training step, and the common misreads that the pitfalls section addresses.
Core ideas
- The lesson is about decoder-only models specifically. BERT (Phase 2, lesson 7) uses masked language modeling; T5 (Phase 2, lesson 6) uses span corruption. Decoder-only models, the family this lesson covers, use next-token prediction. When a paper or model card says “pretrained on…” this is almost always the recipe meant.
- The old paradigm: one model per task. Spam detection, sentiment extraction, topic classification: each got its own model trained from scratch on its own labeled dataset. Each model had to relearn English from its own training set.
- Transfer learning is the move from “one model per task” to “one general model, then adapt.” Pretrain once on a vast unlabeled corpus to learn the underlying competence (language). Then for any specific task, run a much smaller, much cheaper second stage to tune the pretrained model.
- The pretraining objective is one sentence long: predict the next token. No labels. The source text is its own training signal: feed all but the last token into the model, ask it to assign a probability to every possible next token, reward it for putting mass on whatever came next.
- Why next-token prediction works. Predicting the next word in arbitrary text exercises almost everything else: world knowledge, grammar, statistical patterns, discourse logic. A model trained only on this single objective ends up with a usable internal map of how concepts relate.
- The data is web-scale. Common Crawl is the dominant source: an open project that adds something like three billion pages per month to its archive. Most modern LLM pretraining sets are built on a filtered Common Crawl snapshot.
- Code, books, papers, and non-English text are layered on top. The lecturer specifically names GitHub and Stack Overflow for code, and notes that text in non-English languages is included.
- Scale is measured in tokens. Pretraining corpora run from hundreds of billions to tens of trillions of tokens. Two examples cited in lecture: GPT-3 roughly 300 billion tokens, Llama 3 roughly 15 trillion. The order of magnitude is the takeaway.
- One training step, concretely. The model sees a prefix, outputs a probability distribution over the entire vocabulary, gets a loss based on the negative log of the probability it gave to whatever token actually came next, and adjusts its weights very slightly. Repeated for every position in every sentence in every web page in the training set.
- Pretraining is “by far the most expensive” stage. That is the lecturer’s exact phrase. The next lesson (Phase 3, lesson 2) covers the scaling-laws math that quantifies the cost.
- A pretrained base model is not a chat assistant. It is fluent at continuing text, but does not know it is being asked questions or that it should stop talking. Turning a base model into the assistant you actually use is what Phase 4 covers (instruction tuning, RLHF, DPO).
- Knowledge cutoffs are pretraining cutoffs. When a chat assistant says “I don’t know about events after X,” X is the date the pretraining corpus was sampled. Tuning adds little new factual material.
- Hallucinations usually trace to pretraining-era statistics. When a model invents a plausible citation or asserts a wrong fact confidently, the cause is almost always that pretraining learned the statistical shape of “things like this” without learning the specific fact. Tuning improves how the model talks; it does little to supply facts that pretraining missed.
- The model’s “personality” is tuning, not pretraining. Friendly tone, refusal patterns, structured response style: all that is added in Phase 4. Two assistants that feel different were tuned differently; two that feel the same on factual questions probably share a pretraining lineage.
- Pitfall: not all language models are pretrained the same way. Decoder-only is the dominant family but not the only one. Always read for which architecture family a paper means by “pretraining.”
- Pitfall: predicting the next word is not a narrow task. It looks narrow because it is one objective, but successfully predicting next words across the open internet requires the model to internalize an enormous amount of indirect knowledge.
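The single training step described in the list above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any production model is trained: the five-word vocabulary is hypothetical, the starting scores are made up, and the update is applied directly to the output scores (logits), whereas a real model backpropagates the same loss into billions of weights. What carries over is the loss itself: the negative log of the probability assigned to the token that actually came next.

```python
import math

# Hypothetical five-token vocabulary, for illustration only.
VOCAB = ["the", "cat", "sat", "mat", "."]


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target_id, learning_rate=0.5):
    """One gradient step on the logits for a single (prefix, next token) pair.

    Returns the loss measured before the update and the adjusted logits.
    """
    probs = softmax(logits)
    # Loss: negative log-probability given to the token that actually came next.
    loss = -math.log(probs[target_id])
    # For softmax followed by this loss, d(loss)/d(logit_i) is
    # probs[i] - 1 for the target token and probs[i] for every other token.
    grads = [p - (1.0 if i == target_id else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - learning_rate * g for x, g in zip(logits, grads)]
    return loss, new_logits


# Stand-in for the scores a model might emit after the prefix "the cat"
# (assumed values: an untrained model with no preference among tokens).
logits = [0.0] * len(VOCAB)
target_id = VOCAB.index("sat")  # the token that actually came next

loss_before, logits = training_step(logits, target_id)
loss_after, _ = training_step(logits, target_id)
print(f"loss before: {loss_before:.3f}, after one step: {loss_after:.3f}")
```

The loss falls after the update because the step shifts probability mass toward the observed token and away from the others. Pretraining is this adjustment applied at every position of every document.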
What changes for you
When you read about a model “pretrained on X tokens,” you now know what that one sentence means: the model was shown X tokens of source text and asked to predict the next token at every position. When you experience knowledge cutoffs, hallucinations, or personality differences across assistants, you can identify which stage of training is responsible. When you see “transfer learning” in a paper or a job description, you know the paradigm: the expensive stage runs once, on someone else’s dime, and is then reused across many cheaper tuning runs. The next lesson takes the cost claim (“by far the most expensive”) and explains why scale specifically is what makes the objective work.
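To make “predict the next token at every position” concrete, here is a minimal sketch of how one piece of text becomes many training examples without any labels. It rests on assumptions: whitespace splitting stands in for a real tokenizer, and `next_token_examples` is a hypothetical helper name, not a function from any library.

```python
def next_token_examples(tokens):
    """Pair each prefix with the token that follows it in the source text."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting is an assumption; real models use subword tokenizers.
tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for prefix, target in examples:
    print(" ".join(prefix), "->", target)

# A sequence of n tokens yields n - 1 prediction positions.
print(len(examples))
```

Each pair is one prediction the model is scored on, so a corpus of X tokens supplies on the order of X training signals, all taken from the text itself.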
Pretraining is one objective: predict the next token.
Repeated billions of times across the open internet.
Everything else, tuning and alignment and reasoning, is built on top.