
Cheatsheet: Pretraining: how a model learns language by predicting the next word

For a decoder-only model:
pretraining objective = predict the next token
data = the open internet (Common Crawl + code + more)
scale = hundreds of billions to tens of trillions of tokens
cost = by far the most expensive stage
The base model is fluent at continuing text.
It is not yet an assistant. Tuning (Phase 4) handles that.
Paradigm | Pattern
Old: one model per task | A separate model is trained from scratch for each task (spam, sentiment, topic), each on its own labeled dataset.
New: transfer learning | One large pretrained base model is built once on the open internet, then any specific task is reached by a small, cheap tuning run on top of that base.

Property | Value
Objective name | Next-token prediction (also “causal language modeling,” also “language modeling”)
Data labels | None. Source text is its own training signal.
Architecture family | Decoder-only. (BERT-family uses MLM; T5-family uses span corruption.)
What the model produces per step | A probability distribution over the entire vocabulary for the next token
Loss | Negative log of the probability the model gave to whatever token actually came next: -log(p)
Effect on weights | Tiny adjustment so the same prefix gets slightly more probability on the actual next token
Repeated | Across every position in every sentence in the training set, billions of times
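
The loss and update rows above can be sketched in plain Python. This is a toy under stated assumptions, not how a real model is trained: the four-token vocabulary and the score values are made up for illustration, and the gradient step is applied directly to the output scores (logits) instead of to network weights. It shows the loss as -log(p) for the actual next token, and that one small gradient step gives that token slightly more probability.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Negative log of the probability assigned to the actual next token."""
    return -math.log(softmax(logits)[target_index])

def gradient_step(logits, target_index, lr=0.1):
    """One gradient-descent step on the loss, treating logits as parameters.

    For softmax followed by -log(p[target]), the gradient with respect to
    the logits is (probabilities - one_hot(target)).
    """
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target_index else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

# Hypothetical toy vocabulary and scores (not taken from any real model).
vocab = [" mat", " floor", " bed", " roof"]
logits = [2.0, 1.0, 0.6, -2.0]
target = vocab.index(" mat")

before = next_token_loss(logits, target)
after = next_token_loss(gradient_step(logits, target), target)
print(f"loss before: {before:.3f}, loss after one step: {after:.3f}")
```

In a real model the same gradient flows back through every layer, so the adjustment lands on the weights that produced the scores, not on the scores themselves.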

Source | Detail
Common Crawl | Open web crawler archive. ~3 billion pages added per month per the Stanford lecturer. The dominant base.
Wikipedia and Reddit | Cited by name in lecture as examples of the kinds of sites inside the crawl.
Code | The lecturer specifically names GitHub and Stack Overflow. Layered alongside natural language.
Non-English text | Multilingual support comes from including multiple languages in the same training set.

Token-scale anchors (cited in lecture)
Order of magnitude | Hundreds of billions to tens of trillions of tokens
GPT-3 | Roughly 300 billion tokens
Llama 3 | Roughly 15 trillion tokens

source: "The cat sat on the mat."
tokenized: [The] [ cat] [ sat] [ on] [ the] [ mat] [.]
step (target = " mat"):
  prefix: [The] [ cat] [ sat] [ on] [ the]
  output: probability over entire vocabulary, e.g.:
    " mat": 0.31
    " floor": 0.12
    " bed": 0.08
    " roof": 0.005
    ... (~50,000 other tokens)
  signal: the actual next token was " mat"
  loss: -log(0.31) ≈ 1.17 (natural log)
  update: weights nudged so " mat" gets a bit more probability next time

The ranking is what matters. The model has internalized that cats are more likely to be on floor-coverings than structural exteriors. That is what gets folded into weights, one tiny step at a time.
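
To make the link between probability and loss concrete, the short sketch below recomputes -log(p) for the four probabilities listed in the worked example. The helper name `loss_if_next_token_were` is hypothetical, and the natural logarithm is assumed because it reproduces the 1.17 figure in the example.

```python
import math

# Probabilities copied from the worked example above.
example_probs = {" mat": 0.31, " floor": 0.12, " bed": 0.08, " roof": 0.005}

def loss_if_next_token_were(token):
    """Loss -log(p) the model would incur if `token` were the actual next token."""
    return -math.log(example_probs[token])

for token, p in example_probs.items():
    print(f"{token!r}: p = {p}, loss = {loss_if_next_token_were(token):.2f}")
```

The loss for " mat" comes out at about 1.17, while " roof" would cost about 5.30. The lower the probability the model gave to the token that actually appeared, the larger the loss and the stronger the correction.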

Phenomenon | Where it comes from
Knowledge cutoffs | Pretraining cutoffs. The corpus was sampled at date X; the model knows the open web at X, plus or minus. Some assistants use live-web tools at inference, but that is a tool, not a change to the model’s brain.
Hallucinations | Pretraining-era statistics. The model learned the shape of “things that look like this” without learning the specific fact. Tuning improves how the model talks; it cannot retroactively add facts.
Personality differences across assistants | Tuning, not pretraining. Two assistants that feel different were tuned differently. Two that feel the same on factual questions probably share a pretraining lineage.

Pitfall | Reality
“A modern chat assistant was trained on chat data” | The base model was pretrained on next-token prediction over web-scale text, no chat involved. Tuning (Phase 4) added the conversational format and personality.
“Predicting the next word is a narrow task” | The objective is narrow; what the model has to learn to satisfy it at internet scale is not. The narrowness is in the objective, not in the resulting capability.
“All language models are pretrained the same way” | No. Decoder-only uses next-token prediction; BERT-family encoders use MLM; T5-family encoder-decoders use span corruption. Read for the architecture family.
“After pretraining, the model is ready to use” | A pretrained base model is fluent at continuing text but not yet a chat assistant. It does not know it is being asked questions, when to stop, or which answers are appropriate. Phase 4 handles all that.

  • Pretraining: the giant front-loaded training stage on a vast unlabeled corpus, run once. For decoder-only models, the objective is next-token prediction.
  • Next-token prediction: the objective of producing a probability distribution over the vocabulary for the next token, given a prefix.
  • Causal language modeling: another name for the same objective; “causal” because the prediction at each position only depends on tokens to its left.
  • Transfer learning: the paradigm of learning the underlying competence (language) once on a vast unlabeled corpus, then adapting to specific tasks via cheaper second-stage training.
  • Tuning: the smaller, cheaper second stage that adapts a pretrained base model into a usable assistant for a specific task. Includes instruction tuning, RLHF, DPO (Phase 4).
  • Common Crawl: the open web-crawler archive that is the dominant pretraining data source. ~3 billion pages added per month per the Stanford lecturer.
  • Token: a chunk of text the model operates on. Often a whole word for common words; longer or rarer words split into sub-pieces. Phase 1, lesson 1 covers tokenization in detail.
  • Knowledge cutoff: the date the pretraining corpus was sampled. The model knows the open web as of that date, with later updates limited to whatever tuning data was added.
  • Base model: the output of pretraining, before tuning. Fluent at continuing text, not yet a chat assistant.
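
The “causal” idea in the glossary, that each prediction sees only the tokens to its left, can be sketched by listing the (prefix, target) pairs one sentence yields. This is a logical view only: `training_pairs` is a hypothetical helper, the token list reuses the cheatsheet's example sentence, and the first token is skipped because no start-of-sequence marker is assumed. Real training pipelines typically score all positions in one pass using a causal mask instead of looping like this.

```python
def training_pairs(tokens):
    """Yield (prefix, target) pairs: one per position after the first token."""
    for i in range(1, len(tokens)):
        # The prefix holds only tokens to the left of position i.
        yield tokens[:i], tokens[i]

tokens = ["The", " cat", " sat", " on", " the", " mat", "."]
pairs = list(training_pairs(tokens))

for prefix, target in pairs:
    print(f"{''.join(prefix)!r} -> {target!r}")
```

A seven-token sentence gives six training signals here, which is why a corpus of trillions of tokens supplies trillions of prediction steps without any human labels.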

Pretraining is one objective: predict the next token.
Repeated billions of times across the open internet.
Everything else, tuning and alignment and reasoning, is built on top.