A separate model is trained from scratch for each task (spam, sentiment, topic), each on its own labeled dataset.
New: transfer learning
One large pretrained base model is built once on the open internet, then any specific task is reached by a small, cheap tuning run on top of that base.
update: weights nudged so " mat" gets a bit more probability next time
The ranking is what matters. The model has internalized that cats are more likely to be on floor-coverings than structural exteriors. That is what gets folded into weights, one tiny step at a time.
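A minimal sketch of that single nudge, under stated assumptions: the four-token vocabulary, the starting scores, and the learning rate are all hypothetical, and the step is applied directly to the scores (logits) as a stand-in for the much larger update that flows back into a real model's weights. It uses the standard cross-entropy gradient for a softmax output.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and starting scores for one prefix (illustrative only).
vocab = [" mat", " rug", " roof", " wall"]
logits = [1.0, 0.8, 0.2, 0.1]
target = vocab.index(" mat")  # the token that actually came next in the text

before = softmax(logits)

# Cross-entropy gradient w.r.t. each logit: (p - 1) for the target, p otherwise.
learning_rate = 0.1  # assumed value, chosen only to make the change visible
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(before)]
logits = [x - learning_rate * g for x, g in zip(logits, grads)]

after = softmax(logits)
for tok, b, a in zip(vocab, before, after):
    print(f"{tok!r}: {b:.3f} -> {a:.3f}")
```

The target token gains a little probability and the others give a little up; repeated over a huge corpus, steps like this are what shape the ranking.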
Pretraining cutoffs. The corpus was sampled at date X, so the model knows the open web roughly as of X. Some assistants use live-web tools at inference, but that is a tool, not a change to the model’s brain.
Hallucinations
Pretraining-era statistics. The model learned the shape of “things that look like this” without learning the specific fact. Tuning improves how the model talks; it cannot retroactively add facts.
Personality differences across assistants
Tuning, not pretraining. Two assistants that feel different were tuned differently. Two that feel the same on factual questions probably share a pretraining lineage.
“A modern chat assistant was trained on chat data”
The base model was pretrained on next-token prediction over web-scale text, no chat involved. Tuning (Phase 4) added the conversational format and personality.
“Predicting the next word is a narrow task”
The objective is narrow; what the model has to learn to satisfy it at internet scale is not. The narrowness is in the objective, not in the resulting capability.
“All language models are pretrained the same way”
No. Decoder-only models use next-token prediction; BERT-family encoders use masked language modeling (MLM); T5-family encoder-decoders use span corruption. Check which architecture family a model belongs to before assuming its objective.
“After pretraining, the model is ready to use”
A pretrained base model is fluent at continuing text but not yet a chat assistant. It does not know it is being asked questions, when to stop, or which answers are appropriate. Phase 4 handles all that.
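To make the difference between the three pretraining objectives concrete, here is a rough sketch of how one sentence could be turned into training examples under each. It is a simplification under stated assumptions: whitespace-split words stand in for real tokens, the masked positions and the span are picked by hand rather than sampled, and the `[MASK]` and `<extra_id_N>` marker names follow common convention rather than anything specified in this lesson. The function names are hypothetical.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

def next_token_examples(tokens):
    """Decoder-only style: each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mlm_example(tokens, masked_positions):
    """BERT style (simplified): hide chosen positions, predict the originals."""
    inputs = ["[MASK]" if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets

def span_corruption_example(tokens, start, length):
    """T5 style (simplified): swap one contiguous span for a sentinel,
    then predict the sentinel followed by the missing span."""
    inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + length:]
    targets = ["<extra_id_0>"] + tokens[start:start + length] + ["<extra_id_1>"]
    return inputs, targets

print(next_token_examples(tokens)[:2])
print(mlm_example(tokens, {2}))
print(span_corruption_example(tokens, 3, 2))
```

Only the first of these trains the model to continue text from left to right, which is why the base models discussed here are fluent continuers.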
Pretraining: the giant front-loaded training stage on a vast unlabeled corpus, run once. For decoder-only models, the objective is next-token prediction.
Next-token prediction: the objective of producing a probability distribution over the vocabulary for the next token, given a prefix.
Causal language modeling: another name for the same objective; “causal” because the prediction at each position only depends on tokens to its left.
Transfer learning: the paradigm of learning the underlying competence (language) once on a vast unlabeled corpus, then adapting to specific tasks via cheaper second-stage training.
Tuning: the smaller, cheaper second stage that adapts a pretrained base model into a usable assistant for a specific task. Includes instruction tuning, RLHF, DPO (Phase 4).
Common Crawl: the open web-crawler archive that is the dominant pretraining data source. Roughly 3 billion pages are added per month, according to the Stanford lecturer.
Token: a chunk of text the model operates on. Often a whole word for common words; longer or rarer words split into sub-pieces. Phase 1, lesson 1 covers tokenization in detail.
Knowledge cutoff: the date the pretraining corpus was sampled. The model knows the open web as of that date, with later updates limited to whatever tuning data was added.
Base model: the output of pretraining, before tuning. Fluent at continuing text, not yet a chat assistant.
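The "causal" restriction in the glossary entry above can be pictured as a visibility table. The sketch below is illustrative only and rests on an assumption the lesson does not state: that the restriction is enforced with a lower-triangular mask, which is how decoder-only transformers commonly do it. The helper name `causal_mask` and the example tokens are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may look at position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "sat", "on"]
mask = causal_mask(len(tokens))

# Each position sees itself and everything to its left, never the future,
# so its guess about the following token cannot peek at the answer.
for i, row in enumerate(mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"position {i} sees {visible}")
```

Because nothing to the right is visible, every position in a sequence yields a valid next-token training example in a single pass.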
Pretraining is one objective: predict the next token, repeated billions of times across the open internet. Everything else, tuning and alignment and reasoning, is built on top.