Practice: How instruction tuning makes a model helpful
Self-check
A short retrieval pass. Try to answer each question in your head (or on paper) before reading its answer. Active retrieval is where the learning sticks; rereading is comfortable but does much less.
1. What is the difference between a base model and an instruction-tuned model?
A base model is the output of pretraining. It has been trained on a massive, largely uncurated text corpus using next-token prediction. It knows facts, grammar, and idioms, and it can continue almost any kind of text in a stylistically plausible way. It does not reliably follow instructions: write “summarize this paragraph” and it will typically continue the paragraph, not summarize it.
An instruction-tuned model is the output of supervised fine-tuning (SFT). It has the same architecture and mostly the same weights, slightly nudged, and it was trained on a much smaller, much higher-quality dataset of instruction-response pairs. It follows instructions: write “summarize this paragraph” and you get a summary in the shape of an answer, drawing on the knowledge the base model already had from pretraining.
2. Why is it incorrect to say “SFT teaches the model new knowledge”?
At typical SFT volumes, the dataset is too small to add much knowledge; what the model knows is already in the base model from pretraining. SFT teaches the model when to apply knowledge it already has. If you fine-tune on a few thousand French translation examples, you are not teaching it French; you are teaching it that when a user writes “translate to French,” the response should be in French rather than a continuation of the prompt. The capability was there. The trigger was not. SFT teaches response shape, not new content. (At very large SFT volumes the line with continued pretraining starts to blur, which the lesson’s pitfalls section flags.)
3. The Stanford lecturer says SFT “is all about teaching the model what it should predict, but it does not teach the model what it should not predict.” What does that mean in practice, and why does it motivate the next lesson?
Every example in an SFT dataset is a positive example: the input is an instruction, and the output is the response the labelers wanted to see. Nothing in the training data tells the model which other plausible responses are worse. There are no negative examples, no “this would have been almost right, but here is what was wrong with it.” The training signal is one-directional. As a result, the trained model picks a response that looks like the average of the labeler-written examples; it cannot distinguish the better from the worse among many valid-looking responses. Closing that gap takes preference data (judgments of which response is better than which). The next lesson covers how to collect that data; the lesson after it covers the algorithms that turn it into a training signal the weights can use.
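The one-directional signal can be made concrete with a toy calculation. The sketch below is a deliberate simplification: it treats each whole response as a single outcome with a probability, whereas real SFT scores responses token by token. The two "models" and their probabilities are hypothetical numbers chosen for illustration.

```python
import math

def sft_loss(response_probs, demonstrated):
    """Negative log-likelihood of the single demonstrated response.

    Toy simplification: each whole response is one outcome with a probability.
    """
    return -math.log(response_probs[demonstrated])

# Two hypothetical models. Both give the labeler's response the same probability,
# but they spread the remaining probability very differently.
model_a = {"good summary": 0.6, "okay summary": 0.35, "misleading summary": 0.05}
model_b = {"good summary": 0.6, "okay summary": 0.05, "misleading summary": 0.35}

loss_a = sft_loss(model_a, "good summary")
loss_b = sft_loss(model_b, "good summary")

# The loss only looks at the demonstrated response, so it cannot tell that
# model_b prefers a misleading answer far more often than model_a does.
print(loss_a == loss_b)
```

Under these assumptions the two losses are identical, which is the sense in which SFT says what to predict but not what to avoid.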
4. What is LoRA, and what is the practical reason it matters?
LoRA (Low-Rank Adaptation) is a parameter-efficient way to do supervised fine-tuning. Instead of updating all of the model’s weights, the base weights are largely held fixed, and a small number of additional weights (in the form of low-rank matrices) get trained on the SFT data. The model behaves much as if you had fine-tuned the whole thing, but you only paid to update a tiny fraction of the parameters. Two practical consequences: SFT runs become much cheaper (you do not need a large GPU cluster to fine-tune a large model with LoRA), and you can keep many specialized fine-tunes around without storing many full copies of the model (each LoRA adapter is small relative to the base). LoRA is conceptually identical to full SFT in what it teaches the model; it is an engineering refinement of how the training is done.
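A minimal sketch of the low-rank idea for a single weight matrix, in plain Python. It assumes the common formulation in which the effective weight is W + (alpha / r) * B A, with W frozen and only A and B trained; the sizes, the scaling factor `alpha`, and the helper names are illustrative assumptions, and no training loop is shown.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, leaving the frozen W untouched."""
    r = len(a)  # rank = number of rows in A
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def count(m):
    """Number of entries in a matrix."""
    return sum(len(row) for row in m)

random.seed(0)
d_out, d_in, r = 64, 64, 4  # illustrative sizes, not from the lesson

w = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]      # trainable, r x d_in
b = [[0.0] * r for _ in range(d_out)]                                     # trainable, d_out x r, starts at zero

w_eff = lora_effective_weight(w, a, b, alpha=8)

trainable = count(a) + count(b)
frozen = count(w)
print(f"trainable: {trainable}, frozen: {frozen}, fraction: {trainable / frozen:.3f}")
print("adapter changes nothing before training:", w_eff == w)
```

Because B starts at zero, the adapted layer initially matches the base layer exactly, and training only moves the small A and B matrices. In this toy the trainable fraction is still sizable because the matrix is tiny; the fraction shrinks as the layer grows while the rank stays small.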
5. For each behavior below, decide whether it comes from pretraining, from SFT, or from both.
a. The model correctly states that the boiling point of water is 100°C.
b. When you write “summarize this paragraph,” it produces a summary instead of continuing the paragraph.
c. When asked to translate, it produces output in the requested target language.
d. When given a math problem, it produces something in the shape of a multi-step solution.
e. When asked an out-of-distribution question, it sometimes confidently produces a plausible-sounding wrong answer.
a. Pretraining. Factual knowledge encoded in weights from reading large text corpora.
b. SFT. The base model would have continued the paragraph; SFT taught it that “summarize this” produces a summary.
c. Pretraining and SFT. Pretraining gave the model the target language (it learned French during pretraining). SFT gave it the trigger pattern (“translate to French” means “produce French output”).
d. SFT primarily, with knowledge from pretraining. The mathematical knowledge is from pretraining. The shape of “problem then steps then answer” is taught by SFT examples that follow that structure.
e. Pretraining gave it the confidence; SFT did not teach it to acknowledge uncertainty. The base model is comfortable continuing any text plausibly. SFT does not teach the model what not to predict, so confident-but-wrong answers are a structural limitation of an SFT-only model. The “I don’t know” capability comes from a stage after SFT, covered in the next two lessons.
6. The lesson says volume “drops by many orders of magnitude” between pretraining and SFT. Why is the volume drop interesting, not just an observation?
It tells you most of the work has already been done by the time SFT runs. Pretraining is months of compute on trillions of tokens; SFT is hours to days of compute on a curated corpus. The base model has all the underlying capability; SFT just teaches it when to apply it. That asymmetry is why a research lab can release new instruction-tuned variants of a base model on a near-continuous cadence (post-training is days), while truly new base models come out much less often (pretraining is months and tens of millions of dollars of compute). Knowing where the cost lives lets you read release notes more accurately.
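To get a feel for the size of the drop, here is a back-of-the-envelope calculation. All three input numbers are hypothetical placeholders picked to match the lesson's rough wording ("trillions of tokens", a curated set of instruction-response pairs); they are not figures from the lesson or from any particular model.

```python
import math

# All three numbers below are hypothetical placeholders, not figures from the lesson.
pretraining_tokens = 2e12   # stands in for "trillions of tokens"
sft_examples = 10_000       # assumed size of a curated instruction-response set
tokens_per_example = 500    # assumed average length of instruction plus response

sft_tokens = sft_examples * tokens_per_example
orders_of_magnitude = math.log10(pretraining_tokens / sft_tokens)

print(f"SFT tokens: {sft_tokens:,}")
print(f"Drop: about {orders_of_magnitude:.1f} orders of magnitude")
```

Changing the placeholders by a factor of ten in either direction moves the result by one order of magnitude, so the qualitative conclusion (a very large gap) is not sensitive to the exact guesses.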
Try it yourself: which stage adds which behavior
About 10 minutes. Pen and paper.
Setup. Below are six observed behaviors of a modern AI assistant. For each one, mark which training stage is the primary source of the behavior: pretraining, SFT, both (and which contributes more), or neither (it comes from a later stage). The point of the exercise is to internalize where the boundary between pretraining and SFT lies, not to debate edge cases.
1. The assistant correctly states the population of France.
2. When asked "Translate 'good morning' to Japanese," it produces a translation rather than continuing the prompt as text.
3. Given a recipe request, it produces a list with quantities, then steps, then notes; the structure feels coherent.
4. It refuses to help with a clearly harmful request, explaining why.
5. When asked an obscure question outside its knowledge, it sometimes hedges ("I'm not certain, but...") rather than producing a confident wrong answer.
6. Asked about a recent event from this year, it acknowledges uncertainty about the cutoff date of its training data.
Expected outcomes:
1. Pretraining. Factual knowledge from reading text.
2. SFT. The base model would continue the prompt; SFT taught the request-to-response mapping.
3. SFT primarily. The structured-output shape (list, steps, notes) comes from SFT examples. Some refinement comes from preference tuning, but the core “produce structured output when the request implies it” is SFT.
4. A stage after SFT, covered in the next lessons. Refusal is not something SFT teaches well, since the SFT signal is one-directional. Categorical refusal of harmful requests is a learned property that comes from later post-training stages.
5. A stage after SFT. Calibrated uncertainty (“I’m not certain”) is the kind of behavior SFT cannot teach by construction; it requires the model to learn what not to predict confidently, which is exactly the gap the lesson named.
6. Pretraining sets the cutoff; a later stage teaches the model to acknowledge it. The training data simply ends at some point (pretraining artifact). The model’s willingness to flag the gap comes from a stage after SFT.
Sanity check. The cleanest signals are at the edges. Knowing facts is pretraining. Producing a response shape rather than a continuation is SFT. Anything that looks like “the model knows when not to do something” almost certainly came from a stage after SFT. If you found yourself reaching for “later stage” on items 4-6, you are following the right gradient: those are exactly the behaviors Phase 4 lessons 2 and 3 explain.
Flashcards
Ten cards.
Q. What is a base model?
The output of pretraining. Trained on a massive text corpus with next-token prediction as the objective. Knows facts, languages, and idioms. Does not follow instructions; it continues plausible text.
Q. What is an instruction-tuned model?
The output of supervised fine-tuning (SFT). Same architecture as the base model, mostly the same weights, slightly nudged. Trained on a curated corpus of instruction-response pairs. Follows instructions; produces a response shape rather than a continuation shape.
Q. What is the SFT objective?
Same as pretraining: next-token prediction. The difference is the data (curated instruction-response pairs instead of raw web text) and the scope (only the response tokens contribute to the loss, not the instruction tokens).
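The "only the response tokens contribute" detail can be sketched as a masked average. This is an illustrative toy, not a training recipe: `target_probs` holds made-up probabilities that a model assigns to the correct next token at each position, and `loss_mask` marks which positions belong to the response.

```python
import math

def masked_next_token_loss(target_probs, loss_mask):
    """Average -log p(correct next token) over positions where loss_mask is 1."""
    terms = [-math.log(p) for p, m in zip(target_probs, loss_mask) if m]
    return sum(terms) / len(terms)

# Hypothetical toy sequence: three instruction tokens, then two response tokens.
tokens = ["Summarize", "this", ":", "Short", "summary"]
loss_mask = [0, 0, 0, 1, 1]                      # 1 = response token, counted in the loss
target_probs = [0.01, 0.20, 0.50, 0.25, 0.50]    # made-up model probabilities

sft_loss = masked_next_token_loss(target_probs, loss_mask)

# Pretraining-style scoring counts every position, instruction tokens included.
unmasked_loss = masked_next_token_loss(target_probs, [1] * len(tokens))

# Changing how well the model predicts the instruction tokens leaves the SFT loss unchanged.
altered_probs = [0.90, 0.90, 0.90, 0.25, 0.50]
same_sft_loss = masked_next_token_loss(altered_probs, loss_mask)

print(f"masked (SFT-style): {sft_loss:.4f}")
print(f"unmasked:           {unmasked_loss:.4f}")
print("instruction tokens ignored:", sft_loss == same_sft_loss)
```

The per-token formula is the same in both cases; only the set of positions that are averaged differs.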
Q. What does SFT actually teach the model?
Response shape, not new knowledge. The base model already knows the content from pretraining; SFT teaches it that an instruction is a request for a response in a particular shape. A few thousand high-quality examples can change surface behavior dramatically because the underlying capability was already there.
Q. Why does the volume drop from pretraining to SFT matter?
It tells you most of the work has been done before SFT runs. Pretraining is months on trillions of tokens; SFT is hours to days on a curated corpus. The cost asymmetry is why post-training updates can ship on a near-continuous cadence while truly new base models come out much less often.
Q. What is LoRA?
Low-Rank Adaptation. A parameter-efficient way to do SFT: hold most of the base weights fixed and train a small set of low-rank matrices on the SFT data. Conceptually identical to full SFT; engineering refinement that makes runs cheaper and lets you keep many specialized fine-tunes alongside one base.
Q. What is PEFT?
Parameter-Efficient Fine-Tuning. The umbrella term for techniques (including LoRA) that fine-tune a model by updating only a small fraction of its parameters.
Q. What structural limit of SFT does the lecturer identify?
“SFT is all about teaching the model what it should predict, but it does not teach the model what it should not predict.” Every SFT example is a positive example. There is no negative signal. The next lesson is about how preference data can teach the model the difference between better and worse responses.
Q. What does it mean to call an instruction-tuned model 'correct on average'?
It produces a response in the right shape, drawing on knowledge from pretraining. Among the many plausible responses it could produce for a given instruction, it picks something close to the average of the training examples. Often correct; rarely the best possible answer.
Q. What is the one-sentence takeaway from this lesson?
Pretraining fills the weights with everything the model knows; supervised fine-tuning teaches it to answer when someone asks.