How instruction tuning makes a model helpful

A pretrained transformer is a great autocompleter. It is not an assistant.

If you take a fresh base model (the raw output of pretraining) and give it the prompt “Translate this to French: hello”, a plausible thing for it to produce next is “Translate this to Spanish: hola”. Not because it is being clever. Because somewhere in the trillions of tokens it read during pretraining, it has seen exactly that pattern: someone asks for a translation, then someone else asks for a different translation. The base model is continuing that pattern. It is not interpreting your request. It is doing what every previous lesson in this track described: predicting plausible next tokens given the previous ones.

The chat assistants you actually use feel different. You write “summarize this paragraph” and they produce a summary, not a continuation of the paragraph. You write “explain this code” and you get an explanation, not more code. That gap, between predicting plausible text and responding to your request, is the gap this phase closes.

The Stanford lecturer puts it as a one-line setup: pretraining gives you a model that is “a great autocompleter, but it’s not a helpful model yet, which is why we have a second step.” That second step is supervised fine-tuning, often abbreviated SFT. It is what this lesson is about, and it is the thing that turns a base model into something that follows instructions at all.

By the end you will know what SFT changes about the model, what it does not change, why a few thousand carefully written examples can transform surface behavior, what kind of model you have at the end of this stage, and the structural limitation that makes the next lesson necessary.

What pretraining gives you (and what it does not)

A base model is what you have at the end of the pretraining loop. The Phase 3 lessons described that loop: trillions of tokens of unfiltered text, one objective (predict the next token), months of compute, the engineering tricks that make it tractable on real hardware. What you get at the end of all that is impressive in a specific way and limited in another.

What pretraining gives you, in practice:

A vast store of factual knowledge implicitly encoded in the weights.
A working sense of grammar, syntax, and idiom across the languages it was trained on.
The ability to continue almost any kind of text in a stylistically plausible way.

What pretraining does not give you:

Instruction-following. The base model continues patterns; it does not interpret commands.
Helpfulness, in any human sense. Helpfulness is a property of responses to a request. The base model has no concept of a request; it is just continuing text.
A reliable response shape. Ask the same prompt twice and you may get a continuation, a list, a question, or a contradiction, depending on what plausibly follows in some corpus.

The lecturer makes the same point in different words: the pretrained model “knows about the structure of language, about code, basically all the text that it has been fed. But what this model can do is only predict the next token.” Knowing language is necessary. It is not sufficient.

SFT in one paragraph

Take the base model. Show it a small dataset of instruction-response pairs hand-written by humans. Train it with the same objective as pretraining (predict the next token), but only on those examples, and compute the loss only on the response tokens, with the instruction tokens serving as context. After enough examples, the model generalizes: when it sees an instruction it has never seen before, it produces something that looks like a valid response, in the shape the training examples used.

That is the whole mechanism. Same loss function as pretraining. Same architecture. Same weights, slightly nudged. Different data and a different scope.
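As a rough illustration of “same loss, different scope,” the sketch below computes the average next-token negative log-likelihood twice over one toy sequence: once over every position, as in pretraining, and once over the response positions only, as in SFT. The per-token probabilities and the masks are made-up numbers, not output from any real model, and the function name is hypothetical.

```python
import math

def next_token_loss(token_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1.

    token_probs[i] is the probability the model assigned to the correct
    token at position i (hypothetical values in this sketch).
    """
    selected = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    if not selected:
        raise ValueError("loss_mask selects no positions")
    return sum(selected) / len(selected)

# Toy sequence: 4 instruction tokens followed by 3 response tokens.
token_probs = [0.10, 0.20, 0.05, 0.30, 0.60, 0.70, 0.80]

pretraining_mask = [1, 1, 1, 1, 1, 1, 1]  # every token contributes
sft_mask = [0, 0, 0, 0, 1, 1, 1]          # only response tokens contribute

pretraining_loss = next_token_loss(token_probs, pretraining_mask)
sft_loss = next_token_loss(token_probs, sft_mask)

print(f"loss over all tokens:      {pretraining_loss:.3f}")
print(f"loss over response tokens: {sft_loss:.3f}")
```

The formula is identical in both calls. Only the mask differs, which is the sense in which SFT changes the scope of the loss rather than the loss itself.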

A typical SFT example might look like this in raw form:

Instruction: Translate the following English sentence to French.
Input: The cat sat on the mat.
Response: Le chat s'est assis sur le tapis.

The model is taught to predict the response tokens given the instruction-and-input tokens. After enough of these (the lecturer’s qualitative phrasing is “much smaller in scale, much higher quality” than the pretraining corpus), the model has learned the pattern of “instruction goes in, response comes out” well enough to apply it to instructions it has never seen.
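On the data side, turning a record like the one above into a training example might look roughly like this. It is a minimal sketch under loud assumptions: whitespace splitting stands in for a real tokenizer, the `build_example` helper is hypothetical, and the “Instruction / Input / Response” template is simply the raw form shown above, not a format any particular library requires.

```python
def build_example(record):
    """Turn one instruction record into (tokens, loss_mask).

    Whitespace splitting is a stand-in for a real tokenizer.
    """
    prompt = (
        f"Instruction: {record['instruction']}\n"
        f"Input: {record['input']}\n"
        "Response:"
    )
    prompt_tokens = prompt.split()
    response_tokens = record["response"].split()
    tokens = prompt_tokens + response_tokens
    # 0 = context only (no loss), 1 = token the model is trained to predict.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

record = {
    "instruction": "Translate the following English sentence to French.",
    "input": "The cat sat on the mat.",
    "response": "Le chat s'est assis sur le tapis.",
}

tokens, loss_mask = build_example(record)
trained_on = [tok for tok, m in zip(tokens, loss_mask) if m]
print(trained_on)
```

The instruction and input tokens are still fed to the model as context. They just do not contribute to the loss.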

What SFT actually adds (and what was already there)

This is the load-bearing distinction in the entire phase, and it is the place beginners most often get the mental model wrong.

SFT does not teach the model new knowledge. SFT teaches the model when to apply the knowledge it already has. The base model already knows French because it read French during pretraining. It already knows how to write code, summarize, explain, and translate, in the sense that all those capabilities are latent in its weights. What it does not know is that a user prompt asking for a translation is a request for a translation. SFT teaches it that. The lecturer puts it this way: “the model already knows what language is, what code is. You’re just trying to make it behave like the use case you’re trying to tune it for.”

So when you fine-tune a base model on a few thousand French-translation examples, you are not teaching it French. You are teaching it that the string “translate to French” is a signal to produce French output rather than continue the prompt as text. The capability was there. The trigger was not.

This is why a few thousand high-quality SFT examples can change surface behavior dramatically. Relative to pretraining, the weights barely move during SFT. But the pattern of producing a response shape rather than a continuation shape shifts decisively. SFT teaches a relatively shallow behavior whose underlying components the base model already has. What it adds is the cue for when to apply them.

The volume drop

Look at the data scale across the two stages so far:

Stage	Typical data	Objective	Typical training time
Pretraining	Trillions of tokens of unfiltered web text	Next-token prediction	Months on a cluster
SFT	A curated, much smaller corpus of instruction-response pairs	Same: next-token prediction (on response tokens)	Hours to days, depending on model size

Volume drops by many orders of magnitude. Compute drops by a similar factor. And yet the model after SFT feels recognizably like an assistant in a way the base model never could.
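To put a rough number on “many orders of magnitude,” here is a back-of-the-envelope calculation. All three inputs are assumptions chosen for illustration: one trillion pretraining tokens (the low end of “trillions”), 5,000 SFT examples (“a few thousand”), and an average of 500 tokens per example (a guess, not a figure from the lecture).

```python
import math

pretraining_tokens = 1e12   # assumed: low end of "trillions"
sft_examples = 5_000        # assumed: "a few thousand"
tokens_per_example = 500    # assumed average length, illustrative only

sft_tokens = sft_examples * tokens_per_example
ratio = pretraining_tokens / sft_tokens
orders_of_magnitude = math.log10(ratio)

print(f"SFT tokens: {sft_tokens:,}")
print(f"pretraining / SFT ratio: {ratio:,.0f}")
print(f"orders of magnitude: {orders_of_magnitude:.1f}")
```

Under these assumptions the SFT corpus is between five and six orders of magnitude smaller than the pretraining corpus. Different assumptions shift the exact figure, not the overall picture.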

That asymmetry is the whole reason post-training is interesting. You do most of the work in pretraining (building general capability into the weights at enormous cost), and then a tiny, targeted training run shapes how that capability presents itself. The lecturer’s phrasing: SFT data is “much smaller in scale but of much higher quality.” Quality means the examples were written carefully, by humans, demonstrating exactly the response shape the model is meant to produce.

Parameter-efficient fine-tuning (a quick name)

The lecturer mentions one engineering refinement worth a name: LoRA (Low-Rank Adaptation). LoRA is a way of doing SFT without updating all of the model’s weights. Instead, the base weights are held fixed, and a small number of additional weights (in the form of low-rank matrices) get trained on the SFT data. The model behaves much as if you had fine-tuned the whole thing, but you only paid to update a tiny fraction of the parameters.

This matters in practice for two reasons: SFT runs become much cheaper (fine-tuning a large model with LoRA needs far less hardware than updating every weight), and you can keep many specialized fine-tunes around without storing many full copies of the model (each LoRA is small relative to the base). LoRA does not change anything about what SFT does conceptually. It changes the engineering of how you do it. The mental model from earlier (response shape, not new knowledge) still applies.
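The “tiny fraction of the parameters” claim can be made concrete with a little arithmetic. The sketch below counts trainable parameters for one weight matrix under full fine-tuning versus a low-rank update, then applies a rank-1 update to a toy 2×2 weight. The layer size and rank are hypothetical, the helper names are invented for this example, and the sketch leaves out the scaling factor and initialization details that real implementations use.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in entries. The low-rank
    approach trains two small matrices instead:
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = d_out * rank + rank * d_in
    return full, lora

def matmul(X, Y):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def effective_weight(W, B, A):
    """Frozen base weight plus the low-rank update: W + B @ A."""
    BA = matmul(B, A)
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]

# Hypothetical layer size and rank, chosen only to show the scale.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")

# Tiny worked example: a 2x2 frozen weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights, never modified
B = [[0.5], [0.0]]             # trainable, 2 x 1
A = [[0.0, 2.0]]               # trainable, 1 x 2
print(effective_weight(W, B, A))
```

Because `W` is never modified, one copy of the base model can be shared across many small `(B, A)` pairs, which is why each fine-tune is cheap to store.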

You will see “LoRA” and “PEFT” (parameter-efficient fine-tuning, the umbrella term) in model release notes and open-source training repositories. The lesson does not need more than the name and the one-line idea.

What kind of model you have at the end

After SFT, you have an instruction-tuned model. Concretely:

It follows instructions. Asked to summarize, it summarizes. Asked to translate, it translates. Asked to explain, it explains.
It produces responses in a recognizable shape: greeting, body, closing, formatted appropriately for the request.
It has all the knowledge of the base model, accessible now in response form.

This is genuinely useful. Many open-source models you can download today are SFT-only. You could ship one as a chat assistant and many tasks would work fine. The lecturer comes back to this point explicitly: at the end of stage 2, the assistant may already behave the way you want, but, as he immediately adds, “not at the tone that you want.” That qualifier is the load-bearing one and is worth spelling out. SFT teaches the model to produce a response. It does not teach the model which response, among the many it could produce, is best.

Later in the lecture, the lecturer makes that limitation structural in a way every reader of this lesson should remember: “SFT is all about teaching the model what it should predict, but it does not teach the model what it should not predict.” Every example in an SFT dataset is a positive example, “this is what to do.” There are no negative examples, no “this would also be valid but worse,” no “this is a failure mode to avoid.” The training data does not contain that information, so the trained model does not learn it.
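One way to see the “no negative examples” point: the SFT loss on a demonstration depends only on the probability the model gives to the demonstrated response. In the toy sketch below, two hypothetical models give the demonstrated answer the same probability but split the remaining probability very differently between a harmless alternative and an unhelpful one. The response labels and probabilities are invented, and treating a whole response as a single outcome is a simplification of the per-token loss.

```python
import math

def sft_loss(response_probs, demonstrated):
    """Negative log-likelihood of the demonstrated response only."""
    return -math.log(response_probs[demonstrated])

# Same probability on the demonstrated answer, different spread elsewhere.
model_a = {"helpful": 0.70, "acceptable": 0.29, "unhelpful": 0.01}
model_b = {"helpful": 0.70, "acceptable": 0.01, "unhelpful": 0.29}

loss_a = sft_loss(model_a, "helpful")
loss_b = sft_loss(model_b, "helpful")

print(f"model A loss: {loss_a:.4f}")
print(f"model B loss: {loss_b:.4f}")
print("losses equal:", loss_a == loss_b)
```

The two losses are identical even though the second model is far more likely to produce the unhelpful response. Nothing in a positive-only objective tells the model which of the alternatives is the one to avoid.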

A concrete illustration of the gap: ask an instruction-tuned model “suggest a new activity I could do with my teddy bear,” and a plausible SFT response is “I would suggest not spending much time with your teddy bear at all.” That is in the shape of an answer. It is grammatically a response to the request. It is also unhelpful, slightly mean, and the kind of response a real assistant would never produce. The lecturer uses this example as the bridge into the next stage: SFT got you to “an answer shape”; preferences are how you get to “the better answer.”

Why this matters when you use AI

Three direct consequences when you read about AI models or interact with one.

“The base model” and “the assistant” are different artifacts. When a research lab releases both, you should treat them differently. The base model is the post-pretraining checkpoint; it does not follow instructions in the way you expect. The assistant is the post-tuning checkpoint; it does. This distinction is technical, not marketing.
An open-source “instruction-tuned” model has typically had SFT but may or may not have had any tuning beyond it. You will see model cards that say “fine-tuned on N instruction-response pairs” or “trained with SFT on dataset X.” That is exactly the stage this lesson covered. Such models tend to feel less polished than fully tuned commercial assistants, and the gap you feel is exactly what the next two lessons close.
Knowledge is from pretraining, but the answer is from SFT. When a model gives you a wrong answer, the wrongness often comes from one or the other. Pretraining can be missing the fact, or have an outdated version of it. SFT can be applying the wrong response shape (giving you a list when you wanted a paragraph, or formal language when you wanted casual). The two stages tend to fail in different ways, and noticing which is which makes you a better user of the model.

Common pitfalls

Two mistakes worth naming.

Thinking SFT alone is sufficient for a polished assistant. SFT is a real capability jump. It is not the last step. The “no negative signal” limitation the lecturer named is structural; you cannot SFT your way past it without changing the training objective. The next lesson is the structural fix.

Assuming SFT teaches new knowledge. It does not. Knowledge comes from pretraining. SFT teaches the model when to deploy what it already has. (One nearby gotcha: at very high SFT volumes the line with continued pretraining starts to blur, since enough new data can move the weights enough to inject new knowledge. At typical SFT scales the distinction is clean; the mental model “knowledge is from pretraining” remains the right starting point.)

What you should remember

Pretraining produces a base model. It is good at predicting next tokens. It does not follow instructions.
SFT teaches response shape, not new knowledge. Same loss as pretraining, different data: a curated, much smaller corpus of instruction-response pairs. The model learns to produce a response when it sees an instruction.
A few thousand high-quality examples can be enough to change surface behavior. The capability is already in the weights from pretraining. SFT activates it.
LoRA is a parameter-efficient way to do SFT. Engineering refinement, not a conceptual change.
The end-state is “correct on average.” SFT can only inject positive signal (“here is what to do”). It cannot teach which of many valid responses is best, and it cannot teach what not to predict. Lesson 2 is about how to collect preferences between responses and use them to teach the better response.

If you remember one thing

Pretraining fills the weights with everything the model knows.
Supervised fine-tuning teaches it to answer when someone asks.