Summary: Fine-tuning LLMs

The assistant-style models you use went through a different fine-tuning than the classifier in lesson 3. Task fine-tuning adds a head and trains a model for one narrow task. Supervised fine-tuning (SFT) keeps the language-modeling head and trains the generative model on many instruction-and-response examples, so it learns to follow instructions and answer like an assistant. SFT costs real compute, so the order is: try prompting an instruction-tuned model first, and use SFT only for template control, domain adaptation, or a cheaper specialized model. Its data is conversations (role-tagged messages), laid out by the model’s chat template (apply_chat_template). The tool is SFTTrainer from TRL, the Trainer loop specialized for SFT, and LoRA makes it affordable by training small added matrices instead of all the weights. This is the scan version; the lesson keeps everything at a mechanical, how-it-works level.

Core ideas

Task fine-tuning vs SFT. Task fine-tuning adds a head for one task (label output); SFT trains the generative model on instructions and responses to follow instructions broadly. One makes a model good at a task, the other makes a base model good at being an assistant.
Prompt first, SFT second. A prompt is free; SFT costs compute. Reach for SFT only for template control (strict format), domain adaptation, or cost (a smaller fine-tuned model).
SFT data is conversations. Role-tagged system/user/assistant messages, laid out by the model’s chat template via tokenizer.apply_chat_template. The wrong template breaks behavior.
SFTTrainer is the familiar loop, specialized. From TRL, built on transformers; configured with SFTConfig, auto-applies the chat template for messages datasets.
LoRA makes it affordable. Freeze the base weights, train small added low-rank matrices; large memory savings, pretrained knowledge preserved. The standard parameter-efficient (PEFT) approach.
SFT is one stage: pretrain (learn language), SFT (learn to follow instructions), then optional preference tuning (RLHF, DPO) to refine preferred responses.

What changes for you

This lesson demystifies the assistant and hands you a real capability. The instruction-following behavior you take for granted is not built into the architecture; it was trained in on top of a base model that, alone, would just autocomplete your text. That reframing matters: it explains why these models follow instructions, and why their behavior is only as good as the data they were tuned on, which leads directly into the next lesson on data quality. Practically, with the right sequencing (prompt first, then SFT with LoRA when prompting falls short) you can specialize an open model to your domain or output format on affordable hardware. A small open model plus targeted SFT is increasingly how teams reach production-quality behavior without paying for the largest general models. The next lesson turns to the data that makes any of this work.

The chat assistants you use are base models taught to behave, and SFT is where they learn it. Knowing task fine-tuning from instruction tuning is what lets you choose, correctly, between writing a better prompt and training the model itself.