
References: How instruction tuning makes a model helpful

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 5, LLM tuning): https://www.youtube.com/watch?v=PmW_TMQ3l0I
Companion lecture (Lecture 4, LLM training; covers SFT in passing): https://www.youtube.com/watch?v=VlA_jt_3Qc4
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the SFT recap section of Stanford CME 295 Lecture 5
(roughly 00:01:50 through 00:05:14, with a callback to the structural-
limitation framing at 00:11:09-00:11:33). The preference-tuning, RLHF,
PPO, reward-hacking, and DPO portions of the lecture are deliberately
deferred to the next two lessons in this phase. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”

  • “Training language models to follow instructions with human feedback”, Ouyang et al., 2022. The InstructGPT paper. The methods section describes the SFT step (Section 3.5), with dataset composition and labeling covered in the neighboring subsections on the dataset and on human data collection (Sections 3.2 and 3.4). The paper is the canonical written description of “SFT followed by preference tuning” as a recipe; this lesson is about the SFT half. Read Section 3.2 first if you want to see what an actual labeled SFT dataset looks like in practice.

  • “LoRA: Low-Rank Adaptation of Large Language Models”, Hu et al., 2021. The LoRA paper. The introduction and method section are unusually approachable for an ML paper. If the lesson’s one-line description of LoRA left you wanting the picture-version, this is where to go. The empirical section is what made the technique a default in the open-source ecosystem.

  • “Finetuned Language Models Are Zero-Shot Learners”, Wei et al., 2021. The FLAN paper. The first widely cited demonstration that instruction tuning across many tasks makes a model substantially better at instructions it has never seen. The empirical setup is the closest published analog to “what SFT actually does.” Useful if you want to understand why a relatively small instruction-tuning dataset generalizes the way it does.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The post-pretraining section (Section 5) gives a one-page reference covering SFT, preference tuning, and RLHF in their dense visual style. Print it; it pairs well with this lesson’s flashcards as a single-page review surface.

  • Hugging Face PEFT library documentation. The reference implementation used by most open-source LoRA fine-tuning today. If you want to actually run an SFT job (with LoRA) on a small open model, the PEFT docs are the practical entry point. Reading the introduction and the LoRA quickstart will give you the working artifact behind the conceptual one this lesson described.
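For readers who want the arithmetic behind the LoRA entry above before opening the paper, here is a minimal sketch in plain Python. It assumes the standard LoRA formulation, in which the update to a frozen d_out × d_in weight matrix is the product of a d_out × r matrix and an r × d_in matrix. The 4096 × 4096 layer size and rank 8 are illustrative choices, not figures from the lecture, and the function name is hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    full: every entry of the d_out x d_in matrix is trainable.
    lora: only the two low-rank factors are trainable,
          B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative sizes (assumed, not taken from the lesson).
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA, rank 8:     {lora:,} trainable parameters")
print(f"fraction trained: {lora / full:.4%}")
```

The count covers a single matrix; a real LoRA job applies the same trick to a chosen set of matrices across the model, so the overall fraction depends on which layers are adapted.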

Topics that build on or sit beside this one.

  • Instruction-tuning datasets in the open. Datasets like Alpaca, Dolly, and OpenAssistant Conversations are the public-facing analog of the proprietary SFT datasets at frontier labs. Browsing one of these datasets for fifteen minutes is a good way to internalize what “an SFT example” actually looks like at volume. The Hugging Face Datasets hub is the entry point.

  • Continued pretraining versus SFT. The pitfalls section flagged the boundary briefly. Continued pretraining injects new domain knowledge into the base model by feeding it more raw text from a new domain. SFT teaches response shape on knowledge the model already has. The line is fuzzy at very high SFT volumes. Many open-source domain-specific models combine both stages. The papers on domain-adaptive pretraining (e.g., BioMedLM, Code Llama’s continued pretraining stage) are useful starting points.

  • Quantized LoRA (QLoRA). Combines LoRA with weight quantization (covered in the previous lesson) so you can fine-tune even larger base models on commodity hardware. The QLoRA paper (Dettmers et al., 2023) is a useful next step if both lessons (quantization and this one) clicked.

  • What comes after SFT. Lesson 2 of this phase, How preferences become reward signals, picks up the negative-signal gap this lesson named and shows how human comparison data fills it. Lesson 3 covers the algorithms (RLHF and DPO) that put those preferences into the weights.
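As a companion to the instruction-tuning datasets entry above, the sketch below shows roughly what one SFT example looks like once it is turned into a prompt and a target. The instruction / input / output field names follow the layout used by Alpaca-style datasets. The record itself, the "###" template, and the function name are hypothetical, chosen only for illustration; real datasets and training libraries each define their own templates.

```python
# Hypothetical record in the instruction / input / output layout.
record = {
    "instruction": "Summarize the text in one sentence.",
    "input": "LoRA freezes the base weights and trains two small matrices.",
    "output": "LoRA adapts a model by training small low-rank matrices.",
}


def format_example(rec: dict) -> tuple:
    """Turn one record into (prompt, target) strings.

    The prompt is what the model sees; the target is the response
    the model is trained to produce after the prompt.
    """
    prompt = f"### Instruction:\n{rec['instruction']}\n\n"
    # The input field is optional; skip the section when it is empty.
    if rec.get("input"):
        prompt += f"### Input:\n{rec['input']}\n\n"
    prompt += "### Response:\n"
    return prompt, rec["output"]


prompt, target = format_example(record)
print(prompt + target)
```

Browsing a public dataset with this shape in mind makes it easier to see how much of each example is instruction and how much is demonstrated response.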

None selected for this lesson. The public discussion of SFT and parameter-efficient fine-tuning has consolidated around the Hugging Face PEFT documentation, the lab papers above, and a small number of high-quality open-source training repositories that rotate too quickly to be worth pinning. If a canonical thread surfaces, it will be added at the next quarterly review.