References: How these models keep improving: DistilBERT and RoBERTa

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks):
    https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the BERT-derivatives section of Stanford CME 295
Lecture 2 (~6080s-6430s, the closing section of the lecture). With
this lesson, our adaptation of Lecture 2 is complete. The next Stanford
lecture (and the next track of Clawdemy lessons) opens the
post-pretraining and applications side of LLMs. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

Going deeper

A short list, chosen for durability.

“DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”, Sanh et al., 2019. The DistilBERT paper, which the lecturer describes as “famously four pages.” Sections 2 and 3 cover the distillation setup and the architectural choice; section 4 covers the empirical results.
“RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Liu et al., 2019. The RoBERTa paper. Methodically tests the key design choices in the original BERT recipe. The NSP-doesn’t-help finding, dynamic masking, and the data-scale ablation are all in here.
“Distilling the Knowledge in a Neural Network”, Hinton, Vinyals, and Dean, 2015. The canonical distillation paper. The lecturer’s “soft targets contain almost all the knowledge” framing comes from related Hinton lecture material; this paper formalizes the loss function and the temperature-scaled softmax variant. Read after the DistilBERT paper for the conceptual foundation.
Hugging Face Transformers documentation for DistilBERT and RoBERTa. The de facto reference for using these models in practice. Covers loadable pre-trained checkpoints, fine-tuning examples, and tokenizer differences (DistilBERT reuses BERT’s WordPiece vocabulary; RoBERTa uses byte-level BPE). Both models are heavily used in production.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material in their dense visual style.
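
The temperature-scaled softmax and soft-target loss mentioned in the Hinton et al. entry above can be sketched in a few lines. This is a minimal illustration, not the DistilBERT training code: the function names, the logits, and the temperature value are made up for the example, and it uses only the Python standard library.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
hard = softmax(teacher, temperature=1.0)   # peaked distribution
soft = softmax(teacher, temperature=4.0)   # flatter "soft targets"

# A student that agrees with the teacher scores a lower loss
# than one that ranks the classes in the opposite order.
good = distillation_loss([3.5, 1.2, 0.4], teacher)
bad = distillation_loss([0.4, 1.2, 3.5], teacher)
print(hard, soft, good, bad)
```

Raising the temperature flattens the teacher distribution, so the small probabilities on the wrong classes carry more signal for the student. In practice this soft-target term is combined with other losses; DistilBERT, for instance, also uses a masked-language-modeling loss and a cosine embedding loss.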

Adjacent topics

Topics that build on or sit beside this one.

The wider BERT family. Beyond DistilBERT and RoBERTa: ALBERT (parameter-efficient via cross-layer weight sharing), ELECTRA (replaces MLM with replaced token detection for sample efficiency), DeBERTa (disentangled attention that encodes content and relative position separately). There are also domain variants (BioBERT, SciBERT, ClinicalBERT, FinBERT) and multilingual variants (mBERT, XLM, XLM-R). All are encoder-only descendants with their own takes on improving the original BERT recipe.
Distillation in modern LLMs. The teacher-student-soft-targets recipe is now a common production pattern for shipping smaller variants of larger language models. Search terms: “model distillation,” “knowledge distillation for LLMs,” “data-efficient distillation.”
Why NSP didn’t help (deeper dive). RoBERTa’s NSP ablation prompted further investigation. ALBERT replaced NSP with sentence-order prediction (SOP), which its authors report does help on multi-sentence downstream tasks. The pattern: the original NSP task was too easy. Its negatives are random sentences, usually from a different document, so topic cues alone separate them from genuine continuations, and the model can solve the task without learning much about coherence. SOP keeps both segments from the same document and asks only whether their order was swapped. Worth knowing for the broader lesson about training-objective design.
Where to go next in our adaptation. The Lecture 2 adaptation (six lessons) is complete here. The next planned arc is the third Stanford lecture, Large Language Models. We have partial coverage of it via L3.1 and L3.2 from a pre-protocol pass, and those lessons need transcript-faithful re-passes. After that comes the rest of the Stanford CME 295 syllabus.
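
Returning to the NSP versus SOP entry above, the difference is easiest to see in how the training pairs are built. The sketch below is illustrative only: the helper names and the two toy documents are invented for the example, and the real pipelines work on token segments rather than whole sentences.

```python
import random

def nsp_pair(doc, other_doc, rng):
    """NSP-style pair: a genuine continuation, or a random sentence from another document."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], 1          # label 1: real next sentence
    return doc[i], rng.choice(other_doc), 0   # label 0: negative from a different document

def sop_pair(doc, rng):
    """SOP-style pair: two consecutive sentences, either in order or swapped."""
    i = rng.randrange(len(doc) - 1)
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return first, second, 1               # label 1: original order
    return second, first, 0                   # label 0: same sentences, order swapped

# Toy documents on clearly different topics (hypothetical data).
cooking = ["Chop the onions.", "Fry them in butter.", "Add the stock.", "Simmer for an hour."]
finance = ["Rates rose again.", "Bond yields followed.", "Equities sold off."]

rng = random.Random(0)
print(nsp_pair(cooking, finance, rng))
print(sop_pair(cooking, rng))
```

An NSP negative pairs a cooking sentence with a finance sentence, so topic alone gives the label away. An SOP negative stays within one document, so the model has to attend to ordering and coherence instead.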

Original sources

The primary papers, in chronological order.

“Distilling the Knowledge in a Neural Network”, Hinton et al., 2015. The conceptual root of distillation.
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Devlin et al., 2019. The model these derivatives improve on.
“RoBERTa”, Liu et al., 2019. The training-recipe improvements.
“DistilBERT”, Sanh et al., 2019. The compression via distillation.

Community discussion

None selected for this lesson. The BERT derivatives space is mature, with the relevant discussion consolidated into the academic literature and the Hugging Face documentation cycle. Durable references will be added at a future quarterly review if any consolidate.