
References: How these models keep improving: DistilBERT and RoBERTa

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 2, Transformer-based models & tricks):
https://www.youtube.com/watch?v=yT84Y5zCnaA
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the BERT-derivatives section of Stanford CME 295
Lecture 2 (~6080s-6430s, the closing section of the lecture). With
this lesson, our adaptation of Lecture 2 is complete. The next Stanford
lecture (and the next track of Clawdemy lessons) opens the
post-pretraining and applications side of LLMs. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

A short list, chosen for durability.

Topics that build on or sit beside this one.

  • The wider BERT family. Beyond DistilBERT and RoBERTa: ALBERT (parameter-efficient via cross-layer weight sharing), ELECTRA (replaces MLM with replaced-token detection for sample efficiency), DeBERTa (disentangled attention that encodes content and position separately). There are also domain variants (BioBERT, SciBERT, ClinicalBERT, FinBERT) and multilingual variants (mBERT, XLM, XLM-R). All are descendants of BERT with their own takes on improving the original recipe.

  • Distillation in modern LLMs. The teacher-student recipe, in which a small student model is trained on the soft targets produced by a larger teacher, is now a common production pattern for shipping smaller variants of larger language models. Search terms: “model distillation,” “knowledge distillation for LLMs,” “data-efficient distillation.”

  • Why NSP didn’t help (deeper dive). RoBERTa’s NSP ablation prompted further investigation. ALBERT replaced NSP with sentence-order prediction (SOP), a harder task in which the model must tell whether two consecutive segments appear in their original order or have been swapped. The pattern: the original NSP task was too easy. Its negative pairs were random sentences, which are obviously different from genuine continuations, so the model’s “NSP signal” was almost trivially solvable and contributed little to the learned representations. Worth knowing for the broader lesson about training-objective design.

  • Where to go next in our adaptation. The Lecture 2 adaptation (six lessons) is complete here. The next planned arc is the third Stanford lecture (Large Language Models). We have partial coverage of it in L3.1 and L3.2 from a pre-protocol pass; those lessons need transcript-faithful re-passes. After that comes the rest of the Stanford CME 295 syllabus.
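
For readers following up on the distillation topic above, here is a minimal sketch of the soft-target idea in plain Python. It is an illustration under stated assumptions, not the recipe from the lecture or from any particular paper: the function names, the example logits, and the temperature of 2.0 are all hypothetical, and only the soft-target term is shown. The student is scored by the cross-entropy between its temperature-softened distribution and the teacher's.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's.

    The temperature value is an assumption chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different token

loss_close = soft_target_loss(teacher, close_student)
loss_far = soft_target_loss(teacher, far_student)
print(loss_close < loss_far)
```

The loss is smallest when the student reproduces the teacher's full distribution, including the relative weight the teacher gives to the tokens it did not pick. That extra information is what a hard one-hot label would discard.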

The primary papers, in chronological order.

None selected for this lesson. The BERT derivatives space is mature, with the relevant discussion consolidated into the academic literature and the Hugging Face documentation cycle. Durable references will be added at a future quarterly review if any consolidate.