Summary: How these models keep improving: DistilBERT and RoBERTa
Most BERT-family models in production today are not BERT. They are DistilBERT or RoBERTa, and each was a one-paper response to a specific limitation in the original. DistilBERT addressed size and latency through knowledge distillation: train a smaller student to mimic a larger teacher’s output distribution. RoBERTa addressed pretraining complexity by dropping NSP, doing dynamic masking on every epoch, and training on much more data. Both are short papers with outsized influence. With this lesson, the Lecture 2 adaptation is complete.
This summary is the scan-it-in-five-minutes version. The full lesson covers BERT’s limitations, the conceptual framing of distillation, what DistilBERT does specifically, and what RoBERTa changed in the training recipe.
Core ideas
- BERT had three concrete limitations. Context length (originally 512 tokens; addressed by attention efficiency tricks from L2.3). Latency and size (110M parameters at BERT-base scale; addressed by DistilBERT). Pretraining complexity (MLM + NSP were assumed but not tested; addressed by RoBERTa).
- Knowledge distillation, the concept. Hinton’s framing: “the soft targets contain almost all the knowledge.” A trained model produces a probability distribution over output classes; the full distribution carries more information than the single most likely class. Distillation trains a smaller “student” model to match the full output distribution of a larger “teacher” model.
- The objective is KL divergence. It measures how close the student’s distribution is to the teacher’s. Cross-entropy loss is a special case: when the target is a hard (one-hot) label, the target’s entropy is zero and KL divergence equals cross-entropy. So distillation generalizes supervised learning’s loss; it is a richer training signal, not a different paradigm.
- DistilBERT halves BERT’s layer count. From BERT-base’s 12 layers to 6 in DistilBERT. The smaller student is trained via distillation against the original BERT’s output distribution. Result: about 40% fewer parameters (roughly 66M versus 110M, since the embedding layer is not halved), almost the same downstream performance, and correspondingly faster inference.
- DistilBERT is famously short. ~4 pages. The paper could be that short because distillation was already an established concept; DistilBERT just applied it cleanly to BERT, and its influence far exceeds its length.
- RoBERTa is the same architecture with a better training recipe. Three changes, all in pretraining.
- Change 1: drop NSP. The RoBERTa authors tested whether NSP was actually contributing. The lecturer’s framing: removing NSP led to “no decrease in performance almost.” So they dropped it. The original BERT authors had assumed NSP was helping; the empirical evidence said otherwise.
- Change 2: dynamic masking. In original BERT, the masking pattern for an input is decided once during data preparation. RoBERTa re-masks on every epoch. This effectively gives the model many more “different” training examples from the same source data.
- Change 3: much more data. The RoBERTa authors observed that BERT was undertrained and scaled up the pretraining data significantly. Benchmark performance improved.
- The two derivatives address different limitations. DistilBERT: smaller and faster (distillation, slight quality drop). RoBERTa: same size, better quality (recipe changes). Choice depends on which constraint matters most.
- They can stack. DistilRoBERTa exists in the wider ecosystem as a distilled version of RoBERTa.
- Pitfall: DistilBERT and RoBERTa are not competitors. They address different limitations and combine.
- Pitfall: distillation requires the teacher’s full distribution. Without that, the smaller model is just a smaller model trained from scratch; the soft-targets advantage is lost.
- Pitfall: RoBERTa did not change the architecture. Same as BERT. Only the pretraining recipe is different.
- Pitfall: the BERT paper’s NSP defense was an assumption, not empirical proof. RoBERTa tested it; the assumption did not hold.
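The static-versus-dynamic masking contrast from Change 2 can be sketched in a few lines of Python. This is a simplified illustration, not BERT's or RoBERTa's actual preprocessing: the `mask_tokens` helper, the toy sentence, the seeds, and the 15% masking rate are assumptions made for the example, and the sketch masks whole words and always substitutes `[MASK]`.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with a random subset replaced by MASK.

    Hypothetical helper: masks round(len * mask_prob) positions, at least one.
    """
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "the quick brown fox jumps over the lazy dog".split()
num_epochs = 3

# Static masking: the pattern is chosen once during data preparation,
# so every epoch sees the identical masked example.
static_example = mask_tokens(tokens, random.Random(0))
static_epochs = [static_example for _ in range(num_epochs)]

# Dynamic masking: a fresh pattern is drawn each time the example is used,
# so the same source sentence can yield different training examples.
dynamic_rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, dynamic_rng) for _ in range(num_epochs)]

for epoch, (static, dynamic) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(f"epoch {epoch} static : {' '.join(static)}")
    print(f"epoch {epoch} dynamic: {' '.join(dynamic)}")
```

The only structural difference is where the random draw happens: outside the epoch loop for static masking, inside it for dynamic masking.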
What changes for you
When you load DistilBERT or RoBERTa from a model repository (or any of their descendants like DistilRoBERTa), you know the lineage. DistilBERT: smaller-faster via distillation. RoBERTa: same size, better trained. The choice depends on whether you need the speed or the quality. Distillation as a concept also generalizes well past BERT: the teacher-student-soft-targets recipe is a common production pattern for shipping smaller variants of larger models, including in the modern LLM era.
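As a rough sketch of the teacher-student-soft-targets recipe, the snippet below computes a KL-divergence distillation loss for a single example in plain Python. The logits, the 3-class setup, and the `temperature` value are illustrative assumptions (temperature softening comes from the general distillation literature, not from this lesson), and this is not DistilBERT's actual training loss. It also checks the point made earlier: with a hard one-hot target, KL divergence reduces to cross-entropy.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature softens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def kl_divergence(target, predicted):
    """KL(target || predicted) = sum_i target_i * log(target_i / predicted_i)."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical logits for one example in a 3-class problem.
teacher_logits = [4.0, 2.5, 0.5]
student_logits = [3.0, 1.0, 1.5]
temperature = 2.0  # assumed value, chosen only for illustration

# Soft targets: the teacher's full distribution, not just its top class.
teacher_soft = softmax(teacher_logits, temperature)
student_soft = softmax(student_logits, temperature)
distill_loss = kl_divergence(teacher_soft, student_soft)

# Hard-label case: the target's entropy is zero, so KL equals cross-entropy.
hard_label = [1.0, 0.0, 0.0]
student_probs = softmax(student_logits)
hard_kl = kl_divergence(hard_label, student_probs)
hard_ce = cross_entropy(hard_label, student_probs)

print("distillation loss:", distill_loss)
print("hard-label KL:", hard_kl, "cross-entropy:", hard_ce)
```

In a real training loop this loss would be averaged over a batch and minimized with respect to the student's parameters while the teacher stays frozen.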
Lecture 2 closes here. The lecture walked the modern transformer’s evolution from the 2017 paper through three architectural branches (encoder-decoder, decoder-only, encoder-only) and ended on BERT’s two most influential descendants. The next Stanford lecture (and the next Clawdemy track lessons) will open the post-pretraining and applications side of LLMs.
DistilBERT is BERT compressed via distillation.
RoBERTa is BERT trained better.
Same family, different problems.