Teacher (large, well-trained) → produces full output distribution
Student (smaller) → trained to match teacher's distribution
Loss: KL divergence from the teacher's distribution to the student's.
Cross-entropy on a hard label is the special case where the target distribution is one-hot.
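The relationship between the two losses can be checked numerically. The following is a minimal sketch in plain Python, not the loss code of any particular library; the logits and the helper names (`softmax`, `cross_entropy`, `entropy`, `kl_divergence`) are made up for illustration. It uses the identity KL(target || pred) = H(target, pred) - H(target).

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    """H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def entropy(dist):
    """H(dist) = -sum_i dist_i * log(dist_i)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def kl_divergence(target, pred):
    """KL(target || pred) = H(target, pred) - H(target)."""
    return cross_entropy(target, pred) - entropy(target)

# Hypothetical logits for one input over three classes.
teacher_probs = softmax([4.0, 2.0, 0.5])
student_probs = softmax([2.5, 2.0, 1.0])

# Distillation loss: the student is pulled toward the teacher's full distribution.
distill_loss = kl_divergence(teacher_probs, student_probs)

# With a one-hot target the entropy term is zero, so KL equals cross-entropy.
hard_label = [1.0, 0.0, 0.0]
assert math.isclose(kl_divergence(hard_label, student_probs),
                    cross_entropy(hard_label, student_probs))
```

Because the teacher's entropy does not depend on the student, minimizing this KL term and minimizing cross-entropy against the teacher's soft targets give the same gradients for the student.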
| Why “soft targets” | Hinton’s framing: “the soft targets contain almost all the knowledge.” A model’s full output distribution carries information about how each input relates to every class, not just the chosen one. That richness is the training signal a hard label throws away. |
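To make the "thrown away" signal concrete, here is a small illustrative sketch. The class names and teacher logits are invented for the example. Two inputs receive the same hard label, but only the soft targets record that one of them also resembles a second class.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "truck"]

# Hypothetical teacher logits for two different inputs.
logits_a = [5.0, 4.0, -2.0]   # a cat photo that looks somewhat dog-like
logits_b = [5.0, -2.0, -2.0]  # an unambiguous cat photo

soft_a, soft_b = softmax(logits_a), softmax(logits_b)
hard_a = classes[soft_a.index(max(soft_a))]
hard_b = classes[soft_b.index(max(soft_b))]

# Both inputs collapse to the same hard label...
assert hard_a == hard_b == "cat"
# ...but only the soft targets show that input A is close to "dog".
assert soft_a[1] > soft_b[1]
```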
Drop NSP (next sentence prediction) entirely. Lecturer’s framing: removing NSP led to “no decrease in performance almost.” The original BERT paper reported that NSP helped; RoBERTa’s re-test said otherwise.
Recipe change 2
Dynamic masking. In original BERT, the masking pattern is fixed once during data prep. RoBERTa re-masks every epoch, which effectively gives many more “different” training examples from the same source data.
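The difference between the two strategies can be sketched as follows. This is a simplified illustration, not BERT's or RoBERTa's actual preprocessing code: the toy corpus and the function names are hypothetical, and the masking rule is reduced to replacing each token with `[MASK]` with probability 0.15 (the real recipe also sometimes keeps the token or swaps in a random one).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

# Toy corpus, whitespace-tokenized for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "distillation trains a small student model".split(),
]

# Static masking: choose the pattern once during data prep, reuse it every epoch.
prep_rng = random.Random(0)
static_data = [mask_tokens(sent, prep_rng) for sent in corpus]

def static_epochs(num_epochs):
    for _ in range(num_epochs):
        yield static_data  # identical masked text each epoch

def dynamic_epochs(num_epochs, seed=0):
    rng = random.Random(seed)
    for _ in range(num_epochs):
        # Fresh masks are drawn each time the data is served.
        yield [mask_tokens(sent, rng) for sent in corpus]
```

Iterating `static_epochs(n)` returns the same masked sentences every time, while `dynamic_epochs(n)` hides different tokens across epochs, so the model has to predict different positions of the same source text.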
Recipe change 3
More data. BERT was “vastly undertrained” per the lecturer. RoBERTa scaled up the pretraining data significantly in volume and diversity. Benchmark performance improved correspondingly.
DistilRoBERTa exists in the wider ecosystem as a distilled version of RoBERTa. It combines RoBERTa’s training improvements with DistilBERT’s compression approach. Use it when both quality and latency matter.
No, they address different limitations and stack (DistilRoBERTa exists).
Distillation is “just a smaller copy”
No. Distillation requires the teacher’s full output distribution as the training signal. Without that, the smaller model is just a smaller model trained from scratch.
RoBERTa changed the architecture
No. Same as BERT. Only the pretraining recipe is different.
The BERT paper’s NSP defense was empirical proof
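No. The BERT paper did report an ablation in which removing NSP hurt some tasks, but that result did not hold up on closer examination: RoBERTa re-tested it and found that dropping NSP matched or slightly improved performance. Worth noticing when you read paper design rationales: a claim in the original paper and an independently re-tested result are different things.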
Knowledge distillation: training a smaller “student” model to match the output distribution of a larger “teacher” model, rather than only on the original task’s hard labels.
Soft targets: the full probability distribution over output classes that a trained model produces. The richer signal that distillation uses.
Hard label: a single class per example, either the ground-truth annotation or the most-likely class from a model’s output. The poorer signal that supervised learning typically uses.
KL divergence: the loss function used in distillation. Measures how far the student’s distribution is from the teacher’s (zero when they match). Reduces to standard cross-entropy when the target is a one-hot hard label.
DistilBERT: a student with half as many layers as BERT, trained via distillation. Smaller and faster with comparable performance. The paper is only ~4 pages but has had outsized influence.
RoBERTa: same architecture as BERT with a better pretraining recipe (no NSP, dynamic masking, more data). Same size, better quality.
Dynamic masking: RoBERTa’s per-epoch re-masking strategy, vs BERT’s fixed-once approach.
DistilRoBERTa: distilled version of RoBERTa; combines compression and recipe improvements.
DistilBERT is BERT compressed via distillation. RoBERTa is BERT trained better. Same family, different problems.