
Cheatsheet: How these models keep improving: DistilBERT and RoBERTa

BERT had three limitations. Two derivatives ship widely:
Latency / size limit → DistilBERT (compress via distillation)
Pretraining-complexity limit → RoBERTa (drop NSP, more data, dynamic mask)
Same architecture family, different problems.

| Limitation | Detail | Addressed by |
|---|---|---|
| Context length | 512 tokens originally | Attention efficiency tricks (L2.3) |
| Latency and size | 110M parameters at BERT-base scale | DistilBERT (this lesson) |
| Pretraining complexity | MLM + NSP without empirical justification for NSP | RoBERTa (this lesson) |

Teacher (large, well-trained) → produces full output distribution
Student (smaller) → trained to match teacher's distribution
Loss: KL divergence between student and teacher distributions.
Cross-entropy is a special case (when target is a hard label).
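
To make the loss concrete, here is a minimal sketch in plain Python. The logits and the three-class setup are made up for illustration. It computes the KL divergence between a teacher's soft targets and a student's distribution, then checks that KL and cross-entropy coincide once the target is collapsed to a one-hot hard label.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, pred):
    """KL(target || pred); zero-probability target entries contribute nothing."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

def cross_entropy(target, pred):
    """Cross-entropy of pred against target."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

# Illustrative logits (assumed values, not from any real model).
teacher_probs = softmax([4.0, 2.5, 0.5])   # soft targets
student_probs = softmax([3.0, 1.0, 1.0])

# Distillation signal: match the teacher's whole distribution.
soft_kl = kl_divergence(teacher_probs, student_probs)

# Hard label: keep only the teacher's top class as a one-hot target.
top = teacher_probs.index(max(teacher_probs))
hard_label = [1.0 if i == top else 0.0 for i in range(len(teacher_probs))]
hard_kl = kl_divergence(hard_label, student_probs)
hard_ce = cross_entropy(hard_label, student_probs)

print(f"KL against soft targets: {soft_kl:.4f}")
print(f"KL against hard label:   {hard_kl:.4f}")
print(f"CE against hard label:   {hard_ce:.4f}")
```

The last two values match because a one-hot target has zero entropy, so the entropy term that normally separates KL divergence from cross-entropy vanishes.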

| Why “soft targets” | Hinton’s framing: “the soft targets contain almost all the knowledge.” A model’s full output distribution carries information about how each input relates to every class, not just the chosen one. That richness is the training signal a hard label throws away. |

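
| Property (DistilBERT) | Detail |
|---|---|
| Architectural change | Halve the layer count (BERT-base 12 layers → DistilBERT 6) |
| Training change | Distillation: train the student to match the BERT teacher’s output distribution, not the original task’s hard labels |
| Loss | KL divergence (plus standard MLM loss; in practice combined) |
| Result | About 40% smaller than BERT-base (halving the layers leaves the embedding parameters untouched), about 60% faster at inference, and about 97% of BERT’s language-understanding performance, as reported in the DistilBERT paper |
| Paper | ~4 pages, outsized influence |
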
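
The combined loss can be sketched as a weighted sum of a distillation term and an MLM term. This is an illustration under stated assumptions, not DistilBERT's actual training code: the weight `alpha`, the `temperature` used to soften both distributions, the function names, and the four-token vocabulary are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature; higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_term(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def mlm_term(student_logits, true_token_id):
    """Standard cross-entropy against the true (hard-label) masked token."""
    return -math.log(softmax(student_logits)[true_token_id])

def combined_loss(student_logits, teacher_logits, true_token_id,
                  alpha=0.5, temperature=2.0):
    """Weighted sum of the two terms; alpha and temperature are assumed values."""
    kd = distillation_term(student_logits, teacher_logits, temperature)
    mlm = mlm_term(student_logits, true_token_id)
    return alpha * kd + (1.0 - alpha) * mlm

# One masked position over a toy 4-token vocabulary (made-up logits).
teacher_logits = [4.0, 2.5, 0.5, -1.0]
student_logits = [3.0, 1.0, 1.0, 0.0]
loss = combined_loss(student_logits, teacher_logits, true_token_id=0)
print(f"combined loss: {loss:.4f}")
```

With `alpha=1.0` the student is trained on the teacher's distribution alone, and the loss is zero exactly when the student's logits reproduce the teacher's.
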

| Property (RoBERTa) | Detail |
|---|---|
| Architectural change | None (same as BERT) |
| Recipe change 1 | Drop NSP entirely. Lecturer’s framing: removing NSP led to “no decrease in performance almost.” Original BERT had assumed NSP was helping; the empirical evidence said otherwise. |
| Recipe change 2 | Dynamic masking. In original BERT, the masking pattern is fixed once during data prep. RoBERTa re-masks every epoch. Effectively gives many more “different” training examples from the same source data. |
| Recipe change 3 | More data. BERT was “vastly undertrained” per the lecturer. RoBERTa scaled up the pretraining data significantly in volume and diversity. Benchmark performance improved correspondingly. |
| Result | Same model size, better quality on benchmarks. |

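
The difference between fixed-once and per-epoch masking can be sketched in a few lines. This is a simplified illustration, not either model's preprocessing code: the function names are hypothetical, tokens are whitespace-split words, and every selected token is replaced with `[MASK]` (BERT's actual scheme also leaves some selected tokens unchanged or swaps in random ones).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

def static_masking(tokens, num_epochs, seed=0):
    """Fixed-once style: mask during data prep, reuse the same pattern every epoch."""
    masked_once = mask_tokens(tokens, random.Random(seed))
    return [list(masked_once) for _ in range(num_epochs)]

def dynamic_masking(tokens, num_epochs, seed=0):
    """Per-epoch style: draw a fresh masking pattern each epoch."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(num_epochs)]

tokens = ("the quick brown fox jumps over the lazy dog " * 5).split()
static_views = static_masking(tokens, num_epochs=4)
dynamic_views = dynamic_masking(tokens, num_epochs=4)

# Count how many distinct masked versions of the sentence the model would see.
print("distinct static patterns: ", len({tuple(v) for v in static_views}))
print("distinct dynamic patterns:", len({tuple(v) for v in dynamic_views}))
```

The static version shows the model one masked view of the sequence no matter how many epochs run, while the dynamic version yields a new view per epoch from the same source text.
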

| | DistilBERT | RoBERTa |
|---|---|---|
| Limitation addressed | Latency and size | Pretraining quality |
| Architectural change | Half the layers | None |
| Training change | Distillation against BERT teacher | Drop NSP, dynamic masking, more data |
| Trade-off | Smaller and faster, slight quality drop | Same size, better quality |
| Paper length | ~4 pages | Longer; more empirical comparisons |
| Use when… | Latency or inference cost is the binding constraint | Quality is the binding constraint at the same size |

DistilRoBERTa exists in the wider ecosystem as a distilled version of RoBERTa. It combines RoBERTa’s training improvements with DistilBERT’s compression approach. Use it when both quality and latency matter.

| Phrase | What it means |
|---|---|
| DistilBERT | Compressed (6-layer) variant of BERT trained via distillation |
| RoBERTa | Same architecture as BERT, better pretraining recipe |
| DistilRoBERTa | Compressed variant of RoBERTa; combines both improvements |
| “Distilled from…” | Smaller model trained to mimic the named larger teacher; pattern applies far beyond BERT |
| “No NSP” | Recipe choice; usually paired with RoBERTa or RoBERTa-style pretraining |
| “Dynamic masking” | RoBERTa’s per-epoch re-masking; sometimes called online masking |


| Pitfall | Reality |
|---|---|
| DistilBERT and RoBERTa are competitors | No, they address different limitations and stack (DistilRoBERTa exists). |
| Distillation is “just a smaller copy” | No. Distillation requires the teacher’s full output distribution as the training signal. Without that, the smaller model is just a smaller model trained from scratch. |
| RoBERTa changed the architecture | No. Same as BERT. Only the pretraining recipe is different. |
| The BERT paper’s NSP defense was empirical proof | No. It was an assumption that turned out to be wrong on closer examination. RoBERTa tested the assumption and found it did not hold. Worth noticing when you read paper design rationales: assumptions and tested results are different things. |

  • Knowledge distillation: training a smaller “student” model to match the output distribution of a larger “teacher” model, instead of training the student on the original task’s hard labels.
  • Soft targets: the full probability distribution over output classes that a trained model produces. The richer signal that distillation uses.
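  • Hard label: a single class assigned to an input (the ground-truth annotation, or the argmax of a model’s output), with no information about the other classes. The poorer signal that supervised learning typically uses.
  • KL divergence: the loss function used in distillation. Measures how far the student’s distribution is from the teacher’s. When the target is a one-hot hard label, minimizing it is the same as minimizing standard cross-entropy.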
  • DistilBERT: half-layer-count student of BERT, trained via distillation. Smaller and faster with comparable performance. The paper is ~4 pages, with outsized influence.
  • RoBERTa: same architecture as BERT with a better pretraining recipe (no NSP, dynamic masking, more data). Same size, better quality.
  • Dynamic masking: RoBERTa’s per-epoch re-masking strategy, vs BERT’s fixed-once approach.
  • DistilRoBERTa: distilled version of RoBERTa; combines compression and recipe improvements.

DistilBERT is BERT compressed via distillation.
RoBERTa is BERT trained better.
Same family, different problems.