DistilBERT and RoBERTa: cheatsheet

The one idea that matters

BERT had three limitations. Two widely used derivatives each target one of them:

  Latency / size limit         →  DistilBERT  (compress via distillation)
  Pretraining-complexity limit →  RoBERTa     (drop NSP, more data, dynamic mask)

Same architecture family, different problems.

BERT’s three limitations

Limitation	Detail	Addressed by
Context length	512 tokens originally	Attention efficiency tricks (L2.3)
Latency and size	110M parameters at BERT-base scale	DistilBERT (this lesson)
Pretraining complexity	MLM + NSP without empirical justification for NSP	RoBERTa (this lesson)

Knowledge distillation, the concept

Teacher (large, well-trained)  →  produces full output distribution
Student (smaller)              →  trained to match teacher's distribution

Loss: KL divergence between student and teacher distributions.
Cross-entropy is a special case (when target is a hard label).
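A minimal sketch of the loss described above, in plain Python so it runs without any ML library. The logits are made-up numbers and the helper names (`softmax`, `kl_divergence`, `cross_entropy`) are illustrative, not taken from any paper or library. It checks the "special case" claim: with a one-hot target, KL divergence and cross-entropy give the same value.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted):
    # KL(target || predicted); terms with target probability 0 contribute 0.
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Hypothetical logits for a 3-class problem.
teacher = softmax([4.0, 2.5, 0.5])
student = softmax([3.0, 1.0, 1.0])

# Distillation signal: how far the student is from the teacher's distribution.
soft_loss = kl_divergence(teacher, student)

# Hard label for class 0: a one-hot target has zero entropy,
# so KL divergence and cross-entropy coincide.
hard = [1.0, 0.0, 0.0]
hard_kl = kl_divergence(hard, student)
hard_ce = cross_entropy(hard, student)
```

In a real training loop the same quantity would be computed over batches by a framework loss function; the arithmetic is unchanged.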

Why “soft targets”: Hinton’s framing is that “the soft targets contain almost all the knowledge.” A model’s full output distribution carries information about how each input relates to every class, not just the chosen one. That richness is the training signal a hard label throws away.
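To make the soft-versus-hard contrast concrete, here is a small sketch with invented teacher logits for three hypothetical classes. The `temperature` parameter is an assumption brought in from Hinton et al.'s distillation setup; these notes do not mention it. Dividing logits by a temperature above 1 flattens the distribution, so the relative ranking of the non-chosen classes becomes easier to see.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature flattens the distribution; 1.0 is the plain softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "truck"]      # hypothetical label set
teacher_logits = [5.0, 3.5, -2.0]      # made-up teacher output for one input

soft = softmax(teacher_logits)                       # soft targets
softer = softmax(teacher_logits, temperature=2.0)    # flattened soft targets

# The hard label keeps only the winning class and discards the rest.
best = max(range(len(classes)), key=lambda i: soft[i])
hard = [1.0 if i == best else 0.0 for i in range(len(classes))]
```

The soft targets say "cat, but far closer to dog than to truck"; the hard label says only "cat".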

DistilBERT, in detail

Property	Detail
Architectural change	Halve the layer count (BERT-base 12 layers → DistilBERT 6)
Training change	Distillation: train student to match BERT teacher’s output distribution, not original task hard labels
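Loss	KL divergence against the teacher’s soft targets, combined with the standard MLM loss (the paper also adds a cosine loss that aligns student and teacher hidden states)
Result	About 40% fewer parameters (~66M vs ~110M; the embedding table is not halved), about 60% faster inference, and about 97% of BERT’s GLUE performance, as reported in the paper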
Paper	~4 pages, outsized influence
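A back-of-the-envelope count helps relate the layer halving to the resulting model size. The sketch below assumes the usual BERT-base dimensions (hidden size 768, feed-forward size 3072, vocabulary 30522, 512 positions), none of which are stated in these notes, and it ignores small differences such as the pooler head, so treat the totals as approximate.

```python
def transformer_layer_params(hidden=768, ffn=3072):
    # Q, K, V and output projections, each with weights and a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms

def embedding_params(vocab=30522, max_positions=512, segments=2, hidden=768):
    # Token, position and segment embeddings plus one LayerNorm.
    return (vocab + max_positions + segments) * hidden + 2 * hidden

def encoder_params(num_layers):
    return embedding_params() + num_layers * transformer_layer_params()

teacher = encoder_params(12)   # about 109M
student = encoder_params(6)    # about 66M
ratio = student / teacher      # about 0.61
```

Under these assumptions the embedding table (roughly 24M parameters) is untouched by removing layers, which is why a 6-layer student ends up near 60% of the teacher's size rather than 50%.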

RoBERTa, in detail

Property	Detail
Architectural change	None (same as BERT)
Recipe change 1	Drop NSP entirely. Lecturer’s framing: removing NSP led to “no decrease in performance almost.” Original BERT had assumed NSP was helping; the empirical evidence said otherwise.
Recipe change 2	Dynamic masking. In original BERT, the masking pattern is fixed once during data prep, so the model sees the same masked positions every epoch. RoBERTa generates a fresh mask each time a sequence is fed to the model. This effectively yields many more “different” training examples from the same source data.
Recipe change 3	More data, trained longer. BERT was “vastly undertrained” per the lecturer. RoBERTa scaled up the pretraining text roughly tenfold (about 160GB vs BERT’s 16GB) with more diverse sources, and trained with larger batches for longer. Benchmark performance improved correspondingly.
Result	Same model size, better quality on benchmarks.
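The difference between static and dynamic masking can be sketched in a few lines. This is a simplified illustration, not RoBERTa's actual data pipeline: `mask_tokens` is a hypothetical helper, the sentence is made up, and selected tokens are always replaced with `[MASK]` (the real recipe sometimes substitutes a random token or leaves the token unchanged).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    # Pick roughly mask_prob of the positions and replace them with [MASK].
    n = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = ("the quick brown fox jumps over the lazy dog "
          "near the old river bank today").split()
num_epochs = 10

# Static masking: mask once during data prep, reuse the same copy every epoch.
static_example = mask_tokens(tokens, random.Random(0))
static_epochs = [static_example for _ in range(num_epochs)]

# Dynamic masking: draw a new mask every time the sequence is served.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(num_epochs)]

distinct_static = len({tuple(e) for e in static_epochs})
distinct_dynamic = len({tuple(e) for e in dynamic_epochs})
```

With static masking the model only ever sees one masked version of this sentence; with dynamic masking it sees several, at no extra data cost.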

DistilBERT vs RoBERTa, side by side

	DistilBERT	RoBERTa
Limitation addressed	Latency and size	Pretraining quality
Architectural change	Half the layers	None
Training change	Distillation against BERT teacher	Drop NSP, dynamic masking, more data
Trade-off	Smaller and faster, slight quality drop	Same size, better quality
Paper length	~4 pages	Longer; more empirical comparisons
Use when…	Latency or inference cost is the binding constraint	Quality is the binding constraint at the same size

They stack

DistilRoBERTa exists in the wider ecosystem as a distilled version of RoBERTa. It combines RoBERTa’s training improvements with DistilBERT’s compression approach. Use it when both quality and latency matter.
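The "Use when…" guidance and the stacking point can be written down as a small lookup. This is only an illustrative sketch: `pick_checkpoint` is a hypothetical helper, and the strings are checkpoint names as commonly published on the Hugging Face Hub, which is an assumption rather than something these notes specify.

```python
# Keys are (latency_is_binding, quality_is_binding).
CHECKPOINTS = {
    (False, False): "bert-base-uncased",       # baseline BERT
    (True, False): "distilbert-base-uncased",  # compressed via distillation
    (False, True): "roberta-base",             # better pretraining recipe
    (True, True): "distilroberta-base",        # both improvements stacked
}

def pick_checkpoint(latency_is_binding, quality_is_binding):
    # Map the two constraints from the comparison table to a checkpoint name.
    return CHECKPOINTS[(bool(latency_is_binding), bool(quality_is_binding))]

choice = pick_checkpoint(latency_is_binding=True, quality_is_binding=True)
```

Real model selection should also weigh task-specific benchmarks, but the lookup mirrors the decision rule in the table.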

What you see in the wild

Phrase	What it means
DistilBERT	Compressed (6-layer) variant of BERT trained via distillation
RoBERTa	Same architecture as BERT, better pretraining recipe
DistilRoBERTa	Compressed variant of RoBERTa; combines both improvements
“Distilled from…”	Smaller model trained to mimic the named larger teacher; pattern applies far beyond BERT
“No NSP”	Recipe choice; usually paired with RoBERTa or RoBERTa-style pretraining
“Dynamic masking”	RoBERTa’s per-epoch re-masking; sometimes called online masking

Pitfalls to dodge

Pitfall	Reality
DistilBERT and RoBERTa are competitors	No, they address different limitations and stack (DistilRoBERTa exists).
Distillation is “just a smaller copy”	No. Distillation requires the teacher’s full output distribution as the training signal. Without that, the smaller model is just a smaller model trained from scratch.
RoBERTa changed the architecture	No. Same as BERT. Only the pretraining recipe is different.
The BERT paper’s NSP defense was empirical proof	No. It was an assumption that turned out to be wrong on closer examination. RoBERTa tested the assumption and found it did not hold. Worth noticing when you read paper design rationales: assumptions and tested results are different things.

Glossary

Knowledge distillation: training a smaller “student” model to match the output distribution of a larger “teacher” model, instead of training the student on the original task’s hard labels.
Soft targets: the full probability distribution over output classes that a trained model produces. The richer signal that distillation uses.
Hard label: a single class treated as the only correct answer (a one-hot target), whether it comes from the dataset’s annotation or from taking the most likely class of a model’s output. The poorer signal that supervised learning typically uses.
KL divergence: the loss function used in distillation. Measures how far the student’s distribution is from the teacher’s. Reduces to standard cross-entropy when the target is a one-hot hard label.
DistilBERT: half-layer-count student of BERT, trained via distillation. Smaller and faster with comparable performance. The paper is ~4 pages and has had outsized influence.
RoBERTa: same architecture as BERT with a better pretraining recipe (no NSP, dynamic masking, more data). Same size, better quality.
Dynamic masking: RoBERTa’s per-epoch re-masking strategy, vs BERT’s fixed-once approach.
DistilRoBERTa: distilled version of RoBERTa; combines compression and recipe improvements.

DistilBERT is BERT compressed via distillation.
RoBERTa is BERT trained better.
Same family, different problems.