DistilBERT and RoBERTa: improving on BERT

Most BERT-family models in production today are not BERT. They are DistilBERT or RoBERTa, and each was a one-paper response to a specific limitation in the original.

The previous lesson covered BERT in detail. The lecturer’s parting take is that BERT is widely used in industry for sentiment and other classification tasks, and limited in three concrete ways. This lesson covers two of the most influential follow-up papers and the limitations they addressed.

The two derivatives are DistilBERT and RoBERTa. DistilBERT addresses BERT’s size and latency: 110 million parameters at the BERT-base scale is “quite a lot,” and inference is slow enough to be a real cost in production. The fix is knowledge distillation: train a smaller “student” model to mimic the output distribution of a larger “teacher” model. RoBERTa addresses the pretraining complexity: BERT had two pretraining objectives (MLM and NSP), and the field never had clean evidence that NSP was actually pulling its weight. The fix turned out to be simple: drop NSP and train MLM longer on more data. The result is better than BERT.

Both are short papers (DistilBERT is famously four pages) with outsized influence.

BERT’s three limitations

The lecturer names three concrete limitations of the original BERT. Each derivative responds to one of them; the third (context length) gets addressed by other work that the lecture mentions in passing but does not develop.

Context length. The original BERT had a maximum sequence length of 512 tokens. That was an early-paper limitation; the techniques in lesson 2.3 (sliding window attention and other attention efficiency tricks) became the way the field worked around it. Not addressed by either DistilBERT or RoBERTa specifically; flagged here for completeness.

Latency and size. BERT-base has 110 million parameters. Inference is correspondingly slow and memory-hungry. For production deployments at scale (millions of inference calls per day), the cost is real. DistilBERT addresses this directly.

Pretraining complexity. BERT was pretrained on two objectives simultaneously (MLM and NSP). The original paper included an ablation suggesting that NSP helped, but that result had not been re-examined under a stronger training setup. RoBERTa addresses this: it asks whether NSP is actually load-bearing, and the answer turns out to be no.

Knowledge distillation: the concept

Before walking through DistilBERT specifically, the lecturer takes a brief detour through what distillation is. The framing comes from a Hinton, Vinyals, and Dean lecture, which the lecturer treats as the conceptual root of distillation.

The core insight is captured in a quote the lecturer flashes: “the soft targets contain almost all the knowledge.”

Translate. A trained model takes an input and outputs a probability for each possible class. The “hard label” is just the top class (e.g., “positive”). The soft targets are the full set of probabilities. For a sentiment model: 70% positive, 25% neutral, 5% negative.

Hinton’s argument is that the soft set carries facts the hard label throws away. The model gave 25% to neutral and 5% to negative. That tells a learner how the input relates to each class, not just which one won. The hard label drops all of that.
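The difference between the hard label and the soft targets can be made concrete with a minimal sketch. The logits below are hypothetical, chosen only so the probabilities land on the 70/25/5 split used above; the class names and the `softmax` helper are illustrative and not from the lecture.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

classes = ["positive", "neutral", "negative"]

# Hypothetical teacher logits, picked to reproduce the 70/25/5 example.
teacher_logits = [math.log(0.70), math.log(0.25), math.log(0.05)]

# Soft targets: the full probability distribution over classes.
soft_targets = softmax(teacher_logits)

# Hard label: only the winning class survives.
hard_label = classes[soft_targets.index(max(soft_targets))]

print("hard label:", hard_label)
print("soft targets:", {c: round(p, 2) for c, p in zip(classes, soft_targets)})
```

The hard label keeps one word. The soft targets keep three numbers, including how far behind the losing classes were.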

Distillation uses this idea to shrink models. You have a large, well-trained “teacher.” You want a smaller “student” that acts like it. Train the student to match the teacher’s full output for every input, not the task’s hard labels. The student learns from the teacher’s full picture, not just from the right answer.

The loss is the KL divergence between the teacher’s and the student’s output distributions. There is a clean detail. When the target is a hard label (one class at 1.0, the rest at 0), the target’s own entropy is zero, so KL divergence reduces to standard cross-entropy. So distillation does not replace supervised learning. It generalizes the loss with a richer signal.
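The relationship between KL divergence and cross-entropy can be checked numerically. The sketch below assumes the common convention of measuring KL from the teacher (target) to the student (prediction), and it leaves out the temperature scaling that practical distillation setups often add. The student and teacher probabilities are made-up values for illustration.

```python
import math

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def entropy(dist):
    """H(dist) = -sum_i dist_i * log(dist_i); zero for a one-hot distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def kl_divergence(target, predicted):
    """KL(target || predicted) = sum_i target_i * log(target_i / predicted_i)."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

student = [0.60, 0.30, 0.10]       # hypothetical student output
teacher_soft = [0.70, 0.25, 0.05]  # soft targets from the teacher
hard_label = [1.0, 0.0, 0.0]       # one-hot target for "positive"

kl_soft = kl_divergence(teacher_soft, student)
kl_hard = kl_divergence(hard_label, student)
ce_hard = cross_entropy(hard_label, student)

# With a one-hot target, KL and cross-entropy are the same number.
print("KL vs hard label:", kl_hard)
print("CE vs hard label:", ce_hard)
# With soft targets, KL = cross-entropy minus the teacher's entropy.
print("KL vs soft targets:", kl_soft)
```

In general KL equals cross-entropy minus the target’s entropy. A one-hot target has zero entropy, which is why the two losses coincide in the hard-label case.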

DistilBERT: the application

DistilBERT applies distillation to BERT. The move is simple. Halve the number of transformer layers. The lecturer says “reduce by two,” meaning by a factor of two: the DistilBERT paper drops the layer count to 6 from BERT-base’s 12. Then train the smaller student to match the original BERT’s output.

The lecturer’s framing: “if you reduce the number of layers by two you have a lot of gains and almost the same performance.” The student keeps almost all of BERT’s quality. The DistilBERT paper reports the model at about 66 million parameters versus BERT-base’s roughly 110 million, about 40% smaller (the layer count halves but other parameters like embeddings stay, so the overall reduction is less than half). Inference is faster as well; the paper reports it as about 60% faster.
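The reason halving the layers removes only about 40% of the parameters can be worked through with rough arithmetic. The sketch below assumes the standard BERT-base configuration (hidden size 768, feed-forward size 3072, vocabulary of 30,522 tokens, 512 positions); those values are not from the lecture, and the count ignores small components such as the pooler and segment embeddings.

```python
# Assumed BERT-base configuration (standard published values, not from the lecture).
HIDDEN = 768
FFN = 3072
VOCAB = 30522
MAX_POSITIONS = 512

def layer_params(hidden=HIDDEN, ffn=FFN):
    """Weights and biases in one transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V, and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)              # two LayerNorms, scale and bias each
    return attention + feed_forward + layer_norms

def encoder_params(num_layers):
    """Token and position embeddings plus the stacked layers (a rough count)."""
    embeddings = VOCAB * HIDDEN + MAX_POSITIONS * HIDDEN
    return embeddings + num_layers * layer_params()

teacher = encoder_params(12)  # BERT-base depth
student = encoder_params(6)   # DistilBERT depth
reduction = 1 - student / teacher

print(f"12 layers: {teacher / 1e6:.1f}M parameters")
print(f" 6 layers: {student / 1e6:.1f}M parameters")
print(f"reduction: {reduction:.0%}")
```

Under these assumptions the totals come out at roughly 109 million and 66 million parameters. The embedding table (about 23 million parameters) is untouched by the layer cut, which is why the overall reduction is closer to 40% than 50%.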

The DistilBERT paper is famously short, around four pages. The mechanism is simple enough that the paper does not need to be long. Distillation was already a known idea. DistilBERT just applied it cleanly to BERT and showed the result.

Load a DistilBERT checkpoint today and you are using this exact lineage. A smaller model trained to mimic a larger BERT. Most of the larger model’s skill, kept through the soft-target signal.

RoBERTa: the better-trained recipe

RoBERTa is a different kind of improvement. Architecturally, it is essentially the same as BERT (encoder-only, same block design, same input embedding shape). The changes are all in the pretraining recipe.

The headline change: drop NSP entirely. The RoBERTa authors studied whether the next-sentence-prediction objective was actually contributing to pretraining quality. The lecturer’s framing: removing NSP led to “no decrease in performance almost.” The original BERT paper had reported NSP as helpful; RoBERTa’s more careful comparison said otherwise. RoBERTa just dropped it.

Two further changes complete the recipe.

Dynamic masking. In the original BERT, the masking pattern for a given input sentence is decided once during data preparation. Every epoch sees the same masking. RoBERTa changed this: re-mask the same input differently on each epoch. Straightforward to implement; effectively gives the model many more “different” training examples from the same source data.
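The contrast between static and dynamic masking can be sketched in a few lines. This is a simplified illustration, not the RoBERTa implementation: it works on whitespace-split words rather than subword tokens, always substitutes `[MASK]` (skipping the 80/10/10 replacement rule from the BERT paper), and the `mask_tokens` helper and example sentence are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace a random ~15% of positions with [MASK] (simplified rule)."""
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = ("the quick brown fox jumps over the lazy dog while "
          "the cat sleeps on a warm sunny windowsill today").split()

rng = random.Random(0)  # seeded so the run is repeatable
NUM_EPOCHS = 10

# Static masking: decide the pattern once during data preparation, then reuse it.
static_example = mask_tokens(tokens, rng)
static_epochs = [static_example for _ in range(NUM_EPOCHS)]

# Dynamic masking: draw a fresh pattern every time the sequence is served.
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(NUM_EPOCHS)]

distinct_static = len({tuple(e) for e in static_epochs})
distinct_dynamic = len({tuple(e) for e in dynamic_epochs})

print("distinct static patterns:", distinct_static)
print("distinct dynamic patterns:", distinct_dynamic)
print(" ".join(dynamic_epochs[0]))
```

The static pipeline shows the model one masked version of the sentence no matter how many epochs it trains. The dynamic pipeline turns the same sentence into several different prediction problems.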

More data. The RoBERTa authors observed that BERT was undertrained. They scaled up the pretraining data significantly in volume and diversity (roughly 160GB of text versus about 16GB for BERT) and trained longer with larger batches. The benchmark performance improved correspondingly.

The summary, in the lecturer’s own framing: RoBERTa shows that the original BERT recipe left meaningful performance on the table. NSP was not helping; static masking was wasteful; the data scale was insufficient. Dropping NSP, masking dynamically, and training on more data produced a meaningfully better model from essentially the same architecture.

The two derivatives, side by side

The two papers respond to different problems and use different tools.

	DistilBERT	RoBERTa
Limitation addressed	Latency and size	Pretraining quality (NSP usefulness, masking strategy, data scale)
Architectural change	Half the layers (6 vs 12)	None
Training change	Distillation against the original BERT (KL divergence loss)	Drop NSP; dynamic masking; much more data
Trade-off	Smaller and faster, slight quality drop	Same size, better quality
Paper length	~4 pages	Longer; more empirical comparisons

If you need a smaller, faster encoder model, DistilBERT (or one of its descendants) is the family to reach for. If you need the best-trained BERT-shaped encoder, RoBERTa (or one of its descendants) is the family. Both families ship widely; which one you pick depends on the constraint that matters most for your application.

Why this matters when you use AI

Two consequences are worth holding onto when you read AI tooling docs or model cards.

“DistilBERT” and “RoBERTa” are the most common BERT-family names you will see in production. The original BERT itself is less commonly used in new projects than its descendants. Knowing the lineage (which derivative addressed which limitation, what its pretraining looked like) helps you read those projects’ documentation and reason about why a particular model was chosen.
Distillation generalizes well past BERT. The teacher-student-soft-targets recipe applies to any model where you have a strong teacher and want a smaller student. Distillation is a common production pattern for shipping smaller variants of larger models. The conceptual framing here is durable; the BERT-specific application is just one early example.

Common pitfalls

A few mistakes are common enough to be worth naming.

Thinking DistilBERT and RoBERTa are competitors. They address different limitations and can stack. (As an example outside the lecture itself: DistilRoBERTa exists in the wider ecosystem as a distilled version of RoBERTa that combines RoBERTa’s training improvements with DistilBERT’s compression approach.)

Thinking distillation is just a smaller copy of the teacher. It is not. Distillation requires the teacher’s output distribution as the training signal. Without that, the smaller model is just a smaller model trained from scratch on the original task; the soft-targets advantage is lost.

Thinking RoBERTa changed the architecture. It did not. The architecture is the same as BERT. The changes are all in the pretraining recipe (drop NSP, dynamic masking, more data). When a model card mentions RoBERTa, the architecture is BERT; the training is different.

Thinking the BERT paper’s NSP defense settled the question. It did not. The original paper’s ablation suggested NSP helped; RoBERTa re-ran the comparison under a stronger training setup and found it did not. Worth noticing when you read a paper’s design rationales: a result from one experimental setup and a settled fact are different things.

What you should remember

BERT had three concrete limitations. Context length (512 tokens originally; addressed by attention efficiency tricks from lesson 2.3). Latency and size (110M parameters; addressed by DistilBERT). Pretraining complexity (MLM + NSP; addressed by RoBERTa).
Knowledge distillation trains a smaller student to mimic a larger teacher’s output distribution. The soft targets carry more information than hard labels (Hinton’s framing). KL divergence is the loss; cross-entropy is the special case when the target is a hard label.
DistilBERT halves BERT’s layer count and trains via distillation. About 40% smaller overall (66M parameters vs BERT-base’s 110M) and correspondingly faster; almost the same downstream performance. ~4 pages, outsized influence.
RoBERTa is the same architecture with a better training recipe. Drop NSP (it was not helping). Dynamic masking (re-mask on every epoch instead of once). Much more data (BERT was undertrained). Same model shape, better empirical results.
The lineage matters. When you load DistilBERT or RoBERTa from a model repository, you know what it is and why it exists. Most BERT-family models in production today are derivatives, not the original.

Closing Lecture 2

Lecture 2 walked from the parts of the original 2017 transformer that have changed (position embeddings, normalization, attention efficiency) through the three architectural branches the field built (encoder-decoder with T5, decoder-only as the modern LLM default, encoder-only with BERT) and ended on BERT’s two most influential descendants. The BERT family is now in your working vocabulary. From here the curriculum opens up: the next phases cover why these models cost so much to build, why a base model needs a separate tuning stage before it is actually useful, and how you steer one when you ask it questions.

If you remember one thing

DistilBERT is BERT compressed via distillation.
RoBERTa is BERT trained better.
Same family, different problems.