DistilBERT and RoBERTa, in brief

What you’ll learn

This is lesson 10 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations), and it closes the phase. The previous two lessons covered BERT itself: the architecture (encoder-only, bidirectional, structural tokens, three additive embeddings) and the training (MLM with the 80/10/10 mix, NSP, the train-then-fine-tune workflow). Course materials are at cme295.stanford.edu.

This lesson covers two of the most influential BERT derivatives, each built to address one of its limitations.

DistilBERT addresses the latency-and-size problem through knowledge distillation: a smaller “student” model is trained to mimic the output distribution of a larger “teacher” model. The conceptual root is Hinton’s soft-targets framing, in which the loss is the KL divergence between the teacher and student distributions; ordinary cross-entropy is the special case where the target is a hard label. DistilBERT applies this by halving the layer count and training the student against the original BERT’s output distribution. The result is a model about 40% smaller (roughly 66M parameters vs BERT-base’s 110M; the reduction is less than half because the layer count halves while other parameters, such as the embeddings, stay) with almost the same downstream performance.

RoBERTa addresses the pretraining-complexity problem with three changes: it drops NSP entirely (which turned out not to help, as the lecturer notes), re-masks on every epoch instead of once during data prep, and trains MLM on much more data.

The lesson closes by naming DistilRoBERTa as the natural combination for production systems where both speed and quality matter.
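The size figures above can be checked with back-of-the-envelope arithmetic. The sketch below counts weights and biases in a BERT-style encoder. The configuration values (vocabulary of 30,522, hidden size 768, feed-forward size 3,072, 512 positions) are the commonly published BERT-base settings rather than numbers from this lesson, and the helper name `encoder_params` is hypothetical. It also assumes the student drops the token-type embeddings and the pooler, as the DistilBERT paper describes.

```python
def encoder_params(layers, hidden=768, ffn=3072, vocab=30522,
                   max_pos=512, type_vocab=2, pooler=True):
    """Count weights and biases in a BERT-style encoder (no task head).

    Defaults are the commonly published BERT-base settings (an assumption,
    not taken from the lesson text).
    """
    layer_norm = 2 * hidden  # one scale and one shift vector
    # Token, position, and token-type embeddings share one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V, and output projections (weights + biases), then a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden), then a LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler_params


teacher = encoder_params(layers=12)
# Assumption: the student halves the layers and also drops the token-type
# embeddings and the pooler; the embeddings are otherwise unchanged.
student = encoder_params(layers=6, type_vocab=0, pooler=False)
reduction = 1 - student / teacher

print(f"teacher: {teacher / 1e6:.1f}M parameters")
print(f"student: {student / 1e6:.1f}M parameters")
print(f"reduction: {reduction:.0%}")
```

Under these assumptions the counts come out near 109M and 66M, a reduction of about 39%. Halving the layers removes roughly 42.5M parameters, while the roughly 24M embedding parameters are untouched, which is why half the layers does not mean half the model.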

Where this fits

This is lesson 10 of Phase 2, How models think: the transformer architecture. The previous two lessons split BERT into architecture on one side and pretraining and fine-tuning on the other. This lesson closes Phase 2 by showing how two follow-on papers sharpened the original. Phase 3, How models learn from text: pretraining and scale, is next.

Before you start

Prerequisites: both BERT lessons are required (“BERT, part one: architecture” and “BERT, part two: pretraining and fine-tuning”). We assume you understand what BERT’s MLM and NSP pretraining objectives do, what fine-tuning means, and roughly what a 100-million-parameter encoder costs in compute. If any of that feels unfamiliar, read the BERT lessons first.

By the end, you’ll be able to

Identify the three limitations of the original BERT model the lecturer flags (context length, latency/size, pretraining complexity)
Explain knowledge distillation as a concept (teacher/student, soft targets, KL divergence) and why it produces a smaller-but-comparable student model
Describe what DistilBERT does specifically (halve the layer count, train via distillation against the original BERT) and the empirical result (~40% smaller, around 66M params vs BERT-base’s 110M, almost the same downstream performance)
Walk through RoBERTa’s three changes (drop NSP, dynamic masking, much more data) and what each one revealed about the original BERT recipe
Recognize when to reach for DistilBERT vs RoBERTa vs DistilRoBERTa, depending on whether latency, quality, or both are the binding constraint
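Of RoBERTa’s three changes listed above, dynamic masking is the easiest to picture in code. The following is a minimal sketch, not RoBERTa’s actual data pipeline: it works on whitespace tokens, always substitutes `[MASK]` (ignoring the 80/10/10 mix from the previous lesson), and the helper name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with about mask_prob of positions masked.

    Simplification: every chosen position becomes [MASK]; the 80/10/10
    replacement mix is left out to keep the sketch short.
    """
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


sentence = "the student model learns to mimic the teacher output distribution".split()

# Static masking: mask once during data prep, then reuse that copy every epoch.
static_copy = mask_tokens(sentence, random.Random(0))
static_epochs = [static_copy for _ in range(3)]

# Dynamic masking: draw a fresh mask each time the sentence is served.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(sentence, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model is asked to predict the same hidden positions every epoch. With dynamic masking the hidden positions are redrawn on each pass, so over many epochs the model gets to predict far more of each sentence.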

Time and difficulty

Read time: about 18 minutes
Practice time: about 12 minutes (a teacher/student walk-through showing why soft targets carry more information than hard labels, plus a comparison of the BERT and RoBERTa training recipes)
Difficulty: standard
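As a preview of the teacher/student walk-through in the practice portion, here is a small sketch of why soft targets carry more information than hard labels. The class names and logits are made up for illustration, and the sketch skips refinements such as temperature scaling. It shows two things: the teacher’s distribution ranks the wrong answers, and the KL divergence against a hard label collapses to ordinary cross-entropy.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def kl_divergence(target, predicted):
    """KL(target || predicted) = sum_i target_i * log(target_i / predicted_i)."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


classes = ["cat", "dog", "car"]
teacher_logits = [4.0, 3.0, -2.0]  # hypothetical teacher scores
student_logits = [2.0, 0.5, 0.0]   # hypothetical student scores

soft_target = softmax(teacher_logits)  # teacher's full output distribution
hard_target = [1.0, 0.0, 0.0]          # one-hot label: "cat" and nothing else
student_probs = softmax(student_logits)

# Distillation loss: KL divergence from the teacher's distribution.
soft_loss = kl_divergence(soft_target, student_probs)
# With a one-hot target, KL divergence and cross-entropy are the same number.
hard_kl = kl_divergence(hard_target, student_probs)
hard_ce = cross_entropy(hard_target, student_probs)

for name, p in zip(classes, soft_target):
    print(f"teacher P({name}) = {p:.4f}")
print(f"KL vs soft target: {soft_loss:.4f}")
print(f"KL vs hard label:  {hard_kl:.4f}  (cross-entropy: {hard_ce:.4f})")
```

The hard label says only “cat”. The teacher’s distribution also says that “dog” is a far more plausible mistake than “car”, and that extra ranking is the signal the student learns from.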