References: BERT, part two: pretraining objectives and the train-then-fine-tune workflow

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks):
    https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the training half of Stanford CME 295 Lecture 2's
BERT section: pretraining objectives (MLM and NSP) and the
train-then-fine-tune workflow. The previous lesson covered BERT's
architecture. The next lesson covers BERT derivatives (DistilBERT for
compression, RoBERTa for training improvements). Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

Going deeper

A short list, chosen for durability.

“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Devlin et al., 2019. The BERT paper. Section 3 covers the architecture, the pretraining objectives (MLM, NSP), and the fine-tuning procedure; section 4 reports the experiments on the canonical task setups (GLUE, SQuAD). The paper is famously readable for a foundational ML paper.
“RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Liu et al., 2019. The paper that questioned NSP’s contribution and showed that dropping it (plus dynamic masking, larger batches, and longer training on more pretraining data) produces a measurably better model from the same architecture. Worth reading if you want the empirical argument behind why later work treats NSP as optional. Covered in the next lesson alongside DistilBERT.
“DistilBERT, a distilled version of BERT”, Sanh et al., 2019. The distillation paper that makes BERT about 40% smaller while keeping almost the same quality (the paper reports about 97% of BERT’s language-understanding performance). Covered in the next lesson alongside RoBERTa.
Hugging Face fine-tuning tutorial. The de facto reference for fine-tuning BERT-family models in practice. Covers both head shapes (whole-input classification, per-token), with running code examples.
“Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, Reimers & Gurevych, 2019. A different fine-tuning setup that turns BERT into a strong sentence-embedding model for similarity tasks. Worth reading if you build retrieval or semantic-search systems.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material in their dense visual style.

Adjacent topics

Topics that build on or sit beside this one.

The BERT family (training-recipe variants). The next lesson covers DistilBERT (compression via distillation) and RoBERTa (better training recipe). Other variants worth knowing about by name: ALBERT (parameter-efficient via cross-layer weight sharing), ELECTRA (replaces MLM with replaced-token detection for sample efficiency), DeBERTa (disentangled attention, which represents content and relative position with separate vectors).
Fine-tuning beyond classification. Generative fine-tuning (instruction-tuning of decoder-only models, covered in Phase 4) extends the train-then-fine-tune pattern to generation. The shape is similar (pretrained model + smaller adaptation step + labeled data) even though the head and the loss differ.
Parameter-efficient fine-tuning (LoRA, adapters). When fine-tuning the full encoder is expensive or impractical, parameter-efficient techniques add a small number of trainable parameters (low-rank LoRA matrices, adapter modules) to a frozen encoder. The idea is the same as classic fine-tuning, but the tunable surface is much smaller. Worth reading up on if you fine-tune large models on a budget.
The “post-BERT era” and decoder-only’s rise. Encoder-only models (BERT and family) dominated NLP from 2018 through about 2020. Then GPT-3-style decoder-only models started to dominate the discourse, in part because of in-context learning capabilities the encoder-only models didn’t offer. Both branches still ship; the use cases differ. Background context for Phase 4 on tuning.
Where to go next. The next lesson covers BERT derivatives: DistilBERT (compression via knowledge distillation) and RoBERTa (which dropped NSP and showed that training MLM longer on more data improves results). That lesson closes Phase 2.
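
To make the parameter-efficient fine-tuning entry above more concrete, here is a minimal sketch of the LoRA idea in plain Python. It is an illustration under stated assumptions, not any library's implementation: the helper names (`matmul`, `random_matrix`, `lora_forward`), the layer sizes, and the rank are all hypothetical, and no training loop is shown. It only demonstrates that a frozen weight matrix W can be adapted by two small matrices A and B, and that the trainable parameter count is much smaller than the frozen one.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialised matrix (illustrative init, not a tuned one)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute x @ W + scale * (x @ A) @ B.

    W stands in for a frozen pretrained weight; in a real setup only
    A and B would receive gradient updates.
    """
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [[b + scale * u for b, u in zip(rb, ru)] for rb, ru in zip(base, update)]


# Hypothetical sizes, chosen only for illustration.
d_in, d_out, rank = 32, 32, 2
rng = random.Random(0)

W = random_matrix(d_in, d_out, rng)        # frozen weight
A = random_matrix(d_in, rank, rng)         # trainable down-projection
B = [[0.0] * d_out for _ in range(rank)]   # trainable up-projection, zero init

x = random_matrix(1, d_in, rng, scale=1.0)  # one input vector as a 1 x d_in matrix

frozen_params = d_in * d_out
trainable_params = d_in * rank + rank * d_out

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
assert lora_forward(x, W, A, B) == matmul(x, W)
print(frozen_params, trainable_params)  # 1024 128
```

Initialising B to zero means the low-rank update contributes nothing at the start, so fine-tuning begins from the pretrained model's behavior. For real work, a maintained library is a better starting point than a hand-rolled version like this one.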

Original sources

The primary papers, in chronological order.

“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Devlin et al., 2019.
“RoBERTa: A Robustly Optimized BERT Pretraining Approach”, Liu et al., 2019.
“Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, Reimers & Gurevych, 2019.
“DistilBERT, a distilled version of BERT”, Sanh et al., 2019.

Community discussion

None selected for this lesson. BERT’s training recipe and the train-then-fine-tune pattern are well established, and the relevant discussion has consolidated into the academic literature and the Hugging Face documentation. Durable references will be added at a future quarterly review if any emerge.