References: BERT, part two: pretraining objectives and the train-then-fine-tune workflow

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 2, Transformer-based models & tricks):
https://www.youtube.com/watch?v=yT84Y5zCnaA
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the training half of Stanford CME 295 Lecture 2's
BERT section: pretraining objectives (MLM and NSP) and the
train-then-fine-tune workflow. The previous lesson covered BERT's
architecture. The next lesson covers BERT derivatives (DistilBERT for
compression, RoBERTa for training improvements). Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

A short list, chosen for durability.

Topics that build on or sit beside this one.

  • The BERT family (training-recipe variants). The next lesson covers DistilBERT (compression via distillation) and RoBERTa (better training recipe). Other variants worth knowing about by name: ALBERT (parameter-efficient via cross-layer weight sharing), ELECTRA (replaces MLM with replaced-token detection for sample efficiency), DeBERTa (disentangled attention, which represents content and position in separate vectors).

  • Fine-tuning beyond classification. Generative fine-tuning (instruction-tuning of decoder-only models, covered in Phase 4) extends the train-then-fine-tune pattern to generation. The shape is similar (pretrained model + smaller adaptation step + labeled data) even though the head and the loss differ.

  • Parameter-efficient fine-tuning (LoRA, adapters). When fine-tuning the full encoder is expensive or impractical, parameter-efficient techniques add small trainable layers (LoRA matrices, adapter modules) to a frozen encoder. The idea is the same as classic fine-tuning, but the tunable surface is much smaller. Worth reading if you fine-tune large models on a budget.

  • The “post-BERT era” and decoder-only’s rise. Encoder-only models (BERT and family) dominated NLP from 2018 through about 2020. Then GPT-3-style decoder-only models started to dominate the discourse, in part because of in-context learning capabilities the encoder-only models didn’t offer. Both branches still ship; the use cases differ. Background context for Phase 4 on tuning.

  • Where to go next. The next lesson covers BERT derivatives: DistilBERT (compression via knowledge distillation) and RoBERTa (which showed that dropping NSP does not hurt downstream performance and that training MLM longer, on more data, improves it). That lesson closes Phase 2.
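
To make the parameter-efficient fine-tuning item above concrete, here is a minimal sketch of a LoRA-style layer in plain Python. It is an illustration under stated assumptions, not a reference implementation: the helper names (`matvec`, `lora_forward`, `trainable_params`), the toy sizes, and the `alpha / r` scaling are choices made for this example, and real fine-tuning would use a tensor library with gradients. It shows the two points the item makes: the pretrained weights stay frozen, and the trainable surface is much smaller.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen path W @ x plus the scaled low-rank update (alpha / r) * B @ (A @ x)."""
    base = matvec(W, x)               # frozen pretrained weights, never updated
    update = matvec(B, matvec(A, x))  # low-rank path; only A and B would be trained
    scale = alpha / r                 # assumed scaling convention for this sketch
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Return (full fine-tuning count, LoRA count) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4.0  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                                 # zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero, the adapted layer reproduces the frozen layer exactly,
# so fine-tuning starts from the pretrained behavior.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

# One 768 x 768 projection (BERT-base hidden size) at rank 8.
full, lora = trainable_params(768, 768, 8)
print(full, lora)  # 589824 12288
```

For that single 768 x 768 matrix, the rank-8 update trains 12,288 values instead of 589,824, roughly 2% of the full count. The rank is the main knob: a larger `r` buys more capacity at the cost of more trainable parameters.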

The primary papers, in chronological order.

None selected for this lesson. BERT’s training recipe and the train-then-fine-tune pattern are well-established and the relevant discussion has consolidated into the academic literature plus the Hugging Face documentation cycle. Durable references will be added at a future quarterly review if any consolidate.