
BERT, part two: pretraining objectives and the train-then-fine-tune workflow

This is lesson 9 of Phase 2, How models think: the transformer architecture, in Track 5 (AI Foundations). The previous lesson covered BERT’s architecture (encoder-only, bidirectional, structural tokens, three additive embeddings). This lesson covers what BERT was trained to do. Because bidirectional self-attention lets every token see every other token, including the tokens that follow it, next-token prediction becomes trivial: the model can simply look at the answer. BERT therefore uses masked language modeling (MLM) and next sentence prediction (NSP) as its pretraining objectives, followed by a second stage, fine-tuning, that adapts the pretrained encoder to a specific labeled task. The lesson walks through both objectives in detail (including why MLM uses an 80/10/10 mix), the two-stage workflow, and the two common fine-tuning patterns.
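As a rough preview of the 80/10/10 mix mentioned above, here is a minimal sketch of MLM input corruption. It assumes the recipe from the BERT paper: about 15% of positions are selected for prediction, and of those, 80% are replaced with `[MASK]`, 10% with a random vocabulary token, and 10% are left unchanged. The function name `mlm_corrupt`, the toy vocabulary, and the per-token coin flip used for selection are illustrative assumptions, not the original implementation.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # structural tokens are never selected


def mlm_corrupt(tokens, vocab, select_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at every selected position and
    None elsewhere; the MLM loss is computed only where labels[i] is set.
    Hypothetical helper: selection is an independent coin flip per token,
    which is a simplification of picking 15% of positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # position not selected: no corruption, no loss
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged, but still predict it
    return corrupted, labels


if __name__ == "__main__":
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog"]
    sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
    corrupted, labels = mlm_corrupt(sentence, toy_vocab, seed=0)
    print(corrupted)
    print(labels)
```

Note that selected positions always carry a label, even when the visible token is left unchanged or swapped for a random one, so the model cannot assume that a non-`[MASK]` token is correct.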

BERT is a single mental object split across two consecutive lessons; this is the second. The previous lesson (BERT, part one: the bidirectional encoder and its structural tokens) covered the architecture. The next lesson, BERT derivatives: DistilBERT and RoBERTa, closes Phase 2 by showing how two follow-up papers compressed BERT (DistilBERT) and improved its training recipe (RoBERTa).

Prerequisites: the BERT architecture lesson is required. We assume you understand what bidirectional self-attention means, what the structural tokens (CLS, SEP) do, and how the three additive embeddings shape the input.

  • Explain why bidirectionality forced BERT to use different pretraining objectives than next-token prediction
  • Walk through MLM with its 80/10/10 masking mix and explain why the mix is not arbitrary
  • Walk through NSP and the role of the CLS-head classifier on top of the bidirectional encoder
  • Describe the two-stage train-then-fine-tune workflow and pick the right fine-tuning head (CLS for whole-input classification, per-token for span detection and named-entity recognition)
  • Read time: about 13 minutes
  • Practice time: about 12 minutes (a fine-tuning pattern matching exercise across five task scenarios plus a walked training-loop trace through MLM and a sentiment fine-tune on the same input)
  • Difficulty: standard
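The two fine-tuning heads named in the objectives above differ only in which encoder outputs they read. The sketch below is a shape-level illustration in plain Python, assuming the encoder returns one hidden vector per input token with the CLS position first. The random numbers stand in for real encoder outputs and trained weights, and the names `linear`, `cls_head`, and `token_head` are hypothetical.

```python
import random


def linear(vec, weights):
    """Multiply a hidden vector (length H) by an H x K weight matrix, giving K logits."""
    num_out = len(weights[0])
    return [sum(v * row[k] for v, row in zip(vec, weights)) for k in range(num_out)]


def cls_head(hidden_states, weights):
    """Whole-input classification: read only position 0, the CLS vector."""
    return linear(hidden_states[0], weights)


def token_head(hidden_states, weights):
    """Per-token labeling (e.g. NER): apply the same classifier at every position."""
    return [linear(h, weights) for h in hidden_states]


if __name__ == "__main__":
    rng = random.Random(0)
    seq_len, hidden_size = 6, 8  # toy sizes, far smaller than a real BERT
    # Stand-in for the encoder output: one hidden vector per token.
    hidden_states = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
                     for _ in range(seq_len)]

    w_sentiment = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden_size)]
    w_ner = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(hidden_size)]

    print(len(cls_head(hidden_states, w_sentiment)))  # one logit vector for the whole input
    print(len(token_head(hidden_states, w_ner)))      # one logit vector per token
```

In both cases the pretrained encoder is reused as is; only the small head on top changes with the task.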