BERT, part two: pretraining objectives and the train-then-fine-tune workflow
What you’ll learn
This is lesson 9 of Phase 2, How models think: the transformer architecture, in Track 5 (AI Foundations). The previous lesson covered BERT’s architecture (encoder-only, bidirectional, structural tokens, three additive embeddings). This lesson covers what BERT was trained to do. Bidirectional self-attention means every token can see every other token, including the ones to its right, so next-token prediction becomes trivial: the model can simply look at the answer. BERT therefore uses masked language modeling (MLM) and next sentence prediction (NSP) as its pretraining objectives, followed by a second stage, fine-tuning, that adapts the pretrained encoder to a specific labeled task. The lesson walks through both objectives in detail (including why MLM uses an 80/10/10 mix), the two-stage workflow, and the two common fine-tuning patterns.
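As a preview of the 80/10/10 mix, here is a minimal sketch of MLM input corruption. It is an illustration under stated assumptions, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the use of string tokens instead of token ids are all hypothetical. The 15% selection rate comes from the original BERT paper and is not stated in this lesson.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # structural tokens are never selected for prediction


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where no prediction is made.

    select_prob=0.15 is an assumption taken from the original BERT paper.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # position not selected: no loss is computed here
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


# Hypothetical usage with a toy vocabulary and a fixed seed.
toy_vocab = ["cat", "dog", "sat", "mat", "the", "on"]
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Because a selected position may show `[MASK]`, a random token, or the original token, the model cannot tell from the input alone which positions it will be scored on.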
Where this fits
This is lesson 9 of Phase 2, How models think: the transformer architecture. BERT is a single mental object split across two consecutive lessons; this is the second one. The previous lesson (BERT, part one: the bidirectional encoder and its structural tokens) covered the architecture. The next lesson, BERT derivatives: DistilBERT and RoBERTa, closes Phase 2 by showing how two follow-up papers compressed BERT (DistilBERT) and improved its training recipe (RoBERTa).
Before you start
Prerequisites: the BERT architecture lesson is required. We assume you understand what bidirectional self-attention means, what the structural tokens (CLS, SEP) do, and how the three additive embeddings shape the input.
By the end, you’ll be able to
- Explain why bidirectionality forced BERT to use pretraining objectives other than next-token prediction
- Walk through MLM with its 80/10/10 masking mix and explain why the mix is not arbitrary
- Walk through NSP and the role of the CLS-head classifier on top of the bidirectional encoder
- Describe the two-stage train-then-fine-tune workflow and pick the right fine-tuning head (CLS for whole-input classification, per-token for span detection and named-entity recognition)
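The difference between the two fine-tuning heads can be sketched in a few lines. This is an illustration under stated assumptions rather than a real BERT implementation: `hidden_states` stands in for the encoder output (one vector per token, with the CLS vector at position 0), and the helper names `linear`, `cls_head`, and `token_head` are hypothetical. Plain Python lists are used so the sketch has no dependencies.

```python
def linear(vec, weight, bias):
    """Apply a linear layer; weight has one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def cls_head(hidden_states, weight, bias):
    """Whole-input classification: read only the CLS vector (position 0)."""
    return linear(hidden_states[0], weight, bias)


def token_head(hidden_states, weight, bias):
    """Per-token classification: apply the same layer at every position."""
    return [linear(h, weight, bias) for h in hidden_states]


# Hypothetical encoder output: 3 tokens, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 2.0], [3.0, 4.0]]  # 2 output classes
bias = [0.0, 1.0]

print(cls_head(hidden_states, weight, bias))    # one logit vector for the whole input
print(token_head(hidden_states, weight, bias))  # one logit vector per token
```

The CLS head returns a single vector of class logits, which fits tasks such as sentiment classification. The per-token head returns one vector of logits per position, which fits span detection and named-entity recognition.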
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 12 minutes (an exercise matching fine-tuning patterns to five task scenarios, plus a walked training-loop trace through MLM and a sentiment fine-tune on the same input)
- Difficulty: standard