Summary: BERT, part two: pre-training objectives and the train-then-fine-tune workflow

Bidirectionality forced new pre-training objectives. Decoder-only models train via next-token prediction, but that objective becomes trivial when the model can see the future. BERT was instead trained on masked language modeling (MLM) and next sentence prediction (NSP). A second stage, fine-tuning, then adapts the pre-trained encoder to a specific labeled task by attaching a small task-specific head (typically a linear layer) on top.

This summary is the scan-it-in-five-minutes version. The full lesson walks through both pre-training objectives in detail, the train-then-fine-tune workflow, and the two common fine-tuning patterns. The previous lesson covered BERT’s architecture; this lesson covers what it was trained to do.

Core ideas

Why new objectives at all. Bidirectional self-attention means every token can see every other token, including the future. Next-token prediction collapses because the answer is already visible in the input. BERT needed objectives that work with bidirectionality.
MLM (Masked Language Model). Randomly select a fraction of tokens (BERT paper: 15%); replace them per the 80/10/10 mix: 80% with [MASK], 10% with a random token, 10% unchanged. Train the model to predict the original token in all three cases.
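Why the 80/10/10 mix is not arbitrary. If we always replaced selected tokens with [MASK], the model would learn that [MASK] positions are the only ones where it needs to predict. Fine-tuning and inference inputs contain no [MASK] tokens, so that habit would not transfer. The random-token and unchanged cases force the model to maintain useful representations for every token, because any token might be one it has to predict.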
NSP (Next Sentence Prediction). Pair two sentences, A and B; in 50% of pairs B really follows A, and in the other 50% B is a random sentence from the corpus. A small classifier head on top of the CLS token’s output embedding predicts whether B genuinely follows A. This teaches sentence-level relationships.
Pre-training and fine-tuning, two stages. Pre-training runs MLM and NSP simultaneously on a large unlabeled corpus once (expensive). Fine-tuning adapts the pre-trained encoder to a specific labeled task by adding a small head, typically training end-to-end on relatively little labeled data.
The big practical win. Pre-training uses unlabeled data (free at scale); fine-tuning typically needs only hundreds to thousands of labeled examples because the pre-trained representations are already useful.
Two common fine-tuning shapes cover most uses. Whole-input classification: head on CLS output (sentiment, intent, document classification). Per-token: heads on every token’s output (named-entity recognition, question-answering with start/end span detection).
The head decides the task. Same encoder, same input, different head, different task. Whether you get a single label or per-token labels depends entirely on which output of the encoder you read.
Walked example, training side. During pre-training, “this teddy [MASK] is so cute.” asks the model to produce “bear” from bidirectional context. During fine-tuning for sentiment, the input is plain text again ([CLS] this teddy bear is so cute . [SEP]), the encoder produces output embeddings, and a linear classifier on the CLS output emits the label.
Pitfall: thinking [MASK] shows up at inference. In fine-tuned downstream tasks it does not; [MASK] is a pre-training artifact. The 80/10/10 mix exists precisely so the model handles inputs without [MASK] tokens.
Pitfall: assuming all encoder-only models use NSP. RoBERTa (next lesson) dropped NSP and trained with MLM alone on more data, and showed that this does not hurt downstream performance.
Pitfall: confusing pre-training cost with fine-tuning cost. Pre-training is the expensive one-time stage; fine-tuning is cheap and happens many times.
Pitfall: picking the wrong head. CLS-head for whole-input tasks; per-token heads for token-level tasks. Get the rule of thumb right and the rest follows.
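
The MLM corruption step and the NSP pairing step described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual data pipeline: the function names mask_tokens and make_nsp_pair are hypothetical, the tokens are whole words rather than WordPiece subwords, and the vocabulary is a toy list. Two simplifications are assumed: each token is selected independently with probability 0.15 (rather than choosing exactly 15% of positions), and the random "not next" sentence is drawn from the same small list of sentences rather than from a large corpus.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels). labels[i] holds the original token at
    selected positions and None everywhere else (no prediction target)."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:                 # special tokens are never selected
            continue
        if rng.random() >= select_prob:    # not selected: leave as-is, no target
            continue
        labels[i] = tok                    # selected: predict the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the token unchanged (still a prediction target)
    return corrupted, labels

def make_nsp_pair(sentences, index, rng):
    """Build one NSP example starting at sentences[index]; label 1 means
    B really follows A. Assumes index + 1 is valid and len(sentences) >= 3."""
    a = sentences[index]
    if rng.random() < 0.5:
        b, label = sentences[index + 1], 1
    else:
        # Simplification: pick any sentence other than A and the true next one.
        candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
        b, label = rng.choice(candidates), 0
    return ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"], label

rng = random.Random(0)                     # seeded only to make runs repeatable
toy_vocab = ["bear", "cat", "dog", "cute", "so", "is"]
sentence = ["[CLS]", "this", "teddy", "bear", "is", "so", "cute", ".", "[SEP]"]
print(mask_tokens(sentence, toy_vocab, rng))

toy_sentences = [["this", "teddy", "bear"], ["is", "so", "cute"], ["it", "rained"]]
print(make_nsp_pair(toy_sentences, 0, rng))
```

Note that labels are recorded for all three corruption cases, which is what makes the model predict the original token even when the input position looks untouched.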

What changes for you

When a model card says “BERT” or “BERT-like,” you now know what training produced it (MLM + NSP) and what the typical fine-tuning recipe is (attach a small head, train on a small labeled dataset). When a fine-tuning project lands on your desk, you know where the head goes (CLS for whole-input, per-token outputs for token-level tasks). The next lesson covers two of the most influential BERT derivatives (DistilBERT for compression via distillation, RoBERTa for a better training recipe) and closes Phase 2.
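
The "where the head goes" rule can be sketched without any framework. In the toy example below, random numbers stand in for the encoder's output embeddings (no real BERT is involved), and the helper names linear, init_linear, classify_sequence, and classify_tokens are hypothetical. The sizes (8-dimensional hidden states, 2 sentiment classes, 5 entity tags) are arbitrary assumptions. The per-token case is written as one shared linear layer applied at every position, which is a common way to implement it.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a single embedding vector (plain Python lists)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases for an untrained head."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def classify_sequence(encoder_outputs, head):
    """Whole-input pattern: read only position 0 (the CLS output)."""
    weight, bias = head
    return linear(encoder_outputs[0], weight, bias)

def classify_tokens(encoder_outputs, head):
    """Per-token pattern: apply the same head to every position's output."""
    weight, bias = head
    return [linear(h, weight, bias) for h in encoder_outputs]

rng = random.Random(0)
hidden_size, seq_len = 8, 9   # toy sizes, far smaller than a real encoder

# Stand-in for the encoder's output embeddings: one vector per input token.
encoder_outputs = [[rng.gauss(0, 1) for _ in range(hidden_size)] for _ in range(seq_len)]

sentiment_head = init_linear(hidden_size, 2, rng)   # e.g. negative / positive
tagging_head = init_linear(hidden_size, 5, rng)     # e.g. 5 hypothetical entity tags

sentence_logits = classify_sequence(encoder_outputs, sentiment_head)
token_logits = classify_tokens(encoder_outputs, tagging_head)

print(len(sentence_logits))                     # 2: one score per sentiment class
print(len(token_logits), len(token_logits[0]))  # 9 5: one score vector per token
```

The encoder outputs are identical in both calls; only the head and the positions it reads differ, which is the point of the rule of thumb.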

Bidirectionality forced MLM and NSP. The 80/10/10 mix is not arbitrary.
Pre-train once, fine-tune many times.
CLS for whole-input classification, per-token outputs for span detection.