Cheatsheet: BERT, part two: pretraining objectives and the train-then-fine-tune workflow

Bidirectionality forced new objectives.
Train with MLM + NSP. Fine-tune with a small head on top.
Pre-train once (expensive, on unlabeled data).
Fine-tune many times (cheap, per task, on labeled data).

Why bidirectionality forced new objectives

| Setting | Why next-token prediction works | Why next-token prediction breaks |
| --- | --- | --- |
| Decoder-only (GPT-style) | Causal mask hides the future; predicting the next token from the past is a real task | Not applicable; this is the design they use |
| Encoder-only (BERT) | Not applicable; bidirectional self-attention exposes the future | Bidirectional means the model can see the answer; predicting it is reading, not learning |

So BERT used objectives that leave something to figure out despite seeing the whole input: hide parts of it (MLM), or ask classification questions about pairs (NSP).
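
The contrast in the table can be checked on toy attention masks. This is a minimal sketch, not any library's API: `causal_mask`, `bidirectional_mask`, and `sees_next_token` are hypothetical helper names, and a mask is just a list of lists where `mask[i][j]` means "position i may attend to position j".

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position, including later ones."""
    return [[True for _ in range(n)] for _ in range(n)]


def sees_next_token(mask, i):
    """True if position i can attend to position i + 1, the next-token target."""
    return i + 1 < len(mask) and mask[i][i + 1]


n = 5
print(sees_next_token(causal_mask(n), 2))         # False: the target is hidden
print(sees_next_token(bidirectional_mask(n), 2))  # True: the target is visible
```

With the bidirectional mask the next-token target is already part of the visible input, which is why BERT needs an objective that hides or questions something instead.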

Pre-training objective 1: MLM (Masked Language Model)

Step 1: select a fraction of tokens (BERT paper: 15%)
Step 2: per selected token, apply the 80/10/10 mix:
    80% → replace with [MASK]
    10% → replace with a random vocabulary token
    10% → leave unchanged
Step 3: train the model to predict the ORIGINAL token in all three cases

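| Why the mix? | [MASK] never appears during fine-tuning or inference. Because a selected token may be masked, swapped, or left unchanged, the model cannot tell which positions it will be asked to predict, so it must maintain useful representations for every token, not just [MASK] positions. |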
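
The three MLM steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not the reference implementation: `mask_tokens` is a hypothetical helper, each token is selected independently with probability 0.15 (rather than picking an exact 15% of positions), `[CLS]` and `[SEP]` are never selected, and the vocabulary is made up.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # assumption: special tokens are never selected


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels). labels[i] holds the original token at
    selected positions and None elsewhere, so a loss would only be
    computed at selected positions."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # not selected: input unchanged, no prediction target
        labels[i] = tok  # the target is always the ORIGINAL token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


tokens = "[CLS] this teddy bear is so cute . [SEP]".split()
vocab = ["this", "teddy", "bear", "is", "so", "cute", ".", "dog", "happy"]  # toy vocabulary
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Note that `labels` is set before the 80/10/10 roll, so the prediction target is the same in all three cases; only the input the model sees differs.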

Pre-training objective 2: NSP (Next Sentence Prediction)

Step 1: pair two sentences from the corpus
Step 2: 50% real consecutive pairs, 50% random pairs
Step 3: classify via a head on the CLS token's output embedding:
    "B truly follows A" vs "B is random"

| Why? | Teaches the model sentence-level relationships. Complements MLM’s per-token focus. (Later work, e.g., RoBERTa, showed NSP is less load-bearing than the original BERT paper assumed.) |
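
A toy version of the NSP data construction might look like the following. It is a sketch under assumptions: `make_nsp_examples` and `to_model_input` are hypothetical helpers, random partners are drawn from the same short document for brevity (the original setup draws them from the wider corpus), and the `[CLS] A [SEP] B [SEP]` packing follows the layout described in the BERT paper.

```python
import random


def make_nsp_examples(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) triples from one ordered document."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to draw a random partner")
    rng = rng or random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            b, is_next = sentences[i + 1], True  # real consecutive pair
        else:
            # Random pair: anything except A itself and A's true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            b, is_next = rng.choice(candidates), False
        examples.append((a, b, is_next))
    return examples


def to_model_input(a, b):
    """Pack a sentence pair as [CLS] A [SEP] B [SEP] (whitespace tokenization)."""
    return ["[CLS]", *a.split(), "[SEP]", *b.split(), "[SEP]"]


doc = ["the bear is cute", "it sits on the bed", "rain fell all day", "we stayed inside"]
for a, b, is_next in make_nsp_examples(doc, rng=random.Random(0)):
    print(is_next, to_model_input(a, b))
```

The `is_next` flag is the label the NSP head would be trained to predict from the CLS output.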

| Stage | Data | Cost | Output |
| --- | --- | --- | --- |
| Pre-training | Large unlabeled text corpus | Expensive (one-time per release) | Pre-trained encoder weights |
| Fine-tuning | Small labeled dataset for target task | Cheap | Task-specific head on top of pre-trained encoder |

| Task | Head placement | Examples |
| --- | --- | --- |
| Sentence-level classification | Linear classifier on CLS output embedding | Sentiment, intent, document classification |
| Sentence-pair classification | Linear classifier on CLS output embedding | Entailment, paraphrase detection |
| Per-token classification | Linear classifier on every token’s output embedding | Named-entity recognition, part-of-speech tagging |
| Span detection | Two linear heads (start, end) on every token’s output embedding | Question answering |

Rule of thumb: classify the whole input → use CLS; classify per-token → use per-token outputs.
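
The head placements can be compared on stand-in encoder outputs. This is a shape-level sketch only: the "encoder output" is random numbers, the heads are untrained, `linear` and `init_head` are hypothetical helpers, and forcing the span end to fall at or after the start is a simplifying assumption of this example.

```python
import random


def linear(x, weight, bias):
    """y = W x + b for one vector x; weight has one row per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_head(hidden, n_classes, rng):
    """Small random weights and zero biases for an untrained linear head."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(n_classes)]
    return weight, [0.0] * n_classes


rng = random.Random(0)
hidden, seq_len = 8, 6
# Stand-in for encoder output: one vector per position; position 0 is [CLS].
encoder_out = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(seq_len)]

# Whole-input classification: read only the [CLS] position.
cls_w, cls_b = init_head(hidden, 2, rng)
sentence_logits = linear(encoder_out[0], cls_w, cls_b)         # 2 scores

# Per-token classification: apply the same head at every position.
tok_w, tok_b = init_head(hidden, 5, rng)
token_logits = [linear(h, tok_w, tok_b) for h in encoder_out]  # seq_len x 5 scores

# Span detection: two heads, each giving one score per position.
start_w, start_b = init_head(hidden, 1, rng)
end_w, end_b = init_head(hidden, 1, rng)
start_scores = [linear(h, start_w, start_b)[0] for h in encoder_out]
end_scores = [linear(h, end_w, end_b)[0] for h in encoder_out]
start = max(range(seq_len), key=start_scores.__getitem__)
end = max(range(start, seq_len), key=end_scores.__getitem__)   # end >= start

print(len(sentence_logits), len(token_logits), len(token_logits[0]), (start, end))
```

The only thing that changes between tasks is which positions the head reads and how many outputs it produces; the encoder output is the same.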

Pre-training (MLM):
    Input: [CLS] this teddy [MASK] is so cute . [SEP]
    Target: predict "bear" at the masked position
    Loss: cross-entropy on the original token
Fine-tuning (sentiment):
    Input: [CLS] this teddy bear is so cute . [SEP] (no [MASK])
    Encoder pass: produces output embeddings, one per position
    Head: linear classifier on CLS output → "positive"
    Other positions: output embeddings ignored by the head for this task
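
The pre-training loss in the example above is ordinary cross-entropy at the masked position. The following sketch makes that concrete with made-up numbers: the four-word vocabulary and the logits are hypothetical, and `cross_entropy` is a hand-written helper rather than a library call.

```python
import math


def cross_entropy(logits, target_index):
    """-log softmax(logits)[target_index], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


vocab = ["bear", "dog", "cute", "this"]   # hypothetical 4-word vocabulary
logits_at_mask = [2.0, 0.5, -1.0, -1.0]   # made-up MLM head scores at [MASK]
loss = cross_entropy(logits_at_mask, vocab.index("bear"))
print(round(loss, 3))
```

The loss shrinks as the score for "bear" grows relative to the other words, and it equals log(vocabulary size) when the head is indifferent between all words.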

| Phrase | What it means |
| --- | --- |
| MLM head | The pre-training head that predicts the original token at masked positions |
| NSP head | The pre-training head that predicts whether sentence B follows sentence A |
| Fine-tuning | Adapting a pre-trained encoder to a specific labeled task by attaching a small head |
| Frozen encoder | A fine-tuning recipe that does not update the pre-trained weights, only the new head |
| End-to-end fine-tuning | A fine-tuning recipe that trains the encoder weights along with the new head |

| Pitfall | Reality |
| --- | --- |
| [MASK] appears at inference time | No. [MASK] is a pre-training-only artifact. The 80/10/10 mix exists so the model handles real input without [MASK] tokens. |
| All encoder-only models use NSP | No. RoBERTa (next lesson) drops NSP entirely without losing performance. |
| Pre-training and fine-tuning have the same cost profile | No. Pre-training is the expensive one-time stage. Fine-tuning is cheap and happens per task. |
| Pick any head for any task | No. Whole-input classification → CLS; per-token tasks → per-token outputs. Wrong head, wrong task shape. |

  • MLM (Masked Language Model): pre-training objective. Mask a fraction of tokens (BERT paper: 15%) with the 80/10/10 mix; predict the original.
  • NSP (Next Sentence Prediction): pre-training objective. Pair two sentences; predict whether B genuinely follows A via a CLS-head classifier.
  • Pre-training: large-scale training on unlabeled data via MLM + NSP; produces the pre-trained encoder weights.
  • Fine-tuning: task-specific adaptation of the pre-trained encoder by adding a small head and training on labeled data.
  • CLS-head fine-tuning: the most common pattern; a linear classifier reads the CLS output for whole-input classification.
  • Per-token fine-tuning: classifier heads on every token’s output for token-level tasks (NER, span detection).

Bidirectionality forced MLM and NSP. The 80/10/10 mix is not arbitrary.
Pre-train once, fine-tune many times.
CLS for whole-input classification, per-token outputs for token-level tasks and span detection.