Practice: BERT, part two: pre-training objectives and the train-then-fine-tune workflow

Self-check

Answer in your head (or on paper) before opening the collapsible.

1. Why couldn’t BERT just train via next-token prediction the way decoder-only models do?

Show answer

Bidirectional self-attention means every token can see every other token, including the ones that would otherwise be the prediction target. Predicting the token at position 6 when the input already contains positions 1 through 8 (position 6 included) is not learning; it is reading. Next-token prediction needs the future to be hidden; BERT’s no-causal-mask design hides nothing. So BERT needed objectives that leave something for the model to figure out despite seeing the whole input.
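
A minimal sketch of the leak, assuming a toy 4-token sequence. The helper names causal_mask, bidirectional_mask, and target_leaks are illustrative, not from any library. Entry [i][j] is 1 when position i may attend to position j.

```python
def causal_mask(n):
    # Decoder-style mask: position i sees only positions 0..i.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # BERT-style: no causal mask, every position sees every position.
    return [[1] * n for _ in range(n)]

def target_leaks(mask):
    # Under next-token prediction, position i is trained to predict token i + 1.
    # The target leaks if position i is allowed to attend to position i + 1.
    return any(mask[i][i + 1] == 1 for i in range(len(mask) - 1))

n = 4
causal = causal_mask(n)
bidir = bidirectional_mask(n)

print(target_leaks(causal))  # False
print(target_leaks(bidir))   # True
```

With the causal mask the target stays hidden; with the full mask it sits in the input, which is why next-token prediction gives a bidirectional encoder nothing to learn.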

2. Walk through MLM with the 80/10/10 mix.

Show answer

MLM (Masked Language Model) is BERT’s primary pre-training objective.

Step 1. Randomly select a fraction of the tokens (the BERT paper specifies 15%).

Step 2. For each selected token, decide what to do via the 80/10/10 mix:

80% of the time, replace it with the special [MASK] token.
10% of the time, replace it with a random token from the vocabulary.
10% of the time, keep it unchanged.

Step 3. The model is trained to predict the original token at every selected position, regardless of which of the three operations was applied.

Why the mix? If we always replaced selected tokens with [MASK], the model would only ever be asked to predict at [MASK] positions and could neglect the representations of ordinary tokens. That creates a mismatch with fine-tuning and inference, where the input contains no [MASK] tokens. The mix forces the model to maintain useful representations for every token, because any token might be a random replacement or an unchanged prediction target.
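
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the original BERT data pipeline: the function name mlm_corrupt and the tiny vocabulary are made up, each token is selected independently with probability 15% (rather than selecting a fixed count), and the random replacement may occasionally equal the original token.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # structural tokens are never selected

def mlm_corrupt(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at selected positions and None
    elsewhere, meaning no loss is computed there.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Step 1: select roughly 15% of the ordinary tokens.
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok  # Step 3: the target is always the original token.
        # Step 2: apply the 80/10/10 mix.
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: keep the token unchanged
    return corrupted, labels

tokens = "[CLS] the small dog barks at the mailman . [SEP]".split()
vocab = ["telescope", "river", "green", "walks"]  # hypothetical toy vocabulary
corrupted, labels = mlm_corrupt(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Note that the labels list marks a position as a prediction target even when the input token there was left unchanged, which is the case that [MASK]-only corruption would never produce.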

3. Walk through NSP.

Show answer

NSP (Next Sentence Prediction) is BERT’s secondary pre-training objective; it complements MLM at the sentence level rather than the token level.

Step 1. Pair two sentences (sentence A followed by sentence B) from the corpus. 50% of the time, B is the actual next sentence after A in the source text. 50% of the time, B is a random sentence pulled from elsewhere.

Step 2. Format the input as [CLS] sentence_A [SEP] sentence_B [SEP], with Segment A embeddings on tokens in the first sentence and Segment B embeddings on tokens in the second sentence.

Step 3. A small classification head sits on top of the CLS token’s output embedding. It predicts a binary label: “B truly follows A” or “B is a random sentence.”

Why? The CLS embedding integrates information from both sentences (via bidirectional self-attention plus the segment encoding telling the model which tokens belong to which sentence), so it serves as a sentence-pair-level representation. NSP teaches the model to relate one chunk of text to another, complementing MLM’s per-token focus.
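
A rough sketch of how one NSP training example could be assembled. The function name make_nsp_example and the four-sentence corpus are hypothetical, and the negative sentence is drawn from the same small list for simplicity, whereas a real pipeline would draw it from elsewhere in a large corpus. Segment ids follow the usual convention of 0 for [CLS], sentence A, and the first [SEP], and 1 for sentence B and the final [SEP].

```python
import random

def make_nsp_example(sentences, idx, rng):
    """Build one NSP example from a list of tokenized sentences.

    Returns (tokens, segment_ids, is_next), where is_next is 1 when
    sentence B truly follows sentence A and 0 when B is random.
    Requires idx < len(sentences) - 1.
    """
    sent_a = sentences[idx]
    if rng.random() < 0.5:
        sent_b, is_next = sentences[idx + 1], 1
    else:
        # Pick a random sentence that is neither A nor the true next one.
        candidates = [j for j in range(len(sentences)) if j not in (idx, idx + 1)]
        sent_b, is_next = sentences[rng.choice(candidates)], 0
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next

corpus = [
    "the dog barks".split(),
    "the mailman runs".split(),
    "stocks fell today".split(),
    "rain is expected".split(),
]
tokens, segment_ids, is_next = make_nsp_example(corpus, 0, random.Random(0))
print(tokens)
print(segment_ids)
print(is_next)
```

The binary label is_next is what the classification head on the CLS output is trained to predict.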

4. What are the two stages in BERT’s training workflow, and what happens in each?

Show answer

Pre-training. Run both MLM and NSP simultaneously on a large unlabeled corpus. The output is a set of pre-trained weights for the encoder (token embeddings, position embeddings, segment embeddings, every encoder block’s parameters). Expensive; happens once per model release.

Fine-tuning. Take the pre-trained encoder and adapt it to a specific labeled task. Attach a small task-specific head (typically a linear layer), train on labeled data. Either freeze the pre-trained weights and only train the new head, or train everything end-to-end (the choice depends on how different the target task is and how much labeled data you have).

The big win: pre-training uses unlabeled data (free at scale); fine-tuning typically needs far less labeled data than training from scratch, because the pre-trained representations are already useful.
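
To put the “small task-specific head” in numbers, here is a back-of-the-envelope sketch. It assumes the BERT-base configuration (hidden size 768, roughly 110 million parameters, as reported for that model) and a hypothetical 3-label task; num_labels is an illustrative choice.

```python
hidden_size = 768              # BERT-base hidden size
encoder_params = 110_000_000   # approximate BERT-base parameter count
num_labels = 3                 # hypothetical task: positive / negative / neutral

# A linear head is a weight matrix (hidden_size x num_labels) plus a bias vector.
head_params = hidden_size * num_labels + num_labels

print(head_params)  # 2307
print(f"head is {head_params / encoder_params:.4%} of the encoder's size")
```

If the encoder is frozen, only those few thousand head parameters are trained; if everything is trained end-to-end, the head is still a negligible addition to the model.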

5. The same input “this teddy bear is so cute.” shows up in pre-training and in fine-tuning. What is different about how the model sees it in each stage?

Show answer

During pre-training (MLM). Some token in the input may be replaced according to the 80/10/10 mix: maybe [MASK], maybe a random vocabulary token, maybe left unchanged. The model is trained to predict the original token at the selected position. Across millions of such examples, it learns rich token-level representations.

During fine-tuning. The input is plain text plus the structural tokens. No [MASK]. The encoder produces context-aware output embeddings; a small task-specific head reads them. For sentiment, a linear classifier on the CLS output. For per-token tasks, heads on every token’s output.

The same encoder weights run in both stages; what changes is what the model is being asked to predict and which output is being read.

6. CLS-head versus per-token outputs: how do you decide?

Show answer

CLS-head is for whole-input classification: the question is “what is the label for this input as a whole?” Sentiment extraction, intent detection, document classification, sentence-pair classification (entailment, paraphrase detection) all follow this pattern.

Per-token heads are for token-level tasks: the question is “what is the label at each position?” Named-entity recognition (label per token), question answering with span detection (start position and end position of the answer span), and part-of-speech tagging all follow this pattern.

The rule of thumb is exactly that: classify the whole input, use CLS; classify per-token, use per-token outputs.
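
The rule of thumb can be sketched with a toy example. Everything here is made up for illustration: the encoder output is a hand-written list of 2-dimensional vectors, and the head weights are untrained placeholders. The point is only which outputs each kind of head reads.

```python
def linear(vec, weights, bias):
    # weights has one row per output class; returns one score per class.
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def argmax(scores):
    return max(range(len(scores)), key=lambda k: scores[k])

# Toy encoder output: 4 positions ([CLS] plus 3 tokens), hidden size 2.
encoder_out = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]

# Hypothetical 2-class head (placeholder weights, not trained).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

# Whole-input task: read only position 0 (the CLS output), get one label.
cls_label = argmax(linear(encoder_out[0], W, b))

# Token-level task: apply a head at every token position, get one label each.
# (The CLS position is skipped here because it carries no token label.)
token_labels = [argmax(linear(h, W, b)) for h in encoder_out[1:]]

print(cls_label)     # 0
print(token_labels)  # [1, 0, 1]
```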

Try it yourself: pick the right fine-tuning head

For each task below, decide whether to attach the classification head on the CLS token’s output embedding or on per-token output embeddings.

a) Sentiment extraction: input is a movie review; output is “positive” / “negative” / “neutral.”

Show answer

CLS token’s output embedding. This is a sentence-level (whole-input) classification task. The CLS embedding integrates context from every token; a small linear classifier on top reads it and produces the label.

b) Question answering with span detection: input is a passage and a question; output is the start and end positions of the answer inside the passage.

Show answer

Per-token output embeddings. This is a token-level task: every position in the input gets a score for being the start of the answer span and a score for being the end. Typically two linear heads (one for start, one for end) are applied to every token’s output embedding, and the predicted answer is the span whose start and end scores are jointly highest, with the end at or after the start.

c) Named-entity recognition: input is a sentence; output is a label per word indicating whether it’s a person, organization, location, or none.

Show answer

Per-token output embeddings. Per-token classification: for each input token, predict its entity label. One linear classifier head applied to every token’s output embedding, with as many output classes as you have entity types (plus “none”).

d) Document classification: input is an entire news article; output is a category like “sports” / “politics” / “tech.”

Show answer

CLS token’s output embedding. Same shape as sentiment (whole-document classification): the CLS embedding summarizes the input, and a linear classifier produces the category.

e) Paraphrase detection: input is two sentences; output is “paraphrase” or “not paraphrase.”

Show answer

CLS token’s output embedding. The two sentences are formatted with [CLS] A [SEP] B [SEP], segment embeddings split A and B, and the CLS output embedding integrates both via bidirectional self-attention. A binary classifier on the CLS output reads it and decides. This is structurally the same shape as NSP, but applied to a different sentence-pair labeling task.

Sanity check: the rule of thumb is “classify the whole input → use CLS; classify per-token → use per-token outputs.” Once you have that, the fine-tuning shape for any new task follows the same pattern: pre-trained encoder + task-specific head + labeled examples.

Try it yourself: trace the training loop

About 10 minutes with a pen.

For the input “the small dog barks at the mailman.”:

a) Show one possible MLM masking pattern at the 15% rate. Pick which token gets masked; pick what happens to it under the 80/10/10 mix.

Show example

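The sequence [CLS] the small dog barks at the mailman . [SEP] has 10 tokens, 8 of them ordinary tokens (assuming one token per word, plus the period); the structural tokens [CLS] and [SEP] are not candidates for masking. 15% of 8 is 1.2, so roughly 1 token gets selected. Suppose we pick “barks”.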

Apply the 80/10/10 mix:

80% case: replace it with [MASK]. Input becomes [CLS] the small dog [MASK] at the mailman . [SEP]. Train the model to predict “barks” at that position.
10% case: replace it with a random vocabulary token (say, “telescope”). Input becomes [CLS] the small dog telescope at the mailman . [SEP]. Train the model to predict “barks” at that position (despite the misleading random token).
10% case: leave it unchanged. Input stays [CLS] the small dog barks at the mailman . [SEP]. Train the model to predict “barks” at that position anyway.

In all three cases, the loss is computed at the selected position only, against the original token “barks”.
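
The “selected position only” rule amounts to a masked cross-entropy. Below is a minimal sketch under toy assumptions: a 3-token vocabulary, 4 positions, and None as the marker for unselected positions (libraries typically use a sentinel integer instead). The function names are illustrative.

```python
import math

def cross_entropy_at(logits_row, target_id):
    # Softmax cross-entropy for one position, computed in a numerically stable way.
    m = max(logits_row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum - logits_row[target_id]

def mlm_loss(logits, labels):
    # labels[i] is the original token id at selected positions, None elsewhere.
    # Assumes at least one position is selected.
    losses = [cross_entropy_at(row, y) for row, y in zip(logits, labels) if y is not None]
    return sum(losses) / len(losses)

# Model scores per position over a 3-token vocabulary (made-up numbers).
logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
# Only position 2 was selected; its original token has id 2.
labels = [None, None, 2, None]

print(mlm_loss(logits, labels))  # log(3), about 1.0986
```

Whatever the model outputs at the unselected positions has no effect on this loss; only the prediction at the selected position is compared against the original token.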

b) Now imagine pre-training has finished. Set up a sentiment fine-tuning task on this same input. Where does the head attach, and what does it produce?

Show answer

The encoder runs on the plain input [CLS] the small dog barks at the mailman . [SEP] (no [MASK]). It produces output embeddings, one per input position. A linear classification head reads the CLS output (position 0) and produces a sentiment label (e.g., “neutral” for this slightly hostile-sounding mailman scenario). The other token outputs are discarded for this task.

If you reused the same encoder for a per-token task (e.g., named-entity recognition), the head would attach to every token’s output instead of just CLS. Same encoder, same input, different head, different task.

Flashcards

Twelve cards.

Q. Why couldn't BERT train via next-token prediction?

Bidirectional self-attention means every token can see every other token, including the future ones. Next-token prediction collapses (the answer is in the input). BERT needed objectives that leave something for the model to figure out despite seeing the whole sequence.

Q. What is MLM and what is the 80/10/10 mix?

Masked Language Model. Randomly select a fraction of tokens (BERT paper: 15%); for each, decide via 80/10/10: 80% replace with [MASK], 10% replace with a random token, 10% keep unchanged. Train the model to predict the original token in all three cases.

Q. Why the 80/10/10 mix instead of always replacing with [MASK]?

If we always used [MASK], the model would learn that [MASK] is the only place it needs to predict. At inference there are no [MASK] tokens, so that breaks. The mix forces the model to maintain useful representations for every token regardless of whether it’s masked, random, or unchanged.

Q. What is NSP?

Next Sentence Prediction. Pair two sentences (50% real consecutive, 50% random). A classification head on the CLS token’s output predicts whether B genuinely follows A. Teaches the model sentence-level relationships, complementing MLM’s per-token focus.

Q. What are the two stages in BERT's training workflow?

Pre-training: run MLM and NSP simultaneously on a large unlabeled corpus. Fine-tuning: attach a task-specific head (typically a linear layer) on the pre-trained encoder, train on labeled data for the target task. Pre-training is expensive and happens once; fine-tuning is cheap and happens per task.

Q. Why is the train-then-fine-tune workflow such a practical win?

Pre-training uses unlabeled data, which is essentially free at scale. The pre-trained encoder is reusable across many downstream tasks. Fine-tuning typically needs only hundreds to thousands of labeled examples because the pre-trained representations already know how to read text.

Q. When do you use a classification head on CLS vs per-token outputs?

CLS for whole-input classification (sentiment, intent, document classification, sentence-pair classification). Per-token for token-level tasks (named-entity recognition, question-answering with span detection, part-of-speech tagging). Rule of thumb: classify the whole input → CLS; classify per-token → per-token outputs.

Q. Common pitfall: does the [MASK] token show up at inference?

No. [MASK] is a pre-training-only artifact. At fine-tuning and inference, the input is real text plus the structural tokens (CLS, SEP). The 80/10/10 mix during pre-training exists precisely so the model handles inputs without [MASK] tokens at inference time.

Q. Common pitfall: do all encoder-only models use NSP?

No. BERT used both MLM and NSP. RoBERTa (next lesson) showed that dropping NSP and training MLM longer on more data matches or slightly improves downstream performance; NSP turned out to be less load-bearing than the original BERT paper assumed.

Q. Common pitfall: pre-training cost vs fine-tuning cost?

Different cost profiles. Pre-training is the expensive one-time stage that produces the pre-trained encoder weights. Fine-tuning is cheap and happens many times, once per downstream task. When people complain about “the cost of training BERT” they usually mean pre-training; when they talk about “training a BERT classifier for sentiment” they mean fine-tuning.

Q. What does the head decide?

The task. Same encoder, same input, different head, different task. Whether you get a single label or per-token labels depends entirely on which output of the encoder you read.

Q. What is the one-sentence takeaway?

Bidirectionality forced MLM and NSP; the 80/10/10 mix is not arbitrary; pre-train once, fine-tune many times; CLS for whole-input classification, per-token outputs for token-level tasks.