Always at position 0 of the input; its output embedding integrates information from every token; classification heads attach on top of it (next lesson)
[SEP]
Marks the end of each sentence; in a two-sentence input it separates sentence A from sentence B, which NSP relies on (next lesson)
[PAD]
Pads shorter sequences up to the fixed length of a batch; ignored by attention via masking
[MASK]
Pre-training-only artifact; covered in the next lesson
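To make the roles above concrete, here is a minimal sketch of how a two-sentence input could be laid out. The helper name `build_bert_input`, the toy tokens, and `max_len=12` are assumptions made for illustration; this is not the API of any particular library, and a real tokenizer would also map tokens to vocabulary ids.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=12):
    """Lay out already-tokenized text with BERT's special tokens (hypothetical helper)."""
    # [CLS] always sits at position 0; [SEP] closes sentence A.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment A covers [CLS] and the first [SEP]

    if tokens_b is not None:
        # Sentence B is followed by its own [SEP] and belongs to segment B.
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    if len(tokens) > max_len:
        raise ValueError("input longer than max_len; truncation is not handled in this sketch")

    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = build_bert_input(
    ["the", "cat", "sat"], ["it", "purr", "##ed"], max_len=12
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

[MASK] does not appear here because it is only inserted during pre-training.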
BERT (Bidirectional Encoder Representations from Transformers): encoder-only transformer with bidirectional self-attention.
Bidirectional self-attention: self-attention computed without a causal mask, so every token attends to every other token in both directions in one pass.
CLS token: special classification token at position 0 of every input; its output embedding is used by sentence-level classification heads (next lesson).
SEP token: special separator token marking sentence boundaries.
WordPiece: the tokenizer used by BERT; learns merge rules to build a ~30k vocabulary from a training corpus, in the same family as BPE.
Segment embedding: new in BERT; a learned vector, added to each token's input embedding, indicating which sentence (A or B) the token belongs to.
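The difference between causal and bidirectional self-attention in the glossary comes down to which positions each token is allowed to see. The sketch below builds both patterns for a four-token sequence; the function name `allowed_attention` is an assumption for illustration, and a real model would apply the pattern to attention scores before the softmax rather than print it.

```python
def allowed_attention(seq_len, causal):
    """Return a seq_len x seq_len matrix; entry [i][j] is 1 if token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Bidirectional (BERT-style): every token sees every other token.
bidirectional = allowed_attention(4, causal=False)
# Causal (decoder-style): token i sees only positions 0..i.
causal = allowed_attention(4, causal=True)

for name, mask in (("bidirectional", bidirectional), ("causal", causal)):
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

In the bidirectional case every row is all ones, so the first token can already use information from the last one in a single pass; in the causal case the matrix is lower-triangular.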
BERT drops the decoder and removes the causal mask. Removing the mask is what makes the encoder's self-attention bidirectional. CLS, SEP, and three additive embeddings (token, segment, and position) shape the input. The next lesson trains it.