Cheatsheet: BERT, part one: the bidirectional encoder and its structural tokens

BERT = Bidirectional Encoder Representations from Transformers
Drop the decoder. Keep the encoder.
Self-attention without a causal mask is bidirectional.
This lesson: architecture only. The next lesson: training.
| Property | BERT | Decoder-only (e.g., GPT-style) |
| --- | --- | --- |
| Stacks | Encoder only | Decoder only |
| Self-attention masking | None (bidirectional) | Causal (each token sees only itself and earlier tokens) |
| Forward passes for bidirectional context | One | Not applicable (decoder-only is never bidirectional) |
| Naturally good at | Classification, embeddings, span detection | Generation, completion, chat |
| Cannot do naturally | Text generation | Bidirectional context integration |

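
The masking row can be made concrete with a small sketch. It uses plain Python rather than a real transformer library, and the function name `attention_weights` and the uniform toy scores are assumptions chosen for illustration. It computes softmax attention weights once with no mask and once with a causal mask, in a single pass each.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores, optionally with a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Under a causal mask, query i may only look at keys 0..i.
        visible = [j for j in range(n) if not causal or j <= i]
        exps = {j: math.exp(scores[i][j]) for j in visible}
        total = sum(exps.values())
        weights.append([exps[j] / total if j in exps else 0.0 for j in range(n)])
    return weights

# Uniform toy scores make the effect of the mask easy to read.
scores = [[0.0] * 4 for _ in range(4)]
bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)

print(bidirectional[0])  # [0.25, 0.25, 0.25, 0.25]: token 0 sees every position
print(causal[0])         # [1.0, 0.0, 0.0, 0.0]: token 0 sees only itself
```

The only difference between the two calls is the set of visible keys; nothing is run twice to obtain the bidirectional result.
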
| Token | Purpose |
| --- | --- |
| [CLS] | Always at position 0 of the input; its output embedding integrates information from every token; classification heads attach on top of it (next lesson) |
| [SEP] | Marks the end of each sentence: one at the end of a single-sentence input, and one after each sentence when the input has two; enables NSP (next lesson) |
| [PAD] | Fills shorter sequences out to a fixed batch length; ignored by attention via masking |
| [MASK] | Pre-training-only artifact; covered in the next lesson |

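
As a rough illustration of how the structural tokens fit together, the sketch below wraps already-tokenized sentences in [CLS] and [SEP], pads to a fixed length, and builds the mask that tells attention to ignore [PAD]. The helper name `pack_inputs`, the `max_len` parameter, and giving padding positions segment id 0 are assumptions for this example, not something this lesson specifies.

```python
def pack_inputs(sentence_a, sentence_b=None, max_len=12):
    """Wrap pre-tokenized sentences in structural tokens and pad to max_len."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # Segment A covers [CLS] and the first [SEP]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
        segment_ids += [1] * (len(sentence_b) + 1)  # Segment B and its closing [SEP]
    if len(tokens) > max_len:
        raise ValueError("input longer than max_len; truncation is out of scope here")
    attention_mask = [1] * len(tokens)  # 1 = real token, 0 = padding
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad            # assumed convention for padding positions
    attention_mask += [0] * pad
    return tokens, segment_ids, attention_mask

tokens, segment_ids, attention_mask = pack_inputs(
    ["how", "are", "you"], ["i", "am", "fine"], max_len=10
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

With a single sentence (`sentence_b=None`), every real token stays in Segment A and the sequence still ends with one [SEP].
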
input_vector = token_embedding + position_embedding + segment_embedding
token_embedding: learned vector per WordPiece token (~30k vocab)
position_embedding: learned vector per absolute position (BERT's choice; the original transformer used fixed sinusoidal encodings)
segment_embedding: learned vector for Segment A or Segment B (NEW in BERT; tells the model which sentence a token belongs to in two-sentence input)

Sum the three component-wise to produce the input vector for each token.
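
A minimal sketch of that sum, using tiny random lookup tables in plain Python. The table sizes, the hidden size of 4, and the names `make_table` and `embed` are toy assumptions; a real model uses learned tables and, in the published BERT implementation, applies layer normalization and dropout after the sum, which this sketch omits.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size, assumed for illustration

def make_table(rows, dim=HIDDEN):
    """Stand-in for a learned embedding table: one random vector per row."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

token_table = make_table(10)    # one row per vocabulary id (toy vocab of 10)
position_table = make_table(8)  # one row per absolute position (toy max length 8)
segment_table = make_table(2)   # row 0 = Segment A, row 1 = Segment B

def embed(token_ids, segment_ids):
    """Look up the three embeddings for each token and add them component-wise."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = zip(token_table[tok], position_table[pos], segment_table[seg])
        vectors.append([t + p + s for t, p, s in rows])
    return vectors

# Token id 5 appears twice; its two input vectors differ only through position.
inputs = embed([1, 5, 5, 2], [0, 0, 0, 0])
print(len(inputs), len(inputs[0]))
```

Because the position row differs, the same token at two positions enters the encoder as two different vectors.
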

| Property | Value |
| --- | --- |
| Tokenizer family | Subword tokenizer, same family as BPE (Phase 1) |
| Vocabulary size | ~30,000 |
| Continuation marker | ## prefix on continuation pieces (BERT’s variant) |
| Cased vs uncased | Cased preserves capitalization; uncased lowercases first. Choice depends on whether casing matters for the task. |

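
To show what the ## continuation marker does in practice, here is a sketch of greedy longest-match-first splitting, the lookup procedure commonly used to apply a WordPiece vocabulary at tokenization time. It does not show how the vocabulary is learned. The toy vocabulary, the function name `wordpiece`, and the [UNK] fallback for unmatched words are assumptions for this example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"cute", "play", "bear", "##ing", "##ful", "##ly"}
print(wordpiece("playfully", toy_vocab))  # ['play', '##ful', '##ly']
print(wordpiece("cute", toy_vocab))       # ['cute']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

A word that is already in the vocabulary comes back as a single piece, which is why common words pass through unsplit.
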
Input: "this teddy bear is so cute."
Pre-process: "this teddy bear is so cute." (uncased: lowercase)
WordPiece: this | teddy | bear | is | so | cute | .
Add structural: [CLS] this teddy bear is so cute . [SEP]
Per-token: token_emb + position_emb + segment_emb (Segment A)
Encoder pass: every token's output embedding integrates the full context
[ Stop here. The next lesson covers what to DO with those output embeddings. ]
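
The first steps of the walkthrough above can be sketched in a few lines. This is a simplification under stated assumptions: the helper `basic_pretokenize` is hypothetical, it only lowercases and splits off ASCII punctuation, and it treats every word as a single vocabulary piece, as in the example sentence.

```python
import string

def basic_pretokenize(text, uncased=True):
    """Lowercase (for uncased models) and split punctuation into separate tokens."""
    if uncased:
        text = text.lower()
    spaced = "".join(f" {ch} " if ch in string.punctuation else ch for ch in text)
    return spaced.split()

words = basic_pretokenize("This teddy bear is so cute.")
tokens = ["[CLS]"] + words + ["[SEP]"]

# Each token gets an absolute position; a single sentence is all Segment A.
for position, token in enumerate(tokens):
    print(position, token, "Segment A")
```

The printed rows line up with the three lookups that are summed per token: the token itself, its position, and its segment.
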

| Phrase | What it means |
| --- | --- |
| BERT | The original model (~110M parameters at BERT-base scale) |
| BERT-base / BERT-large | Two size variants from the original paper |
| uncased / cased | Whether the input is lowercased before tokenization |
| WordPiece | The subword tokenizer BERT uses |
| Hugging Face | The de facto repository of off-the-shelf encoder models, including the entire BERT family |


| Pitfall | Reality |
| --- | --- |
| BERT is interchangeable with GPT | No. Different architectures (encoder-only vs decoder-only), different attention masking, different uses. |
| BERT can do text generation | Not naturally. No decoder, no autoregressive loop. |
| Bidirectional means two passes | No. One forward pass through the encoder; bidirectionality comes from the absence of causal masking, not from running attention twice. |
| Segment embedding is in every transformer | No. It is BERT-specific. The original 2017 transformer did not have it. |

  • BERT (Bidirectional Encoder Representations from Transformers): encoder-only transformer with bidirectional self-attention.
  • Bidirectional self-attention: self-attention computed without a causal mask, so every token attends to every other token in both directions in one pass.
  • CLS token: special classification token at position 0 of every input; its output embedding is used by sentence-level classification heads (next lesson).
  • SEP token: special separator token marking sentence boundaries.
  • WordPiece: the tokenizer used by BERT; learns merge rules to build a ~30k vocabulary from a training corpus, in the same family as BPE.
  • Segment embedding: new in BERT; a learned vector indicating which sentence (A or B) a token belongs to.

BERT drops the decoder and removes the causal mask.
That is what makes the encoder bidirectional.
CLS, SEP, and three additive embeddings shape the input. The next lesson trains it.