BERT architecture: cheatsheet

The one idea that matters

BERT = Bidirectional Encoder Representations from Transformers

Drop the decoder. Keep the encoder.
Self-attention without a causal mask is bidirectional.
This lesson: architecture only. The next lesson: training.

Architecture at a glance

Property	BERT	Decoder-only (e.g., GPT-style)
Stacks	Encoder only	Decoder only
Self-attention masking	None (bidirectional)	Causal (each token sees only past)
Forward passes for bidirectional context	One	Not applicable (decoder-only is never bidirectional)
Naturally good at	Classification, embeddings, span detection	Generation, completion, chat
Cannot do naturally	Text generation	Bidirectional context integration

Structural tokens

Token	Purpose
`[CLS]`	Always at position 0 of the input; its output embedding integrates information from every token; classification heads attach on top of it (next lesson)
`[SEP]`	Marks sentence boundaries when the input has two sentences; enables NSP (next lesson)
`[PAD]`	Fills out a fixed-length training batch; ignored by attention via masking
`[MASK]`	Pre-training-only artifact; covered in the next lesson

Input embedding (three additive layers)

input_vector = token_embedding + position_embedding + segment_embedding

token_embedding:    learned vector per WordPiece token (~30k vocab)
position_embedding: learned vector per absolute position (BERT-specific
                    choice; original transformer used hard-coded sinusoidal)
segment_embedding:  Segment A or Segment B (NEW in BERT; tells the model
                    which sentence a token belongs to in two-sentence input)

Sum the three component-wise to produce the input vector for each token.

WordPiece, briefly

Property	Value
Tokenizer family	Subword tokenizer, same family as BPE (Phase 1)
Vocabulary size	~30,000
Continuation marker	`##` prefix on continuation pieces (BERT’s variant)
Cased vs uncased	Cased preserves capitalization; uncased lowercases first. Choice depends on whether casing matters for the task.

Walked example (architecture side only)

Input:           "this teddy bear is so cute."

Pre-process:     "this teddy bear is so cute."  (uncased: lowercase)

WordPiece:       this | teddy | bear | is | so | cute | .

Add structural:  [CLS] this teddy bear is so cute . [SEP]

Per-token:       token_emb + position_emb + segment_emb (Segment A)

Encoder pass:    every token's output embedding integrates the full context

[ Stop here. The next lesson covers what to DO with those output embeddings. ]

What you see in the wild

Phrase	What it means
BERT	The original model (~100M parameters at BERT-base scale)
BERT-base / BERT-large	Two size variants from the original paper
uncased / cased	Whether the input is lowercased before tokenization
WordPiece	The subword tokenizer BERT uses
Hugging Face	The de facto repository of off-the-shelf encoder models, including the entire BERT family

Pitfalls to dodge (architecture-side)

Pitfall	Reality
BERT is interchangeable with GPT	No. Different architectures (encoder-only vs decoder-only), different attention masking, different uses.
BERT can do text generation	Not naturally. No decoder, no autoregressive loop.
Bidirectional means two passes	No. One forward pass through the encoder; bidirectionality comes from the absence of causal masking, not from running attention twice.
Segment embedding is in every transformer	No. It is BERT-specific. The original 2017 transformer did not have it.

Glossary

BERT (Bidirectional Encoder Representations from Transformers): encoder-only transformer with bidirectional self-attention.
Bidirectional self-attention: self-attention computed without a causal mask, so every token attends to every other token in both directions in one pass.
CLS token: special classification token at position 0 of every input; its output embedding is used by sentence-level classification heads (next lesson).
SEP token: special separator token marking sentence boundaries.
WordPiece: the tokenizer used by BERT; learns merge rules to build a ~30k vocabulary from a training corpus, in the same family as BPE.
Segment embedding: new in BERT; a learned vector indicating which sentence (A or B) a token belongs to.

BERT drops the decoder and removes the causal mask.
That is what makes the encoder bidirectional.
CLS, SEP, and three additive embeddings shape the input. The next lesson trains it.