Summary: BERT, part one: the bidirectional encoder and its structural tokens
BERT is the encoder-only branch’s defining model. Drop the decoder, keep the encoder. Self-attention without a causal mask is bidirectional: every token attends to every other token in one pass. Two structural tokens shape the input (CLS at position 0, SEP between sentences). Three additive embeddings (token + position + segment, the last being new in BERT). This summary covers the architecture only; the next lesson covers how BERT is trained.
This summary is the scan-it-in-five-minutes version. The full lesson walks the architecture, the structural tokens, the three input embeddings, the WordPiece tokenizer, and a walked example from raw text to context-aware token representations.
Core ideas
- BERT = Bidirectional Encoder Representations from Transformers. Encoder-only architecture. Self-attention is computed without a causal mask, so the token at position 5 can attend to positions 1, 2, 3, 4 (past) and 6, 7, 8 (future) in one pass.
- Bidirectionality matters because of what it enables. A model that can see the whole input produces context-aware token representations useful for classification, embeddings, and span-detection tasks. Decoder-only models (GPT-style) do not get this naturally: their attention is causal, so each token's representation is built only from the tokens before it.
- Bidirectionality is one forward pass, not two. The model is not running attention twice. The bidirectionality comes from the absence of a causal mask, not from running attention forward and backward.
- CLS and SEP are structural tokens. CLS is added at position 0 of every input; its output embedding is used as a sentence-level (or sentence-pair-level) representation by classification heads (covered in the next lesson). SEP closes each sentence: one SEP ends a single-sentence input, and in a two-sentence input a SEP follows each sentence, so the first one also marks the boundary between them.
- Three additive input embeddings. Token (looked up from a ~30k WordPiece vocabulary) + position (learned, one per absolute position) + segment (new in BERT; Segment A vs Segment B). Summed component-wise.
- WordPiece, ~30k vocab. Cased and uncased variants exist (the choice depends on whether casing matters for your task).
- Walked example. “this teddy bear is so cute.” In BERT: pre-process (lowercase for uncased), WordPiece tokenize, prepend [CLS] and append [SEP], compute three embeddings per token and sum, run through the encoder. End result: context-aware output embeddings, one per input position. This lesson stops here; the next lesson covers what training and fine-tuning do with those embeddings.
- ELMo aside. A concurrent paper that also pursued bidirectional representations via bidirectional LSTMs. Lost steam to BERT because LSTMs are harder to scale than transformers. (Both are Sesame Street characters.)
- Pitfall: conflating BERT with GPT. Different architectures (encoder-only vs decoder-only), different attention masking, different uses.
- Pitfall: thinking BERT can generate text. It cannot naturally; no decoder, no autoregressive loop.
- Pitfall: thinking “bidirectional” means two passes. It does not. One forward pass through the encoder; bidirectionality comes from removing the causal mask.
- Pitfall: forgetting the segment embedding is BERT-specific. It exists because BERT’s input shape can include two sentences. The original 2017 transformer did not have it.
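The point about masking in the list above can be made concrete with a small sketch. It is a minimal illustration, not BERT's actual implementation: it is pure Python, uses random query and key vectors in place of learned projections, covers a single attention head, and the function name `attention_weights` is made up for this example. The only difference between the two calls is whether the causal mask is applied.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k, causal):
    """Return the n x n attention-weight matrix for one head.

    q, k: lists of n vectors (lists of floats), each of dimension d.
    causal: if True, position i may only attend to positions j <= i.
    """
    n, d = len(q), len(q[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                # Future position: masked out before the softmax.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d))
        weights.append(softmax(scores))
    return weights


# Toy inputs (assumption: random vectors stand in for learned projections).
random.seed(0)
n, d = 8, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

bidirectional = attention_weights(q, k, causal=False)  # encoder-style, no mask
causal = attention_weights(q, k, causal=True)          # decoder-style, masked

# Row index 4 is the token at position 5 (1-indexed) from the list above.
print([w > 0 for w in bidirectional[4]])  # every position receives weight
print([w > 0 for w in causal[4]])         # positions 6, 7, 8 receive none
```

Both matrices come from a single call each, which is the sense in which bidirectionality is one forward pass: nothing runs twice, the mask is simply absent.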
What changes for you
When you see “BERT,” “BERT-base,” “RoBERTa,” “DistilBERT,” or any of the dozens of BERT-family models on Hugging Face or in research papers, you now know what the architecture is doing. You know what CLS and SEP are for. You know that the input is shaped by three additive embeddings, with segment being the BERT-specific addition. The next lesson covers what BERT was trained to do (pretraining objectives MLM and NSP) and how a pre-trained encoder gets adapted to a specific task (fine-tuning patterns).
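As a recap of the input side, the sketch below lays out a sentence pair with [CLS] and [SEP] and sums the three embeddings per position. It rests on several assumptions: the vocabulary is a toy one built from whole words (not real WordPiece pieces), the embedding tables are random rather than learned, the dimension is 4, and the helper names `build_pair_input`, `make_table`, and `embed` are hypothetical. The segment assignment follows the usual convention of putting [CLS] and the first [SEP] in Segment A and the second sentence plus its [SEP] in Segment B.

```python
import random


def build_pair_input(tokens_a, tokens_b):
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment A (0) covers [CLS], sentence A, and the first [SEP];
    # Segment B (1) covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


def make_table(rows, dim, rng):
    # Random stand-in for a learned embedding table (assumption).
    return [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(tokens, segment_ids, vocab, tok_table, pos_table, seg_table):
    # Input embedding = token + position + segment, summed component-wise.
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        t = tok_table[vocab[tok]]
        p = pos_table[pos]
        s = seg_table[seg]
        out.append([a + b + c for a, b, c in zip(t, p, s)])
    return out


rng = random.Random(0)
tokens, segment_ids = build_pair_input(["my", "dog"], ["is", "cute"])

# Toy vocabulary built from the tokens themselves (assumption).
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

dim = 4
tok_table = make_table(len(vocab), dim, rng)
pos_table = make_table(16, dim, rng)  # one row per absolute position
seg_table = make_table(2, dim, rng)   # Segment A and Segment B

x = embed(tokens, segment_ids, vocab, tok_table, pos_table, seg_table)

for tok, seg in zip(tokens, segment_ids):
    print(tok, seg)
```

The two [SEP] tokens share one token embedding but end up with different input vectors, because their position and segment embeddings differ. Real implementations typically also apply layer normalization and dropout after the sum; both are omitted here to keep the addition visible.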
BERT drops the decoder and removes the causal mask.
That is what makes the encoder bidirectional.
CLS, SEP, and three additive embeddings shape the input. The next lesson trains it.