BERT, part one: the bidirectional encoder and its structural tokens

This is lesson 8 of Phase 2, How models think: the transformer architecture, in Track 5 (AI Foundations). The previous lesson covered the encoder-decoder branch (T5 and span corruption). This lesson is the architecture half of BERT (Bidirectional Encoder Representations from Transformers): drop the decoder, remove the causal mask so self-attention is bidirectional, and shape the input with two structural tokens (CLS, SEP) plus three additive embeddings (token + position + segment, the last being new in BERT). The lesson stops once context-aware token representations come out of the encoder; the next lesson covers what BERT was trained to do with them.

BERT is a single topic split across two consecutive lessons in this phase. The previous lesson (How transformers turn input into output: encoder-decoder and T5’s span corruption) covered the encoder-decoder branch. This lesson covers BERT’s architecture only. The next lesson, BERT, part two: pretraining objectives and the train-then-fine-tune workflow, covers how BERT is trained. The phase closes with the lesson on BERT derivatives.

Prerequisites: the transformer block lesson, the multi-head attention lesson, and the encoder-decoder and T5 lesson. We assume you understand what an encoder is, what causal masking does to attention, and what an embedding lookup looks like.
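As a quick refresher on the causal-masking prerequisite, the sketch below contrasts the two attention patterns this lesson compares. The function names (`causal_mask`, `bidirectional_mask`, `render`) are hypothetical, chosen for illustration, and the masks are plain boolean grids rather than anything from a real library. It also ignores padding, which a real encoder would still mask out.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    In a decoder-only model a token sees itself and earlier tokens only.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style pattern: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as text: 'x' = allowed, '.' = blocked."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


print("causal (decoder-only):")
print(render(causal_mask(4)))
print("bidirectional (BERT-style encoder):")
print(render(bidirectional_mask(4)))
```

In the causal grid the upper triangle is blocked, so token 0 never sees token 3. In the bidirectional grid nothing is blocked, which is the sense in which removing the causal mask lets every token use context from both sides.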

  • Explain what makes BERT bidirectional and how that contrasts with the causal masking in decoder-only models
  • Identify the role of the structural tokens (CLS, SEP) and the segment encoding in BERT’s input
  • Walk through the three additive input embeddings (token + position + segment) and what each one carries
  • Describe what comes out of the encoder (context-aware token representations) and recognize that “what to do with them” is the next lesson’s territory
  • Read time: about 13 minutes
  • Practice time: about 12 minutes (a walked-example tokenization through the BERT input pipeline plus an architecture-spotting exercise across four task scenarios)
  • Difficulty: standard
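To make the input-side objectives above concrete, here is a minimal sketch of how a sentence pair could be laid out with the structural tokens and then turned into summed embeddings. It rests on several assumptions: the words are treated as ready-made tokens (real BERT uses a subword tokenizer), the embedding tables are tiny random lookups rather than learned weights, and the names `build_input`, `make_table`, and `embed` are hypothetical helpers, not part of any library.

```python
import random


def build_input(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(keys, dim, rng):
    """Toy embedding table: one random vector per key (stand-in for learned weights)."""
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}


def embed(tokens, segment_ids, position_ids, tok_table, pos_table, seg_table):
    """Sum the three lookups element-wise: token + position + segment."""
    return [
        [t + p + s for t, p, s in zip(tok_table[tok], pos_table[pos], seg_table[seg])]
        for tok, pos, seg in zip(tokens, position_ids, segment_ids)
    ]


rng = random.Random(0)
DIM = 4  # toy width, far smaller than a real model's hidden size

tokens, segment_ids, position_ids = build_input(["the", "cat", "sat"], ["it", "purred"])
tok_table = make_table(sorted(set(tokens)), DIM, rng)
pos_table = make_table(position_ids, DIM, rng)
seg_table = make_table([0, 1], DIM, rng)
vectors = embed(tokens, segment_ids, position_ids, tok_table, pos_table, seg_table)

print(tokens)       # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
print(len(vectors), len(vectors[0]))  # 8 4
```

Note that both `[SEP]` tokens share one token embedding but end up with different input vectors, because their position and segment lookups differ. That is what the additive scheme buys: each of the three tables carries one kind of information, and the sum hands the encoder all three at once.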