BERT, part one: the bidirectional encoder and its structural tokens
What you’ll learn
This is lesson 8 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The previous lesson covered the encoder-decoder branch (T5 and span corruption). This lesson covers the architecture half of BERT (Bidirectional Encoder Representations from Transformers): drop the decoder, remove the causal mask so that self-attention is bidirectional, and shape the input with two structural tokens (CLS, SEP) plus three additive embeddings (token + position + segment, the last being new in BERT). The lesson stops at the point where context-aware token representations come out of the encoder; the next lesson covers what BERT was trained to do with them.
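To make "remove the causal mask" concrete, here is a minimal sketch in plain Python that contrasts the two attention patterns. The helper names (`causal_mask`, `bidirectional_mask`, `show`) are made up for this illustration and do not come from any library, and padding masks are left out to keep the picture small.

```python
def causal_mask(n):
    """Decoder-style mask: position i may attend to position j only when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style (BERT) mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


print("causal:")
print(show(causal_mask(4)))
print("bidirectional:")
print(show(bidirectional_mask(4)))
```

Each row is one query position. In the causal grid the upper triangle is blocked, so a token never sees what comes after it. In the bidirectional grid nothing is blocked, so every token's representation can draw on context from both sides.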
Where this fits
This is lesson 8 of Phase 2, How models think: the transformer architecture. BERT is a single mental object split across two consecutive lessons in this phase. The previous lesson (How transformers turn input into output: encoder-decoder and T5’s span corruption) covered the encoder-decoder branch. This lesson covers BERT’s architecture only. The next lesson, BERT, part two: pretraining objectives and the train-then-fine-tune workflow, covers how BERT is trained. The phase closes with the lesson on BERT derivatives.
Before you start
Prerequisites: the lessons on the transformer block, multi-head attention, and the encoder-decoder architecture with T5. We assume you understand what an encoder is, what causal masking does to attention, and what an embedding lookup looks like.
By the end, you’ll be able to
- Explain what makes BERT bidirectional and how that contrasts with the causal masking in decoder-only models
- Identify the role of the structural tokens (CLS, SEP) and the segment encoding in BERT’s input
- Walk through the three additive input embeddings (token + position + segment) and what each one carries
- Describe what comes out of the encoder (context-aware token representations) and recognize that “what to do with them” is the next lesson’s territory
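As a preview of the second and third objectives, the sketch below lays out a sentence pair with the structural tokens, assigns segment and position ids, and sums three embedding lookups per token. It is a toy under stated assumptions: the words are split by hand rather than by BERT's subword tokenizer, the embedding tables are small random lists rather than learned weights, and every name (`build_pair_input`, `make_table`, `embed`, `DIM`) is hypothetical. The structural tokens are written with brackets, `[CLS]` and `[SEP]`, which is how they appear in BERT vocabularies.

```python
import random


def build_pair_input(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(rows, dim, rng):
    """Toy embedding table: random values stand in for learned weights."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


def embed(tokens, segment_ids, position_ids, vocab, tok_table, pos_table, seg_table):
    """Sum the token, position, and segment vectors element-wise for each token."""
    out = []
    for tok, seg, pos in zip(tokens, segment_ids, position_ids):
        t, p, s = tok_table[vocab[tok]], pos_table[pos], seg_table[seg]
        out.append([a + b + c for a, b, c in zip(t, p, s)])
    return out


tokens, segment_ids, position_ids = build_pair_input(["the", "cat"], ["it", "sat"])
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}  # toy vocabulary

rng = random.Random(0)
DIM = 4  # toy embedding width, chosen only to keep the output readable
tok_table = make_table(len(vocab), DIM, rng)
pos_table = make_table(len(tokens), DIM, rng)
seg_table = make_table(2, DIM, rng)

vectors = embed(tokens, segment_ids, position_ids, vocab, tok_table, pos_table, seg_table)
print(tokens)
print(segment_ids)
print(len(vectors), "vectors of width", len(vectors[0]))
```

The point to notice is that all three lookups produce vectors of the same width, so they can be added rather than concatenated. The result is one vector per input token, which is what the encoder stack then turns into context-aware representations.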
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 12 minutes (a walked-example tokenization through the BERT input pipeline plus an architecture-spotting exercise across four task scenarios)
- Difficulty: standard