References: BERT, part one: the bidirectional encoder and its structural tokens

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 2, Transformer-based models & tricks):
https://www.youtube.com/watch?v=yT84Y5zCnaA
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the architecture half of Stanford CME 295 Lecture 2's
BERT section. The next lesson covers BERT's pretraining objectives (MLM
and NSP) and the train-then-fine-tune workflow. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

A short list, chosen for durability.

Topics that build on or sit beside this one.

  • The BERT family (architecture variants). Most of the BERT family keeps the encoder-only, bidirectional shape and changes something else: the training recipe, the model size, or the attention mechanism. DeBERTa, for instance, introduces disentangled attention, which represents a token's content and its relative position as separate vectors; ALBERT shares parameters across layers to cut the parameter count. The next lesson covers DistilBERT (compression via distillation) and RoBERTa (a better training recipe).

  • Cross-lingual BERT variants. mBERT (multilingual BERT, trained on 104 languages) and XLM / XLM-R (cross-lingual language models) extend the BERT recipe to multilingual settings. This loosely parallels the mT5 line in the encoder-decoder family from the previous lesson.

  • Sentence embeddings. BERT’s CLS embedding is sometimes used directly as a sentence embedding for similarity tasks, but the Sentence-BERT paper (Reimers & Gurevych, 2019) showed that vanilla BERT CLS embeddings are surprisingly weak for similarity, often worse than averaged GloVe word vectors. Fine-tuning BERT in a siamese setup on sentence-pair data (NLI and STS in the paper) produces much better sentence embeddings. Worth reading if you build retrieval or semantic-search systems.

  • Where to go next. The next lesson covers BERT’s training: pretraining objectives (MLM and NSP), the 80/10/10 masking mix, and the train-then-fine-tune workflow.
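
To make the sentence-embeddings item above concrete, here is a minimal sketch of two ways to collapse per-token encoder outputs into one sentence vector: taking the CLS position, or averaging the non-padding tokens (the default pooling in Sentence-BERT). The vectors below are made-up toy values, not real BERT outputs, and the function names `cls_pool`, `mean_pool`, and `cosine` are hypothetical helpers written for this illustration.

```python
import math


def cls_pool(token_vectors):
    """Return the vector at position 0, where BERT-style inputs place CLS."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-in for encoder output (assumption: 4 positions, 3 dimensions).
hidden = [
    [1.0, 0.0, 0.0],  # CLS position
    [0.0, 2.0, 0.0],  # first word token
    [0.0, 0.0, 2.0],  # second word token
    [9.0, 9.0, 9.0],  # padding, excluded by the mask
]
mask = [1, 1, 1, 0]

sentence_cls = cls_pool(hidden)
sentence_mean = mean_pool(hidden, mask)
similarity = cosine(sentence_cls, sentence_mean)
```

The two pooled vectors generally differ, which is why the choice of pooling (and whether the encoder was fine-tuned for similarity) matters when the result is fed to a cosine-similarity search.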

The primary papers, in chronological order.

None selected for this lesson. BERT’s architectural place in NLP is well established, and the relevant discussion has settled into the academic literature and the Hugging Face documentation. Durable references will be added at a future quarterly review if any emerge.