References: BERT, part one: the bidirectional encoder and its structural tokens

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks):
    https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the architecture half of Stanford CME 295 Lecture 2's
BERT section. The next lesson covers BERT's pretraining objectives (MLM
and NSP) and the train-then-fine-tune workflow. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with
Stanford and the instructors.

Going deeper

A short list, chosen for durability.

“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Devlin et al., 2019. The BERT paper. The lecturer’s “around 170k citations” claim refers to this paper. Section 3 covers the architecture in detail. (Pretraining and fine-tuning live in the next lesson’s references.)
“Deep contextualized word representations”, Peters et al., 2018. The ELMo paper the lecturer mentions. Worth reading for its bidirectional-LSTM approach to contextual representations, the precursor to BERT, and for the contrast with what made BERT scale better.
“Attention Is All You Need”, Vaswani et al., 2017. The original transformer paper. BERT is the encoder of this architecture, applied with different pretraining objectives (next lesson). Read section 3 for the encoder mechanics if you have not already.
Hugging Face Transformers documentation for BERT. The de facto reference for using BERT in practice. Covers tokenizers, model variants (cased, uncased, multilingual), and loadable pre-trained checkpoints.
“Japanese and Korean Voice Search”, Schuster & Nakajima, 2012. The paper that introduced the WordPiece approach (Google). For readers who want to dig into the tokenizer BERT uses; also relevant background for our tokens lesson in Phase 1.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Single-page reference for the same material in their dense visual style.
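
To make the WordPiece entry above concrete, here is a minimal sketch of greedy longest-match-first splitting, the way WordPiece-style tokenizers commonly segment a word at inference time, followed by wrapping the result in BERT's structural tokens. The vocabulary is a toy one invented for illustration, and the names wordpiece_tokenize and wrap_single_sentence are hypothetical, not part of any library. The sketch does not show how a vocabulary is learned, and real tokenizers also handle casing, punctuation, and length limits.

```python
# Toy vocabulary, made up for illustration. Pieces that continue a word
# carry the "##" prefix, following BERT's WordPiece convention.
TOY_VOCAB = {"un", "##aff", "##able", "play", "##ing"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def wrap_single_sentence(words, vocab):
    """Tokenize each word and add [CLS] at the start and [SEP] at the end."""
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


example = wrap_single_sentence(["unaffable", "playing"], TOY_VOCAB)
print(example)
```

With this toy vocabulary, "unaffable" splits into un, ##aff, ##able, and a word with no matching pieces falls back to [UNK].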

Adjacent topics

Topics that build on or sit beside this one.

The BERT family (architecture variants). Most of the BERT family keeps the encoder-only, bidirectional shape and changes other dimensions (training recipe, size, attention mechanism). DeBERTa, for instance, introduces disentangled attention, which represents each token's content and relative position with separate vectors; ALBERT shares parameters across layers to shrink the model. The next lesson covers DistilBERT (compression via distillation) and RoBERTa (better training recipe).
Cross-lingual BERT variants. mBERT (multilingual BERT, trained on 104 languages) and XLM / XLM-R (cross-lingual language models) extend the BERT recipe to multilingual settings. Loosely parallels the mT5 line in the encoder-decoder family from the previous lesson.
Sentence embeddings. BERT’s CLS embedding is sometimes used directly as a sentence embedding for similarity tasks, but the Sentence-BERT paper (Reimers & Gurevych, 2019) showed that vanilla BERT CLS embeddings are weak for similarity, scoring below averaged GloVe vectors on standard semantic-similarity benchmarks. Fine-tuning BERT on sentence pairs with a similarity-shaped objective produces much better sentence embeddings. Worth reading if you build retrieval or semantic-search systems.
Where to go next. The next lesson covers BERT’s training: pretraining objectives (MLM and NSP), the 80/10/10 masking mix, and the train-then-fine-tune workflow.
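
For the sentence-embeddings entry above, the two pooling choices can be sketched as follows: taking the CLS position's vector versus averaging the vectors of all non-padding tokens (mean pooling is the default configuration in the Sentence-BERT paper). The token vectors below are made-up numbers standing in for encoder outputs, not results from any model, and the function names are hypothetical.

```python
import math


def cls_pool(token_vectors):
    """Use the vector at position 0, where [CLS] sits, as the sentence vector."""
    return token_vectors[0]


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask value 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional "encoder outputs" for four positions.
# The last position is padding and is masked out of the mean.
toy_vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]

cls_vector = cls_pool(toy_vectors)
mean_vector = mean_pool(toy_vectors, toy_mask)
print(cls_vector, mean_vector, cosine(cls_vector, mean_vector))
```

Either pooled vector can be compared across sentences with cosine similarity; which pooling works better depends on how the encoder was trained.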

Original sources

The primary papers, in chronological order.

“Attention Is All You Need”, Vaswani et al., 2017. The original transformer.
“Deep contextualized word representations”, Peters et al., 2018. ELMo.
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Devlin et al., 2019. BERT.

Community discussion

None selected for this lesson. BERT’s architectural place in NLP is well established, and the relevant discussion has consolidated into the academic literature and the Hugging Face documentation. Durable references will be added at a future quarterly review if any emerge.