References: How models know word order

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 2, Transformer-based models & tricks):
https://www.youtube.com/watch?v=yT84Y5zCnaA
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the position-embedding opening segment of Stanford CME 295
Lecture 2. The lecture's full treatment of position embeddings spans both the
original 2017 schemes (covered here) and the modern attention-injected schemes
(RoPE, T5 relative bias, ALiBi) covered in the Phase 2 lesson. Clawdemy splits
the topic across two lessons because the modern schemes require understanding
self-attention, which is taught in Phase 2. Clawdemy provides original notes,
summaries, and quizzes derived from this material for educational purposes.
All rights to the original lectures remain with Stanford and the instructors.
  • “Attention Is All You Need”, Vaswani et al., 2017. The original transformer paper. Section 3.5 (“Positional Encoding”) is the direct source for this lesson: the learned-vs-sinusoidal discussion, the formula, the extrapolation motivation, and the comparable-performance finding that led the authors to pick sinusoidal. Read §3.5 after this lesson; the formula will already be familiar and the trade-off will already make sense. The paper is short by modern standards (about 15 pages of content) and worth reading in full.

The Phase 1 lesson deliberately stops at the 2017 answer. The modern answer (what most current LLMs actually do) is covered in the Phase 2 lesson.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The position-embedding section gives a one-page reference for the same material, in the instructors’ dense visual style.

  • Sinusoidal frequency intuition. The formula uses a spectrum of frequencies across dimensions (the 10000^(2i/d_model) denominator). 3Blue1Brown’s Fourier series video is the gentlest entry point if you want to understand why a sum of sines and cosines at different frequencies can encode position richly across many scales.
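
To make the frequency spectrum concrete, here is a minimal sketch of the sinusoidal scheme from §3.5 of Vaswani et al.: even dimensions get sin(pos / 10000^(2i/d_model)) and odd dimensions get the matching cosine. The function name, its parameters, and the even-d_model restriction are assumptions made for illustration; this is not code from the paper or the lecture.

```python
import math


def sinusoidal_position_encoding(num_positions, d_model, base=10000.0):
    """Build a num_positions x d_model table of sinusoidal position encodings.

    Hypothetical helper for illustration; assumes d_model is even so that
    every sine dimension has a paired cosine dimension.
    """
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Larger i means a larger denominator, so the angle grows more
            # slowly with pos: later dimension pairs oscillate at lower frequency.
            angle = pos / (base ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # even dimension
            row[2 * i + 1] = math.cos(angle)  # odd dimension
        table.append(row)
    return table


pe = sinusoidal_position_encoding(num_positions=4, d_model=8)
```

Position 0 always encodes as alternating 0 and 1 (sin 0 and cos 0), and the first dimension pair changes fastest from one position to the next, while the last pair changes slowest. That range of scales is what the Fourier intuition above refers to.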

No community references have been selected for this lesson. The core material is well consolidated in the Vaswani et al. (2017) paper and in the subsequent literature covered in the Phase 2 lesson. Durable community references will be added at a future quarterly review if any emerge.