References: How models know word order
Source material
Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks): https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson adapts the position-embedding opening segment of Stanford CME 295 Lecture 2. The lecture's full treatment of position embeddings spans both the original 2017 schemes (covered here) and the modern attention-injected schemes (RoPE, T5 relative bias, ALiBi) covered in the Phase 2 lesson. Clawdemy splits the topic across two lessons because the modern schemes require understanding self-attention, which is taught in Phase 2.

Clawdemy provides original notes, summaries, and quizzes derived from this material for educational purposes. All rights to the original lectures remain with Stanford and the instructors.

Primary source
- “Attention Is All You Need”, Vaswani et al., 2017. The original transformer paper. Section 3.5 (“Positional Encoding”) is the direct source for this lesson: the learned-vs-sinusoidal discussion, the formula, the extrapolation motivation, and the finding that the two variants performed nearly identically, after which the authors chose sinusoidal for its potential to extrapolate to longer sequences. Read §3.5 after this lesson; the formula will already be familiar and the trade-off will already make sense. The paper is short by modern standards (about 15 pages in its arXiv version, including references) and worth reading in full.
Where the story continues
The Phase 1 lesson deliberately stops at the 2017 answer. The modern answer (what most current LLMs actually do) is in the Phase 2 lesson:
- How modern models inject position into attention (RoPE) covers the structural shift from input-added to attention-injected position schemes, the two intermediate steps (T5 relative bias and ALiBi), and the RoPE deep-dive. Requires the attention lesson as a prerequisite.
Adjacent topics
- Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The position-embedding section gives a one-page reference for the same material, in the instructors’ dense visual style.
- Sinusoidal frequency intuition. The formula uses a spectrum of frequencies across dimensions (the 10000^(2i/d_model) denominator). 3Blue1Brown’s Fourier series video is the gentlest entry point if you want to understand why combining sines and cosines at different frequencies can capture structure at many scales, which is the intuition behind encoding position this way.
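
To make the frequency spectrum concrete, here is a minimal sketch of the sinusoidal formula from §3.5 of the Vaswani et al. paper, written in plain Python. The function names (`sinusoidal_position_encoding`, `wavelength`) and the toy `d_model = 8` are illustrative choices for this page, not taken from the lecture or the paper.

```python
import math


def sinusoidal_position_encoding(pos: int, d_model: int) -> list[float]:
    """Encoding vector for one position, per Vaswani et al. (2017), section 3.5."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so dimensions pair up as (sin, cos)")
    encoding = []
    for i in range(d_model // 2):
        # Dimension pair i shares one frequency: 1 / 10000^(2i / d_model).
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))  # even dimension 2i
        encoding.append(math.cos(angle))  # odd dimension 2i + 1
    return encoding


def wavelength(i: int, d_model: int) -> float:
    """Number of positions dimension pair i needs to complete one full cycle."""
    return 2 * math.pi * 10000 ** (2 * i / d_model)


d_model = 8  # toy size chosen for readability (assumption); real models use far more
for i in range(d_model // 2):
    print(f"pair {i}: wavelength ~ {wavelength(i, d_model):.1f} positions")
```

With this toy size the four dimension pairs cycle roughly every 6.3, 62.8, 628.3, and 6283.2 positions. The early pairs change quickly and separate neighbouring tokens, while the later pairs change slowly and separate distant ones, which is the "many scales" idea the denominator is responsible for.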
Community discussion
None selected for this lesson. The core material is well-consolidated in the Vaswani 2017 paper and the subsequent literature covered in the Phase 2 lesson. Durable community references will be added at a future quarterly review if any emerge.