References: How modern models inject position into attention (RoPE)

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 2, Transformer-based models & tricks):
    https://www.youtube.com/watch?v=yT84Y5zCnaA
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the modern position-embedding section of Stanford CME 295
Lecture 2 (T5 relative bias, ALiBi, and RoPE). The original 2017 schemes
(sinusoidal and learned input-position embeddings) are covered in the Phase 1
lesson "How models know word order," which is a prerequisite for this one.
Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lectures remain
with Stanford and the instructors.

The original answer (Phase 1)

This lesson picks up where the Phase 1 lesson leaves off. If you want the primary source for the 2017 sinusoidal and learned position-embedding schemes, see the references for the prerequisite lesson:

The page “References: How models know word order” covers Vaswani et al. 2017 §3.5 (sinusoidal and learned embeddings) and the sinusoidal frequency intuition.

Going deeper

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”

“RoFormer: Enhanced Transformer with Rotary Position Embedding”, Su et al., 2021. The RoPE paper. Section 3 derives the result that the dot product of rotated vectors depends on relative position; the appendix has the long-term-decay upper bound the lesson mentions. Read after the RoPE section in this lesson; the math will be more accessible than it looks because you already know what it is trying to prove.
“Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”, Press et al., 2021. The ALiBi paper. It demonstrates empirically that a fixed, non-learned linear bias added to the attention scores before the softmax extrapolates to longer inputs better than sinusoidal position embeddings. Worth reading for the contrast with RoPE, which was published slightly earlier and went on to win the broader race.
“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, Raffel et al., 2020. The T5 paper. The relative-position-bias scheme is described in section 2.1; you can skip the rest unless you want the full T5 context (covered in a later lesson in this phase).
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The position-embedding section of the cheatsheet gives a one-page reference for the same material this lesson covers, in their dense visual style.
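
The linear bias described in the ALiBi entry above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the slope schedule is the geometric sequence the paper gives for head counts that are powers of two, and the raw query-key scores are assumed to be zero so that only the bias shapes the attention weights.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes n is a power of two)."""
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: entry [i][j] is slope * (j - i) for j <= i, -inf for future keys."""
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

slopes = alibi_slopes(8)             # one slope per head
bias = alibi_bias(4, slopes[0])      # bias matrix for the steepest head
# Raw scores are assumed to be all zero here, so the bias row is the score row.
weights = softmax(bias[3])           # attention of the last query over keys 0..3
```

Because the bias depends only on the distance between query and key, the same function applies unchanged at any sequence length, which is the property the paper's extrapolation results rely on. In `weights`, nearer keys receive more attention than distant ones.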

Adjacent topics

Topics that build on or sit beside this one.

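The geometric intuition for high-dimensional rotations. RoPE rotates the query and key vectors block by block: each vector is split into 2D pairs of coordinates, and every pair is rotated by its own position-dependent angle. The intuition for why this composes cleanly comes from the math of orthogonal matrices, of which rotations are a special case: rotating a 2D pair by one angle and then by another is the same as rotating it once by their sum. Search terms: “block-diagonal rotation matrix,” “SO(2) representation.”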
Where to go next. The next lesson in this phase covers layer normalization (the second of the three places where modern transformers genuinely diverge from the 2017 paper). After that, attention efficiency tricks (sliding window attention and the MHA to MQA to GQA progression). Both build on the architecture knowledge from Phase 1.
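
The block-by-block rotation from the geometric-intuition entry above can be checked numerically. The sketch below is an illustration under stated assumptions, not a reference implementation: the function names `rope` and `dot` are hypothetical, consecutive coordinates are paired as in the RoFormer paper (some libraries pair the two halves of the vector instead), the frequency base is 10000, and the toy vectors are arbitrary.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d), i even."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # pair i/2 uses frequency base^(-2(i/2)/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]    # toy query vector (arbitrary values)
k = [1.0, 0.4, -0.6, 0.9]    # toy key vector (arbitrary values)

score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))    # positions 105 and 102, offset 3
score_other = dot(rope(q, 5), rope(k, 4))      # positions 5 and 4, offset 1
```

`score_near` and `score_far` agree up to floating-point error because both pairs of positions are three apart, while `score_other` differs because the offset changed. The rotation also leaves vector length untouched, which is the orthogonal-matrix property the entry points to.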

Original sources

The primary papers for the schemes covered in this lesson, in chronological order.

“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, Raffel et al., 2020. T5; relative position bias.
“RoFormer: Enhanced Transformer with Rotary Position Embedding”, Su et al., 2021. RoPE.
“Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”, Press et al., 2021. ALiBi.

Community discussion

None selected for this lesson. The public discussion of position embeddings is mostly consolidated in the academic literature above. If durable community references emerge, they will be added here at a future quarterly review.