References: How modern models inject position into attention (RoPE)

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 2, Transformer-based models & tricks):
https://www.youtube.com/watch?v=yT84Y5zCnaA
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the modern position-embedding section of Stanford CME 295
Lecture 2 (T5 relative bias, ALiBi, and RoPE). The original 2017 schemes
(sinusoidal and learned input-position embeddings) are covered in the Phase 1
lesson "How models know word order," which is a prerequisite for this one.
Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lectures remain
with Stanford and the instructors.

This lesson picks up where the Phase 1 lesson leaves off. If you want the primary source for the 2017 sinusoidal and learned position-embedding schemes, see the references for the prerequisite lesson:

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”

Topics that build on or sit beside this one.

  • The geometric intuition for high-dimensional rotations. RoPE rotates the query and key vectors block by block: each pair of dimensions is treated as a 2D plane and rotated by an angle that depends on the token position. The intuition for why this composes cleanly comes from the math of orthogonal matrices, of which rotations are a special case. Search terms: “block-diagonal rotation matrix,” “SO(2) representation.”

  • Where to go next. The next lesson in this phase covers layer normalization (the second of the three places where modern transformers genuinely diverge from the 2017 paper). After that come the attention efficiency tricks (sliding window attention and the MHA to MQA to GQA progression). Both build on the architecture knowledge from Phase 1.
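
To make the block-diagonal rotation in the first item above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the helper names `rope_rotate` and `dot` are hypothetical, adjacent dimensions are paired as (0, 1), (2, 3), and so on (some libraries pair dimensions differently), and the frequency base of 10000 follows the common convention. The example vectors are arbitrary.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent 2D pair of `vec` by a position-dependent angle.

    Assumption: dimensions are paired as (0, 1), (2, 3), ...; pair i uses
    the angle position * base ** (-2 * i / d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2D rotation applied to one block of the vector.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Arbitrary example query and key vectors (illustrative values only).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Same relative offset (4) at two very different absolute positions.
score_a = dot(rope_rotate(q, 7), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 104), rope_rotate(k, 100))

print(score_a, score_b)
```

The two printed scores should agree up to floating-point error, because the product of the two block rotations depends only on the difference between the positions. Each block is orthogonal, so the rotation also leaves the length of the vector unchanged.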

The primary papers for the schemes covered in this lesson, in chronological order.

None selected for this lesson. The public discussion of position embeddings is mostly consolidated in the academic literature above. Durable references will be added here at a future quarterly review if any consolidate.