
How modern models inject position into attention (RoPE)

This lesson is Phase 2's close to the position-embeddings story. Phase 1 ended with the two original schemes from the 2017 transformer paper (learned and sinusoidal), both added to the input embedding. This lesson is about what changed when the field moved position information out of the input and into the attention computation itself. We cover the structural shift first (why “in the attention math” is cleaner than “added to the input and hope the property survives”), then the two intermediate schemes (T5 relative bias and ALiBi, both of which add a bias to the attention scores before the softmax), then the modern winner: RoPE, which rotates each query and key vector by an angle that depends on its token's position. We build the RoPE intuition from the 2D rotation matrix up. By the end you can read “uses RoPE” on a model card and know exactly what that architectural choice is doing.
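As a preview of the 2D rotation idea, here is a minimal sketch in plain Python. It assumes a single 2D query/key pair and one made-up rotation rate `theta = 0.1` radians per position; the helper names `rotate` and `rope_score` are illustrative only. Real RoPE implementations apply the same idea to many 2D pairs of a vector's dimensions, each with its own frequency, so treat this as an intuition aid rather than a reference implementation.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of q and k after rotating each by (position * theta).

    `theta` is an assumed rotation rate chosen for illustration.
    """
    q_rot = rotate(q, q_pos * theta)
    k_rot = rotate(k, k_pos * theta)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

q = (1.0, 2.0)
k = (0.5, -1.5)

# Same offset (query 3 positions after the key), different absolute positions.
score_near = rope_score(q, k, q_pos=5, k_pos=2)
score_far = rope_score(q, k, q_pos=105, k_pos=102)

# A different offset (query 1 position after the key).
score_other = rope_score(q, k, q_pos=5, k_pos=4)

print(score_near, score_far, score_other)
```

The first two scores agree up to floating-point error because rotating both vectors by the same extra angle leaves their dot product unchanged, so only the difference between the two positions matters. The third score differs because the offset differs.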

This is the fourth lesson of Phase 2 (How models think, the transformer architecture) and the first of three lessons on what genuinely changed between the original 2017 transformer and modern LLMs. The other two are normalization (LayerNorm to RMSNorm and the pre-norm shift) and attention efficiency tricks (sliding windows, KV cache, MQA/GQA). The previous lesson in Phase 2 was The transformer block; the next is Layer norm and RMSNorm.

Prerequisites: the Phase 1 position lesson (you need the “why position info is needed” framing and the original sinusoidal/learned schemes) and the attention lesson (you need to understand what query, key, and value vectors are, and what Q · K^T represents in self-attention). The structural shift only makes sense if you already know what attention is doing.

  • Identify the structural shift between input-added position embeddings and attention-injected ones, and explain why the field made it
  • Describe what T5 relative bias and ALiBi each add to the attention scores before the softmax, and what each gives up
  • Walk through what RoPE (rotary position embeddings) actually does in plain language, including the 2D rotation intuition
  • Explain why most modern LLMs picked RoPE over the alternatives
  • Read time: about 20 minutes
  • Practice time: about 15 minutes (a 2D rotation walk-through that shows why dot products of rotated vectors depend on relative position)
  • Difficulty: standard