RoPE position embeddings: brief

What you’ll learn

This lesson closes, in Phase 2, the position-embeddings story that began in Phase 1. Phase 1 ended with the two original schemes from the 2017 transformer paper (learned and sinusoidal), both added to the input embedding. This lesson is about what changed when the field moved position information out of the input and into the attention computation itself. We cover the structural shift first (why “in the attention math” is cleaner than “added to the input and hope the property survives”), then two intermediate schemes (T5 relative bias and ALiBi, both of which add a position-dependent bias to the attention scores before the softmax), then the modern winner: RoPE, which rotates each query and key vector by an angle that depends on its position. We build the RoPE intuition from the 2D rotation matrix up. By the end you can read “uses RoPE” on a model card and know what that architectural choice is doing.
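As a preview of the 2D rotation intuition, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a production RoPE implementation: the helper names `rotate` and `score` are hypothetical, and the angle step `theta=0.1` is an arbitrary value chosen for the demo. Real RoPE applies this kind of rotation to many pairs of dimensions, each pair with its own frequency. The sketch rotates a query and a key by an angle proportional to their positions and shows that their dot product depends only on how far apart the two positions are.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2D vector by the angle position * theta (radians).

    theta is an illustrative constant, not a value taken from any model.
    """
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of a query and a key after each is rotated by its own position."""
    qx, qy = rotate(q, q_pos, theta)
    kx, ky = rotate(k, k_pos, theta)
    return qx * kx + qy * ky

q = (1.0, 2.0)   # toy 2D query
k = (0.5, -1.0)  # toy 2D key

# Same offset (3 positions apart) at very different absolute positions.
near = score(q, k, q_pos=5, k_pos=2)
far = score(q, k, q_pos=105, k_pos=102)

# A different offset (1 position apart).
other = score(q, k, q_pos=5, k_pos=4)

print(near, far, other)
```

The first two printed scores agree up to floating-point error because both pairs are three positions apart, while the third differs because its offset is different. That is the relative-position property the rest of the lesson builds on.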

Where this fits

This is the fourth lesson of Phase 2 (How models think, the transformer architecture) and the first of three lessons on what genuinely changed between the original 2017 transformer and modern LLMs. The other two are normalization (LayerNorm to RMSNorm and the pre-norm shift) and attention efficiency tricks (sliding windows, KV cache, MQA/GQA). The previous lesson in Phase 2 was The transformer block; the next is Layer norm and RMSNorm.

Before you start

Prerequisites: the Phase 1 position lesson (you need the “why position info is needed” framing and the original sinusoidal/learned schemes) and the attention lesson (you need to understand what query, key, and value vectors are, and what Q · K^T represents in self-attention). The structural shift only makes sense if you already know what attention is doing.

By the end, you’ll be able to

Identify the structural shift between input-added position embeddings and attention-injected ones, and explain why the field made it
Describe what T5 relative bias and ALiBi each add to the attention scores before the softmax, and what each gives up
Walk through what RoPE (rotary position embeddings) actually does in plain language, including the 2D rotation intuition
Explain why most modern LLMs picked RoPE over the alternatives

Time and difficulty

Read time: about 20 minutes
Practice time: about 15 minutes (a 2D rotation walk-through that shows why dot products of rotated vectors depend on relative position)
Difficulty: standard