Summary: How modern models inject position into attention (RoPE)
Phase 1 covered the original answer. The 2017 transformer paper put position information into the model by adding a position vector (sinusoidal) to the token embedding before the first attention layer. That worked. It is still in the textbooks. Most modern LLMs do something different.
The structural shift: what we actually care about is making closer tokens more similar than distant tokens in the attention computation. Adding position to the input is indirect; injecting it into the attention math is direct. The field made that shift.
This summary is the scan-it-in-five-minutes version. The full lesson derives the structural shift from first principles, covers the two intermediate schemes (T5 relative bias and ALiBi), and builds RoPE from the 2D rotation matrix up.
Core ideas
- The load-bearing intuition. Closer tokens should be more similar than distant tokens, in the attention computation. Every position-embedding scheme aims at that property; they differ in how they achieve it and where they put the signal.
- The structural shift. The original sinusoidal scheme adds position to the input and hopes the property propagates through the network. Modern schemes inject the signal directly into the attention computation, at `Q · K^T`, where the comparison between tokens actually happens.
- Intermediate scheme 1, T5 relative bias. Add a learned scalar bias `b(m, n)` inside the attention softmax: `softmax(QK^T/sqrt(d) + b)`. The bias is bucketized by relative distance and learned during training. Works; carries the same overfitting risk as learned input-position embeddings.
- Intermediate scheme 2, ALiBi. Same idea, deterministic. The bias is a simple linear function of relative distance, with a hard-coded slope per attention head. No learnable parameters. Cheap; never became dominant.
- The modern winner, RoPE (rotary position embeddings). Rotate the query and key vectors by an angle that depends on their position. The dot product of two rotated vectors depends on relative position, by the same trigonometric machinery the sinusoidal scheme used in Phase 1 (the `cos(a-b)` identity). The position signal is now baked directly into the attention dot product instead of injected at the input.
- The 2D intuition. Multiplying a 2D vector by the rotation matrix `R(θ) = [[cos θ, -sin θ], [sin θ, cos θ]]` rotates it by angle `θ` while preserving its length. RoPE rotates the query at position `m` by angle `θ_m` and the key at position `n` by angle `θ_n`; the dot product of the rotated vectors contains a factor `R(θ_n - θ_m)` and depends only on the gap between rotations, not their absolute values.
- Beyond 2D. Real query and key vectors are high-dimensional. RoPE applies the rotation block by block, in 2D pieces of the larger vector. Each block rotates by a different angle following a frequency pattern related to the original sinusoidal `10000^(2i/d_model)` formula.
- Why RoPE won. Right property (relative-distance-aware similarity by construction), no learned parameters that can overfit, cheap implementation (two element-wise multiplications per token per attention layer), extends to any sequence length, long-term decay built into the upper bound on attention weights.
- Pitfall: conflating sinusoidal and RoPE. Both use sines and cosines. They are not the same scheme. Sinusoidal (Phase 1) adds vectors to the input; RoPE (this lesson) rotates vectors inside the attention computation. Same trig machinery, different architectural placement.
- Pitfall: thinking RoPE has learned parameters. It does not. The rotation angles are deterministic functions of position and dimension index.
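The additive-bias idea behind the two intermediate schemes can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation of either scheme. The function names `alibi_bias` and `attention_weights` are hypothetical. The slopes use the geometric sequence from the ALiBi paper and assume a power-of-two head count. The bias here is symmetric in `|m - n|`, and causal masking is left out to keep the sketch short.

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) bias equal to -slope_h * |m - n|."""
    # Geometric slopes 2^(-8/H), 2^(-16/H), ... (assumes H is a power of two).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])  # |m - n| for every pair
    return -slopes[:, None, None] * distance[None, :, :]

def attention_weights(q: np.ndarray, k: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Compute softmax(q k^T / sqrt(d) + bias) over the key axis."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d) + bias
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
num_heads, seq_len, d = 4, 6, 8
q = rng.normal(size=(num_heads, seq_len, d))
k = rng.normal(size=(num_heads, seq_len, d))

bias = alibi_bias(num_heads, seq_len)
weights = attention_weights(q, k, bias)
```

A T5-style variant would keep `attention_weights` unchanged and swap the fixed `-slope * distance` table for a learned value per distance bucket, which is where the overfitting risk comes from.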
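The RoPE bullets above can be checked numerically with a small NumPy sketch. It is an illustration under stated assumptions, not the code of any particular model: `rope_angles` and `apply_rope` are hypothetical names, the base of 10000 and the per-block frequency `base^(-2i/head_dim)` follow the common convention, and adjacent dimensions are paired into 2D blocks (some implementations pair dimensions differently). Nothing in it is learned.

```python
import numpy as np

def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """One angle per 2D block: position * base^(-2i/head_dim), i = 0 .. head_dim/2 - 1."""
    i = np.arange(head_dim // 2)
    return position * base ** (-2.0 * i / head_dim)

def apply_rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate the pairs (x[0], x[1]), (x[2], x[3]), ... by their block's angle."""
    theta = rope_angles(position, x.shape[-1], base)
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin  # first row of R(theta)
    out[1::2] = x_even * sin + x_odd * cos  # second row of R(theta)
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)  # assumed head dimension of 8 (must be even)
k = rng.normal(size=8)

# Same gap (n - m = 3) at two very different absolute positions.
score_near = apply_rope(q, 2) @ apply_rope(k, 5)
score_far = apply_rope(q, 102) @ apply_rope(k, 105)

# Rotation preserves vector length.
norm_before = np.linalg.norm(q)
norm_after = np.linalg.norm(apply_rope(q, 7))
```

Up to floating-point error, `score_near` and `score_far` agree because only the gap between positions survives in the dot product, and `norm_before` matches `norm_after` because each 2D block is a pure rotation.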
What changes for you
After Phase 1, you knew how the 2017 transformer put position info in. After this lesson, you know what the field changed and why. When a model card mentions “uses RoPE,” you know that means position is rotated into the attention computation itself: not added to the input, not learned from data, just baked into the math at the layer where it matters. The next two lessons in this phase cover the other two places where modern LLMs genuinely diverge from 2017: normalization (LayerNorm to RMSNorm and the pre-norm shift) and attention efficiency tricks (sliding windows and the MQA/GQA progression).
Sinusoidal embeddings add position to the input.
RoPE rotates position into the attention itself.
The second is what modern LLMs actually use.