Cheatsheet: How modern models inject position into attention (RoPE)
The one idea that matters

What we care about: closer tokens should look more similar than distant tokens in the attention computation.

Original (Phase 1): add position to the input. Indirect.
Modern (this lesson): rotate position into the attention itself. Direct.

The structural shift

| Era | Approach | Why we moved on |
|---|---|---|
| 2017 transformer (Phase 1) | Add position vector to input embedding | Position info must flow through every layer for attention to reflect it. No guarantee the property survives cleanly. |
| Modern LLMs | Inject position into the attention computation directly (T5, ALiBi, RoPE) | What we care about is similarity behavior at the attention layer; put the signal there. |
The three modern schemes

| Scheme | Where it sits | Learnable? | Property | Used today? |
|---|---|---|---|---|
| T5 relative bias | Added inside attention softmax | Yes | Learned scalar bias b(m, n) bucketized by relative distance | Some encoder-decoder models |
| ALiBi | Added inside attention softmax | No | Deterministic linear bias by relative distance, slope hard-coded per head | Some open-source models |
| RoPE (rotary) | Rotates query and key inside attention | No | Dot product of rotated vectors depends only on relative position n - m | Most modern LLMs |
T5 relative bias, in brief

$$\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}} + b(m, n)\right)$$

The bias b(m, n) is a learned scalar that depends only on the relative distance m - n. Relative distances are bucketized (for example, distances 5-8 share one bias, 9-16 share the next, and so on). It works, but it carries the same overfitting risk as learned input-position embeddings.
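
To make the bucketing concrete, here is a minimal sketch of a bucketized bias added to one query's logits before the softmax. The bucket boundaries, the `BIAS_TABLE` values, and the function names are all hypothetical and chosen only to match the 5-8 / 9-16 example; the real T5 bucketing function and its learned values differ.

```python
import math

# Hypothetical learned biases, one scalar per bucket (made-up values).
BIAS_TABLE = [0.0, -0.1, -0.3, -0.6, -1.0, -1.5]


def bucket(distance):
    """Map a non-negative relative distance to a bucket index.

    Distances 0, 1, 2 get their own bucket; larger distances share
    buckets that double in width: 3-4, 5-8, 9-16, ...
    The last bucket absorbs everything beyond the table.
    """
    if distance < 3:
        return distance
    # (distance - 1).bit_length() equals ceil(log2(distance)) for distance >= 1.
    return min((distance - 1).bit_length() + 1, len(BIAS_TABLE) - 1)


def biased_attention_weights(logits, query_pos):
    """Softmax over one query's logits after adding b(m, n) to each."""
    biased = [
        score + BIAS_TABLE[bucket(abs(query_pos - key_pos))]
        for key_pos, score in enumerate(logits)
    ]
    peak = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in biased]
    total = sum(exps)
    return [e / total for e in exps]


# With identical raw logits, the weights reflect the bias alone.
weights = biased_attention_weights([0.0] * 10, query_pos=9)
```

Because every raw logit is equal here, any difference between the resulting weights comes from the bucketed bias, and keys that fall in the same bucket receive identical weight.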
ALiBi, in brief

Same structure as T5 relative bias, but the bias is deterministic: a simple linear penalty proportional to relative distance, with a hard-coded slope per attention head. No learned parameters. Cheap; extrapolates well; never became dominant.
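
A minimal sketch of the linear penalty follows. It assumes the geometric slope schedule commonly described for ALiBi when the head count is a power of two (slopes 2^(-8/H), 2^(-16/H), ...); the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Fixed per-head slopes forming a geometric sequence.

    Assumes num_heads is a power of two (the simple case).
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(slope, query_pos, key_pos):
    """Linear penalty added to the attention logit before the softmax."""
    return -slope * abs(query_pos - key_pos)


slopes = alibi_slopes(8)  # eight heads: 1/2, 1/4, ..., 1/256

# Bias row for head 0 with the query at position 4 and keys at 0..4.
row = [alibi_bias(slopes[0], 4, key_pos) for key_pos in range(5)]
```

Heads with steep slopes focus on nearby tokens, while heads with shallow slopes can still attend far back. Nothing in this computation is trained.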
RoPE, in one paragraph

Rotate the query for the token at position m by angle θ_m. Rotate the key for the token at position n by angle θ_n. The dot product of the rotated q and rotated k contains a factor R(θ_n - θ_m), the rotation matrix of the relative angle. Because each angle is proportional to position (θ_m = m · ω), the relative angle is (n - m) · ω, so the result depends only on n - m, not on absolute positions.
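
The relative-position property can be written out in a few lines. The derivation uses two standard rotation-matrix facts, R(θ)ᵀ = R(-θ) and R(α) · R(β) = R(α + β). As a notational assumption, q and k stand for a single 2D block with one frequency ω.

```latex
\begin{aligned}
\langle R(\theta_m)\,q,\; R(\theta_n)\,k \rangle
  &= q^\top R(\theta_m)^\top R(\theta_n)\, k \\
  &= q^\top R(-\theta_m)\, R(\theta_n)\, k \\
  &= q^\top R(\theta_n - \theta_m)\, k \\
  &= q^\top R\big((n - m)\,\omega\big)\, k
\end{aligned}
```

The absolute positions m and n appear only through their difference in the last line.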
The 2D rotation matrix

$$
R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
$$

| Property | Why it matters |
|---|---|
| Preserves vector magnitude | RoPE does not change “how big” the query/key vectors are, only their direction |
| Angles add under composition: R(α) · R(β) = R(α + β) | The dot-product-of-rotated-vectors property falls out of this |
| Extends to higher dimensions block-by-block | Real query/key vectors split into 2D blocks; each rotates independently at its own frequency |
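
The first two properties in the table can be checked numerically with a small sketch. The vectors `q` and `k`, the step `OMEGA`, and the helper names are arbitrary values chosen for illustration.

```python
import math


def rotate(vec, theta):
    """Apply the 2D rotation matrix R(theta) to vec = (x, y)."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


OMEGA = 0.1  # hypothetical angular step per position
q, k = (1.0, 2.0), (0.5, -1.5)


def score(m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return dot(rotate(q, m * OMEGA), rotate(k, n * OMEGA))


near = score(3, 5)      # positions 3 and 5, gap n - m = 2
far = score(103, 105)   # both shifted by 100, same gap

# Rotation changes direction only, not length.
length_before = math.hypot(*q)
length_after = math.hypot(*rotate(q, 1.234))
```

Up to floating-point rounding, `near` and `far` agree, and `length_before` equals `length_after`.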
Beyond 2D: block-by-block rotation

Real query and key vectors are high-dimensional (commonly 64 to 128 dimensions per attention head). RoPE splits them into 2D blocks and applies a rotation to each block independently. Each block uses a different angle, following a frequency pattern:

$$
\omega_i = 10000^{-2i/d}, \qquad \theta_{m,i} = m \cdot \omega_i
$$

Here d is the dimension of the query/key vector being rotated (the per-head dimension, not d_model), and i = 0, 1, ..., d/2 - 1 indexes the 2D blocks. Low-index blocks rotate quickly with position; high-index blocks rotate slowly. This is the same frequency intuition as the sinusoidal formula from Phase 1, now applied inside the attention rotation.
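
The block-by-block rotation can be sketched in plain Python as below. The function name `rope`, the example vectors, and the choice to pair adjacent elements are assumptions for illustration; some implementations instead pair element i with element i + d/2, which gives the same relative-position property with a different memory layout.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[0], vec[1]), (vec[2], vec[3]), ...

    Block i turns by position * base^(-2i/d), where d = len(vec).
    """
    d = len(vec)
    assert d % 2 == 0, "dimension must be even to form 2D blocks"
    out = []
    for i in range(d // 2):
        omega = base ** (-2.0 * i / d)  # block 0 is fastest, last block slowest
        theta = position * omega
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.4, -0.6, 0.9, 0.3, -0.2, 0.7, 0.1]

score_a = dot(rope(q, 2), rope(k, 7))    # gap n - m = 5
score_b = dot(rope(q, 40), rope(k, 45))  # same gap, different absolute positions
```

Since each block is an independent 2D rotation, the two scores match up to rounding, and the length of the vector is unchanged at any position.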
Why RoPE became dominant

| Property | Detail |
|---|---|
| Right behavior | Relative-distance-aware similarity, by construction |
| No learned parameters | Cannot overfit to training-set positional patterns |
| Cheap | Two element-wise multiplications per token per attention layer |
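| Extends to any sequence length | Angles are deterministic functions of position, so there is no position table to run out of (quality past the trained length is a separate question) |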
| Long-term decay built in | Upper bound on attention weight shrinks with relative distance (with some oscillation) |
Pitfalls to dodge

| Pitfall | Reality |
|---|---|
| Sinusoidal and RoPE are the same because both use sin/cos | No. Sinusoidal (Phase 1) adds vectors to the input; RoPE rotates vectors inside the attention computation. Same trig machinery, completely different architectural placement. |
| RoPE has learnable parameters | No. The rotation angles are deterministic functions of position and dimension index. |
| All modern LLMs use RoPE | Most do; not all. Some still use ALiBi or older schemes. Check the model card. |
| T5 and ALiBi failed | They worked. They just did not become dominant. RoPE won on the combination of right-behavior, no parameters, and cheap implementation. |
Translating model-card language

| Model-card phrase | What it means |
|---|---|
| “Uses RoPE” | Position info is rotated into attention via query/key rotation; relative-distance-aware similarity by construction |
| “ALiBi position embeddings” | Deterministic linear bias added inside the attention softmax; no learned parameters |
| “T5 relative position bias” | Learned scalar bias added inside the attention softmax, bucketized by relative distance |
| “Learned absolute position embeddings” | Older scheme (Phase 1); one trainable vector per position added to input; bounded by training-set max sequence length |
Glossary

- Position embedding (or position encoding): any scheme for telling the model where each token sits in the sequence.
- Structural shift: the move from adding position to the input (Phase 1) to injecting it directly into the attention computation (this lesson).
- T5 relative bias: learned scalar bias added inside the attention softmax, bucketized by relative distance.
- ALiBi (Attention with Linear Biases): deterministic linear bias added inside the attention softmax, slope hard-coded per attention head.
- RoPE (Rotary Position Embeddings): rotation of the query and key vectors by position-dependent angles inside the attention computation.
- Rotation matrix: the 2D matrix `[[cos θ, -sin θ], [sin θ, cos θ]]` that rotates a 2D vector by angle `θ` while preserving its magnitude.
- Relative position: the gap between two tokens (`n - m`), as opposed to their absolute positions in the sequence.
- Long-term decay: the property that the attention weight upper bound shrinks (with some oscillation) as relative distance grows; built into RoPE by construction.

Sinusoidal embeddings add position to the input.
RoPE rotates position into the attention itself.
The second is what modern LLMs actually use.