Practice: How modern models inject position into attention (RoPE)

Self-check

A short retrieval pass. Answer in your head (or on paper) before opening the collapsible.

1. Phase 1 covered the original 2017 answer: adding a sinusoidal or learned position vector to the input. What structural shift separates modern schemes from that approach, and what motivated it?

Show answer

What we actually care about is making closer tokens more similar than distant tokens in the attention computation, where the token comparison happens. Adding position to the input is indirect: the position information has to flow through the network for the attention computation downstream to reflect it, and there is no guarantee the property survives every layer cleanly. A cleaner approach is to put the position signal directly into the attention math. That is the shift the field made.

2. T5 relative bias and ALiBi sit between sinusoidal and RoPE on the timeline. What did each do?

Show answer

T5 relative bias added a learned scalar bias b(m, n) inside the attention softmax: softmax(QK^T/sqrt(d) + b). The bias was bucketized by relative distance and learned during training. Worked, but kept the learned-parameter overfitting risk.

ALiBi (Attention with Linear Biases) did the same thing deterministically. The bias was a simple linear function of relative distance, with a hard-coded slope per attention head. No learnable parameters. Cheap; never became dominant.
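To make the "linear function of relative distance, with a hard-coded slope per head" concrete, here is a minimal sketch in plain Python. The helper names (alibi_slopes, alibi_bias_row) are hypothetical, and the geometric slope sequence is an assumption taken from the ALiBi paper's recipe for power-of-two head counts, not something stated above.

```python
def alibi_slopes(num_heads):
    """One fixed slope per head: 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumption: the geometric sequence from the ALiBi paper for
    power-of-two head counts. Nothing here is learned.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias_row(query_pos, slope):
    """Bias added to the attention scores of one query (causal setting).

    Only keys at or before the query are listed. Each entry is
    slope * (key_pos - query_pos), so it is 0 for the query's own
    position and grows more negative with distance.
    """
    return [slope * (key_pos - query_pos) for key_pos in range(query_pos + 1)]


slopes = alibi_slopes(8)
row = alibi_bias_row(3, slopes[0])
print(slopes[:3])  # [0.5, 0.25, 0.125]
print(row)         # [-1.5, -1.0, -0.5, 0.0]
```

The row is what gets added to QK^T/sqrt(d) before the softmax for that query and head. Heads with small slopes penalize distance gently; heads with large slopes focus on nearby tokens.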

3. In one paragraph, what does RoPE actually do?

Show answer

RoPE rotates the query and key vectors by an angle that depends on their position. The query for the token at position m is rotated by angle θ_m; the key for the token at position n is rotated by angle θ_n. When you compute the dot product of the rotated vectors, the two rotations combine into a single factor R(θ_n - θ_m), so position enters the result only through the relative angle between the rotations, which is itself a function of the relative position n - m. This is the same property the sinusoidal embeddings had (relative-distance-aware similarity), but now baked directly into the attention dot product instead of injected at the input layer.
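The relative-position claim can be written out in three lines. This is a sketch for the 2D case, assuming R(θ) is the standard 2D rotation matrix and using two facts about rotations: the transpose of a rotation is the rotation by the opposite angle, and composing two rotations adds their angles.

```latex
\begin{aligned}
\langle R(\theta_m)\,q,\; R(\theta_n)\,k \rangle
  &= q^{\top} R(\theta_m)^{\top} R(\theta_n)\, k \\
  &= q^{\top} R(-\theta_m)\, R(\theta_n)\, k \\
  &= q^{\top} R(\theta_n - \theta_m)\, k
\end{aligned}
```

The absolute angles θ_m and θ_n have dropped out; only their difference remains. If θ_m is proportional to m, that difference is proportional to n - m.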

4. Why did RoPE win over the alternatives?

Show answer

Three reasons. (1) The property is right. Relative-distance-aware similarity falls out of the math, not out of learned parameters that can overfit. (2) The math has a long-term decay. A formal upper bound on the attention weight as a function of relative distance shows it shrinks (with some oscillation) as positions get further apart, matching the intuition we wanted. (3) It is cheap and extensible. Two element-wise multiplications per token per attention layer, no learnable parameters, and the rotation formula works for any sequence length.
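The long-term decay in reason (2) can be illustrated numerically. The sketch below is not the formal upper bound; it is one hand-picked case, under these assumptions: each 2D block of both q and k is (1, 0), block i rotates at frequency base^(-2i/dim), and dim = 128 with base = 10000. The function name rope_score is hypothetical.

```python
import math


def rope_score(rel_dist, dim=128, base=10000.0):
    """Dot product of RoPE-rotated q and k at relative distance rel_dist.

    Assumption: every 2D block of q and k is (1, 0), so block i
    contributes cos(rel_dist * theta_i) with theta_i = base^(-2i/dim).
    """
    total = 0.0
    for i in range(dim // 2):
        theta_i = base ** (-2.0 * i / dim)
        total += math.cos(rel_dist * theta_i)
    return total


for r in (0, 1, 4, 16, 64, 256):
    print(r, round(rope_score(r), 2))
```

At distance 0 every block contributes 1, so the score is dim / 2 = 64. As the distance grows, the fast-rotating blocks fall out of phase first and the total drops, with some oscillation along the way rather than a strictly monotone decline.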

5. What is the most common pitfall in distinguishing sinusoidal embeddings (Phase 1) from RoPE?

Show answer

Conflating them because both use sines and cosines. The math overlaps; the architectural placement is completely different. Sinusoidal embeddings are vectors added to the input before the first attention layer (Phase 1 material). RoPE is a rotation applied to the query and key vectors inside every attention layer. Same trigonometric machinery, different place in the architecture.

Try it yourself: 2D rotation walk-through

This exercise puts the RoPE intuition into practice. About 15 minutes.

Side effects: none. Pen and paper, or a text editor.

Setup: you have a 2D query vector q = (1, 0) (a unit vector pointing along the positive x-axis). The token sits at position m. You want to apply RoPE with rotation angle θ_m = m * (π / 4) (so position 1 rotates by 45°, position 2 rotates by 90°, position 3 rotates by 135°, and so on).

The rotation matrix is:

R(θ) = | cos θ   -sin θ |
       | sin θ    cos θ |

Part one: rotate the query at three different positions.

Compute the rotated q at positions m = 1, m = 2, and m = 4. Show the output as a 2D vector for each. (Useful values: cos(π/4) = sin(π/4) ≈ 0.707; cos(π/2) = 0, sin(π/2) = 1; cos(π) = -1, sin(π) = 0.)

Show answer

At m = 1 (rotation angle π/4, or 45°): R(π/4) · (1, 0) = (cos(π/4), sin(π/4)) = (0.707, 0.707). The vector now points at 45° above the x-axis.

At m = 2 (rotation angle π/2, or 90°): R(π/2) · (1, 0) = (cos(π/2), sin(π/2)) = (0, 1). The vector now points straight up the y-axis.

At m = 4 (rotation angle π, or 180°): R(π) · (1, 0) = (cos(π), sin(π)) = (-1, 0). The vector now points along the negative x-axis.

The rotation preserves magnitude (all three rotated vectors have length 1, same as the original) and changes only direction.
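If you want to check your arithmetic for part one, the few lines below reproduce it. This is a minimal sketch in plain Python; rotate_2d is a hypothetical helper that just multiplies by the R(θ) matrix from the setup.

```python
import math


def rotate_2d(vec, theta):
    """Multiply a 2D vector by the rotation matrix R(theta)."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


q = (1.0, 0.0)
rotated = {}
for m in (1, 2, 4):
    rotated[m] = rotate_2d(q, m * math.pi / 4)  # theta_m = m * (pi / 4)
    rx, ry = rotated[m]
    # position, x, y, length (rounded to hide floating-point noise)
    print(m, round(rx, 3), round(ry, 3), round(math.hypot(rx, ry), 3))

# 1 0.707 0.707 1.0
# 2 0.0 1.0 1.0
# 4 -1.0 0.0 1.0
```

The last column is the vector length, which stays at 1 for every position.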

Part two: dot product depends on relative position.

Now suppose you have a key vector k = (1, 0) (same direction as the original query). Apply RoPE to k at position n, using the same angle formula θ_n = n * (π / 4).

Compute the dot product of the rotated q (at position m) with the rotated k (at position n) for the following pairs: (m=1, n=2), (m=3, n=4), (m=10, n=11), and (m=1, n=5).

Show answer

For two unit vectors that started in the same direction, after RoPE the dot product is cos(θ_n - θ_m) = cos((n - m) * π / 4). The dot product depends only on the relative position n - m, not on the absolute values.

(m=1, n=2): relative position 1, dot product = cos(π/4) ≈ 0.707
(m=3, n=4): relative position 1, dot product = cos(π/4) ≈ 0.707 (same as above)
(m=10, n=11): relative position 1, dot product = cos(π/4) ≈ 0.707 (still the same)
(m=1, n=5): relative position 4, dot product = cos(π) = -1 (much further apart, much lower similarity)

The first three pairs are all “one position apart” and produce identical similarity, regardless of where in the sequence they sit. The fourth pair is four positions apart and produces a much lower (negative) similarity. This is the relative-distance-aware property RoPE is designed to give you, falling directly out of the trigonometric identity.
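The same check for part two, again as a minimal plain-Python sketch. The helpers rotate_2d and rope_dot are hypothetical names; the angle step π/4 is the one from the exercise setup.

```python
import math


def rotate_2d(vec, theta):
    """Multiply a 2D vector by the rotation matrix R(theta)."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def rope_dot(q, k, m, n, step=math.pi / 4):
    """Dot product of q rotated to position m and k rotated to position n."""
    qr = rotate_2d(q, m * step)
    kr = rotate_2d(k, n * step)
    return qr[0] * kr[0] + qr[1] * kr[1]


q = (1.0, 0.0)
k = (1.0, 0.0)
for m, n in [(1, 2), (3, 4), (10, 11), (1, 5)]:
    print((m, n), round(rope_dot(q, k, m, n), 3))

# (1, 2) 0.707
# (3, 4) 0.707
# (10, 11) 0.707
# (1, 5) -1.0
```

One caveat worth trying yourself: with this single angle step the similarity is periodic, so a pair eight positions apart, such as (m=1, n=9), comes back to cos(2π) = 1. Real RoPE rotates different 2D blocks at many different frequencies, so the blocks do not all realign at the same offset.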

Sanity check: the goal of this exercise is to feel the rotation property in your hands. Once you have computed three rotations and the dot products at four different position pairs, the abstract “RoPE makes attention depend on relative position” claim becomes mechanical.

Flashcards

Ten cards covering the structural shift and RoPE.

Q. What is the structural shift from original position-embedding schemes to modern ones?

The original schemes (sinusoidal, learned) added position information to the input embedding before the first attention layer. Modern schemes (T5 relative bias, ALiBi, RoPE) inject it directly into the attention computation, where the relative-similarity property actually needs to show up. Indirect input-side addition → direct attention-side injection.

Q. What does T5 relative bias do?

Adds a learned scalar bias b(m, n) inside the attention softmax: softmax(QK^T/sqrt(d) + b). The bias is bucketized by relative distance and learned during training. Works; carries the same overfitting risk as learned input-position embeddings.

Q. What does ALiBi do?

Same idea as T5 relative bias, but deterministic. The bias inside the softmax is a simple linear function of relative distance, with a hard-coded slope per attention head. No learnable parameters, cheap, extrapolates well. Earned a place in the literature; never became dominant.

Q. In one sentence, what is RoPE?

Rotary Position Embeddings: rotate the query and key vectors by position-dependent angles inside the attention computation, so the dot product of the rotated vectors depends on the relative position between the tokens.

Q. What does the rotation matrix do in 2D?

The matrix R(θ) = [[cos θ, -sin θ], [sin θ, cos θ]] rotates a 2D vector by angle θ while preserving its length. Magnitude unchanged, direction shifted. Multiplying a vector in polar form r * (cos φ, sin φ) by R(θ) gives r * (cos(φ + θ), sin(φ + θ)).

Q. How does RoPE handle vectors with more than 2 dimensions?

It applies the rotation block by block. Split the high-dimensional vector into 2D pieces (a 768-dim vector becomes 384 separate 2D blocks) and rotate each block by its own angle. The angles typically follow the same frequency pattern as the sinusoidal formula: block i at position m rotates by m * 10000^(-2i/d), where d is the vector's dimension, so low-index blocks rotate quickly with position and high-index blocks rotate slowly.
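A minimal sketch of the block-by-block rotation, in plain Python. Several things here are assumptions made for illustration: the function names rope and dot, the pairing of consecutive dimensions (2i, 2i+1) into blocks (some implementations pair dimension i with i + d/2 instead), the frequency base^(-2i/d) with base 10000, and the toy 8-dim vectors.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by pos * base^(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)  # block 0 rotates fastest
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [c * x - s * y, s * x + c * y]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5, -0.4, 1.1, 0.9, -0.7]
k = [1.0, 0.2, -0.6, 0.4, 0.7, -0.3, 0.1, 0.5]

near = dot(rope(q, 3), rope(k, 7))      # positions 3 and 7, offset 4
far = dot(rope(q, 103), rope(k, 107))   # positions 103 and 107, offset 4
print(near, far)  # equal up to floating-point error
```

Both pairs are four positions apart, so the two scores agree even though the absolute positions differ by 100. Each block is a pure rotation, so the length of the vector is unchanged as well.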

Q. Why did RoPE win over the alternatives?

The property is right (relative-distance-aware similarity, by construction). No learnable parameters that can overfit. Cheap implementation (two element-wise multiplications per token per attention layer). Extends to any sequence length. The math has a long-term decay built into the upper bound on attention weights, matching the intuition we wanted.

Q. What is the most common pitfall in distinguishing sinusoidal embeddings from RoPE?

Conflating them because both use sines and cosines. Sinusoidal (Phase 1) adds vectors to the input before the first attention layer. RoPE rotates vectors inside every attention layer. Same trigonometric machinery, completely different architectural placement.

Q. Does RoPE have learnable parameters?

No. The rotation angles are deterministic functions of position and dimension index, set by the same frequency formula the sinusoidal scheme used. Nothing about the position-embedding component itself is trained.

Q. What are all these schemes trying to encode, at root?

Relative-distance-aware similarity in the attention computation: closer tokens should attend to each other more than distant tokens, by construction. Sinusoidal, T5 relative bias, ALiBi, RoPE all aim at the same property; they differ in where they put the signal and whether it requires learning.