RoPE position embeddings: cheatsheet

The one idea that matters

What we care about: in the attention computation, nearby tokens should
score as more similar than distant ones.

Original (Phase 1): add position to the input. Indirect.
Modern (this lesson): rotate position into the attention itself. Direct.

The structural shift

Era	Approach	Why we moved on
2017 transformer (Phase 1)	Add position vector to input embedding	Position info must flow through every layer for attention to reflect it. No guarantee the property survives cleanly.
Modern LLMs	Inject position into the attention computation directly (T5, ALiBi, RoPE)	What we care about is similarity behavior at the attention layer; put the signal there.

The three modern schemes

Scheme	Where it sits	Learnable?	Property	Used today?
T5 relative bias	Added inside attention softmax	Yes	Learned scalar bias `b(m, n)` bucketized by relative distance	Some encoder-decoder models
ALiBi	Added inside attention softmax	No	Deterministic linear bias by relative distance, slope hard-coded per head	Some open-source models
RoPE (rotary)	Rotates query and key inside attention	No	Dot product of rotated vectors depends on position only through the relative offset `n - m`	Most modern LLMs

T5 relative bias, in brief

softmax( QK^T / sqrt(d) + b(m, n) )

The bias b(m, n) is a learned scalar that depends only on the relative distance m - n. Relative distances are bucketized (for example, distances 5-8 share one bias, 9-16 share the next, and so on). It works, but it carries the same overfitting risk as learned input-position embeddings.
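
The bucketing idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical power-of-two bucketing chosen only because it reproduces the example ranges above (5-8, 9-16); the helper names `bucket_of` and `relative_bias` are made up for this sketch, the random table stands in for learned values, and direction (m before or after n) is ignored for simplicity. The bucketing used by real T5 checkpoints differs in detail.

```python
import numpy as np

def bucket_of(distance: int, num_buckets: int = 8) -> int:
    """Map a relative distance to a bucket index (hypothetical scheme)."""
    d = abs(distance)                      # this sketch ignores direction
    if d == 0:
        return 0
    # (d - 1).bit_length() equals ceil(log2(d)) for d >= 1,
    # so 1 -> 1, 2 -> 2, 3-4 -> 3, 5-8 -> 4, 9-16 -> 5, ...
    return min((d - 1).bit_length() + 1, num_buckets - 1)

def relative_bias(seq_len: int, table: np.ndarray) -> np.ndarray:
    """Build the (seq_len, seq_len) matrix b(m, n) from a per-bucket table."""
    bias = np.empty((seq_len, seq_len))
    for m in range(seq_len):
        for n in range(seq_len):
            bias[m, n] = table[bucket_of(m - n, len(table))]
    return bias

rng = np.random.default_rng(0)
table = rng.normal(size=8)                 # stand-in for learned scalars
bias = relative_bias(20, table)            # added to QK^T / sqrt(d) before softmax
```

Every pair of positions whose distance falls in the same bucket reads the same entry of `table`, which is why the number of learned parameters stays small regardless of sequence length.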

ALiBi, in brief

Same structure as T5 relative bias, but the bias is deterministic: a simple linear penalty proportional to relative distance, with a hard-coded slope per attention head. No learned parameters. Cheap; extrapolates well; never became dominant.
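
A minimal sketch of such a bias, assuming a geometric slope schedule (1/2, 1/4, ..., 1/256 for 8 heads) and a symmetric distance |m - n|; the function names `alibi_slopes` and `alibi_bias` are illustrative, and a real model may compute slopes differently, so check its implementation.

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Fixed per-head slopes (assumed schedule): for 8 heads, 1/2 ... 1/256."""
    ratio = 2.0 ** (-8.0 / num_heads)
    return ratio ** np.arange(1, num_heads + 1)

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) bias equal to -slope * |m - n|."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])
    return -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=6, num_heads=8)  # added to the scores before softmax
```

Nothing here is trained: the whole bias tensor is a function of `seq_len` and `num_heads`, so it can be rebuilt for any sequence length.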

RoPE, in one paragraph

Rotate the query for the token at position m by angle θ_m. Rotate the key for the token at position n by angle θ_n. Because rotations compose by adding angles, the dot product of the rotated q and rotated k is qᵀ R(θ_n - θ_m) k, where R(θ_n - θ_m) is the rotation matrix of the relative angle. Position therefore enters the result only through n - m, not through the absolute positions.
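
The relative-position property can be checked numerically in 2D. A minimal sketch, assuming θ_m = m · ω with an arbitrary illustrative frequency ω = 0.1; the helper names `rot` and `score` are hypothetical.

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def score(q, k, m, n, omega=0.1):
    """Dot product of q rotated to position m and k rotated to position n."""
    return float((rot(m * omega) @ q) @ (rot(n * omega) @ k))

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

near = score(q, k, m=3, n=7)         # offset n - m = 4
shifted = score(q, k, m=103, n=107)  # same offset, different absolute positions
other = score(q, k, m=3, n=9)        # offset 6
```

`near` and `shifted` agree up to floating-point error because both pairs share the offset 4, while `other` differs because its offset is 6.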

The 2D rotation matrix

R(θ) = | cos θ   -sin θ |
       | sin θ    cos θ |

Property	Why it matters
Preserves vector magnitude	RoPE does not change “how big” the query/key vectors are, only their direction
Angles add under composition: `R(α) · R(β) = R(α + β)`	The dot-product-of-rotated-vectors property falls out of this
Extends to higher dimensions block-by-block	Real query/key vectors split into 2D blocks; each rotates independently at its own frequency
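
The first two properties in the table are easy to verify numerically. A small sketch with arbitrary test values (the vector `v` and angles `alpha`, `beta` are chosen only for illustration):

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

v = np.array([3.0, 4.0])
alpha, beta = 0.7, 1.9

# Magnitude is preserved: |R(alpha) v| == |v|
norm_before = np.linalg.norm(v)
norm_after = np.linalg.norm(rot(alpha) @ v)

# Angles add under composition: R(alpha) R(beta) == R(alpha + beta)
composed = rot(alpha) @ rot(beta)
direct = rot(alpha + beta)

# Consequence used by RoPE: R(alpha)^T R(beta) == R(beta - alpha)
relative = rot(alpha).T @ rot(beta)
```

The last line is the step that turns two absolute rotations into one rotation by the relative angle.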

Beyond 2D: block-by-block rotation

Real query and key vectors are high-dimensional (hundreds to thousands of dimensions). RoPE splits them into 2D blocks and applies a rotation to each block independently. Each block uses a different angle following a frequency pattern:

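ω_i = 10000^( -2i / d )
θ_m,i = m · ω_i

Here i = 0, 1, ..., d/2 - 1 indexes the 2D blocks, and d is the dimension of the query/key vector being rotated (in multi-head attention, the per-head dimension rather than d_model).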

Low-index blocks rotate quickly with position; high-index blocks rotate slowly. The same frequency intuition as the sinusoidal formula from Phase 1, now applied inside the attention rotation.
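
A minimal sketch of the block-by-block rotation, taking d to be the length of the vector being rotated and pairing consecutive entries (x[0], x[1]), (x[2], x[3]), and so on. The pairing is an assumption of this sketch: some implementations pair entry i with entry i + d/2 instead. The function names `rope_angles` and `rope_rotate` are illustrative.

```python
import numpy as np

def rope_angles(m: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Angles theta_{m,i} = m * base^(-2i/d) for blocks i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    return m * base ** (-2.0 * i / d)

def rope_rotate(x: np.ndarray, m: int) -> np.ndarray:
    """Rotate each consecutive pair (x[2i], x[2i+1]) by its own angle."""
    out = np.empty_like(x, dtype=float)
    for i, theta in enumerate(rope_angles(m, x.shape[0])):
        c, s = np.cos(theta), np.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * a - s * b
        out[2 * i + 1] = s * a + c * b
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

s1 = rope_rotate(q, 2) @ rope_rotate(k, 5)     # offset 3
s2 = rope_rotate(q, 40) @ rope_rotate(k, 43)   # offset 3, shifted by 38
```

`s1` and `s2` match up to floating-point error: every block contributes a term that depends only on the offset, so their sum does too.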

Why RoPE became dominant

Property	Detail
Right behavior	Relative-distance-aware similarity, by construction
No learned parameters	Cannot overfit to training-set positional patterns
Cheap	Two element-wise multiplications and one addition per query or key vector, per attention layer
Extends to any sequence length	Angles are deterministic functions of position, so they are defined for any position (quality beyond the trained length is a separate question)
Long-term decay built in	Upper bound on attention weight shrinks with relative distance (with some oscillation)
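
The "cheap" row can be made concrete: the rotation never needs an explicit matrix. A sketch under the same assumptions as the block formula (consecutive-pair layout, base 10000, d equal to the vector length); `rope_elementwise` and `rope_matrix` are hypothetical names, and the matrix version exists only as a reference to compare against.

```python
import numpy as np

def rope_elementwise(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to x (even length d) at position m without building matrices."""
    d = x.shape[0]
    theta = m * base ** (-2.0 * np.arange(d // 2) / d)
    cos = np.repeat(np.cos(theta), 2)        # [c0, c0, c1, c1, ...]
    sin = np.repeat(np.sin(theta), 2)        # [s0, s0, s1, s1, ...]
    # Per-pair partner vector: (a, b) -> (-b, a)
    partner = np.empty_like(x, dtype=float)
    partner[0::2] = -x[1::2]
    partner[1::2] = x[0::2]
    return x * cos + partner * sin           # two multiplies and one add

def rope_matrix(m: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Reference only: the explicit block-diagonal rotation matrix."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        t = m * base ** (-2.0 * i / d)
        c, s = np.cos(t), np.sin(t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

x = np.random.default_rng(1).normal(size=8)
fast = rope_elementwise(x, m=11)
slow = rope_matrix(11, 8) @ x
```

Both paths give the same vector; the element-wise one costs O(d) per vector instead of O(d²), and in practice the cos/sin tables can be computed once and reused.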

Pitfalls to dodge

Pitfall	Reality
Sinusoidal and RoPE are the same because both use sin/cos	No. Sinusoidal (Phase 1) adds vectors to the input; RoPE rotates vectors inside the attention computation. Same trig machinery, completely different architectural placement.
RoPE has learnable parameters	No. The rotation angles are deterministic functions of position and dimension index.
All modern LLMs use RoPE	Most do; not all. Some still use ALiBi or older schemes. Check the model card.
T5 and ALiBi failed	They worked. They just did not become dominant. RoPE won on the combination of right-behavior, no parameters, and cheap implementation.

Translating model-card language

Model-card phrase	What it means
“Uses RoPE”	Position info is rotated into attention via query/key rotation; relative-distance-aware similarity by construction
“ALiBi position embeddings”	Deterministic linear bias added inside the attention softmax; no learned parameters
“T5 relative position bias”	Learned scalar bias added inside the attention softmax, bucketized by relative distance
“Learned absolute position embeddings”	Older scheme (Phase 1); one trainable vector per position added to input; bounded by training-set max sequence length

Glossary

Position embedding (or position encoding): any scheme for telling the model where each token sits in the sequence.
Structural shift: the move from adding position to the input (Phase 1) to injecting it directly into the attention computation (this lesson).
T5 relative bias: learned scalar bias added inside the attention softmax, bucketized by relative distance.
ALiBi (Attention with Linear Biases): deterministic linear bias added inside the attention softmax, slope hard-coded per attention head.
RoPE (Rotary Position Embeddings): rotation of the query and key vectors by position-dependent angles inside the attention computation.
Rotation matrix: the 2D matrix [[cos θ, -sin θ], [sin θ, cos θ]] that rotates a 2D vector by angle θ while preserving its magnitude.
Relative position: the gap between two tokens (n - m), as opposed to their absolute positions in the sequence.
Long-term decay: the property that the attention weight upper bound shrinks (with some oscillation) as relative distance grows; built into RoPE by construction.

Sinusoidal embeddings add position to the input.
RoPE rotates position into the attention itself.
The second is what most modern LLMs actually use.