
Cheatsheet: How modern models inject position into attention (RoPE)

What we care about: in the attention computation, closer tokens should
score as more similar than distant tokens.
Original (Phase 1): add position to the input. Indirect.
Modern (this lesson): rotate position into the attention itself. Direct.
| Era | Approach | Why we moved on |
| --- | --- | --- |
| 2017 transformer (Phase 1) | Add position vector to input embedding | Position info must flow through every layer for attention to reflect it. No guarantee the property survives cleanly. |
| Modern LLMs | Inject position into the attention computation directly (T5, ALiBi, RoPE) | What we care about is similarity behavior at the attention layer; put the signal there. |

| Scheme | Where it sits | Learnable? | Property | Used today? |
| --- | --- | --- | --- | --- |
| T5 relative bias | Added inside attention softmax | Yes | Learned scalar bias b(m, n) bucketized by relative distance | Some encoder-decoder models |
| ALiBi | Added inside attention softmax | No | Deterministic linear bias by relative distance, slope hard-coded per head | Some open-source models |
| RoPE (rotary) | Rotates query and key inside attention | No | Dot product of rotated vectors depends only on relative position n - m | Most modern LLMs |

T5 relative bias:

softmax( QK^T / sqrt(d) + b(m, n) )

The bias b(m, n) is a learned scalar added to the attention logit for query position m and key position n. It depends only on the relative distance n - m. Relative distances are bucketized (all distances 5-8 share one bias, 9-16 share the next, etc.). It works, but it carries the same overfitting risk as learned input-position embeddings.
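The bucketized bias described above can be sketched in a few lines of NumPy. This is a minimal illustration, not T5's actual implementation: the `bucket` function is a hypothetical log-spaced scheme chosen only to reproduce the "5-8 share one bias, 9-16 share the next" pattern, it ignores the sign of the distance, and the random `table` stands in for parameters that would be learned in training.

```python
import numpy as np

def bucket(distance: int) -> int:
    """Hypothetical log-spaced bucketing (an assumption for illustration):
    distances 0-4 each get their own bucket; 5-8, 9-16, 17-32, ... share one."""
    d = abs(distance)
    if d <= 4:
        return d
    # (d - 1).bit_length() equals ceil(log2(d)) for d >= 1
    return 2 + (d - 1).bit_length()

def relative_bias(seq_len: int, table: np.ndarray) -> np.ndarray:
    """Build the (seq_len, seq_len) matrix of b(m, n) by bucket lookup."""
    last = len(table) - 1  # distances beyond the table share the last bucket
    idx = np.array([[min(bucket(n - m), last) for n in range(seq_len)]
                    for m in range(seq_len)])
    return table[idx]

def attention_weights(Q: np.ndarray, K: np.ndarray, table: np.ndarray) -> np.ndarray:
    """softmax( QK^T / sqrt(d) + b(m, n) ), applied row by row."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + relative_bias(Q.shape[0], table)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
table = rng.normal(size=8)        # stand-in for the learned scalars, one per bucket
Q = rng.normal(size=(20, 16))
K = rng.normal(size=(20, 16))
W = attention_weights(Q, K, table)
```

Because the bias is looked up from the distance alone, any two (query, key) pairs with the same gap receive the same scalar, whatever their absolute positions.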

ALiBi has the same structure as T5 relative bias, but the bias is deterministic: a simple linear penalty proportional to relative distance, with a hard-coded slope per attention head. No learned parameters. Cheap; extrapolates well; never became dominant.
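A linear, per-head bias of this kind might look like the sketch below. The geometric slope schedule (2^(-8/H), 2^(-16/H), ...) is an assumption based on the schedule commonly cited for ALiBi with a power-of-two number of heads; the function names are hypothetical. The symmetric distance is a simplification: under a causal mask only keys at or before the query matter.

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Assumed geometric schedule: head h (1-based) gets slope 2^(-8h / num_heads)."""
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) bias to add to the attention logits.
    Each entry is -slope * |m - n|: no learned parameters anywhere."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])            # (L, L)
    return -alibi_slopes(num_heads)[:, None, None] * distance  # (H, L, L)

bias = alibi_bias(seq_len=6, num_heads=8)
```

Heads with a steep slope penalize distant tokens heavily and so focus locally; heads with a shallow slope can still attend far back.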

RoPE rotates the query for the token at position m by angle θ_m and the key for the token at position n by angle θ_n. The dot product of the rotated q and rotated k contains a factor R(θ_n - θ_m), the rotation matrix of the relative angle. Since θ_m is proportional to m, the result depends only on n - m, not on absolute positions.

R(θ) = | cos θ   -sin θ |
       | sin θ    cos θ |

| Property | Why it matters |
| --- | --- |
| Preserves vector magnitude | RoPE does not change “how big” the query/key vectors are, only their direction |
| Angles add under composition: R(α) · R(β) = R(α + β) | The dot-product-of-rotated-vectors property falls out of this |
| Extends to higher dimensions block-by-block | Real query/key vectors split into 2D blocks; each rotates independently at its own frequency |
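The relative-position property follows from the rotation identities: (R(θ_m) q) · (R(θ_n) k) = q^T R(θ_m)^T R(θ_n) k = q^T R(θ_n - θ_m) k. The 2D sketch below checks this numerically. The helper names and the single frequency `omega = 0.1` are illustrative assumptions, with θ_m = m · omega.

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

def rotated_score(q, k, m: int, n: int, omega: float = 0.1) -> float:
    """Dot product of the query rotated for position m and the key rotated for position n."""
    return (rot(m * omega) @ q) @ (rot(n * omega) @ k)

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

near_start = rotated_score(q, k, m=3, n=7)      # gap n - m = 4
far_along = rotated_score(q, k, m=103, n=107)   # same gap, shifted by 100 positions
```

Up to floating-point error, `near_start` and `far_along` agree, and both match `q @ rot(4 * 0.1) @ k`: only the gap of 4 enters the score.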

Real query and key vectors are high-dimensional (commonly 64 to 128 dimensions per attention head, which is where the rotation is applied). RoPE splits them into 2D blocks and applies a rotation to each block independently. Each block uses a different angle following a frequency pattern:

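ω_i = 10000^( -2i / d ),  i = 0, 1, ..., d/2 - 1
θ_m,i = m · ω_i

Here d is the dimension of the query/key vector being rotated (the per-head dimension in multi-head attention), not the model width d_model, and i indexes the 2D blocks.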

Low-index blocks rotate quickly with position; high-index blocks rotate slowly. The same frequency intuition as the sinusoidal formula from Phase 1, now applied inside the attention rotation.
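The block-by-block rotation can be sketched as follows. This is a minimal single-vector version under stated assumptions: `d` is the length of the query/key vector being rotated, adjacent dimensions (0, 1), (2, 3), ... form the 2D blocks, and the function names are hypothetical. Some implementations pair dimension j with j + d/2 instead; the relative-position property holds either way as long as queries and keys use the same pairing.

```python
import numpy as np

def rope_frequencies(d: int, base: float = 10000.0) -> np.ndarray:
    """One frequency per 2D block: omega_i = base^(-2i / d), i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    return base ** (-2.0 * i / d)

def apply_rope(x: np.ndarray, position: int) -> np.ndarray:
    """Rotate each adjacent pair (x[2i], x[2i+1]) by angle position * omega_i."""
    theta = position * rope_frequencies(x.shape[-1])
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin   # first row of R(theta) per block
    out[1::2] = x_even * sin + x_odd * cos   # second row of R(theta) per block
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

score_a = apply_rope(q, 3) @ apply_rope(k, 7)     # gap 4
score_b = apply_rope(q, 53) @ apply_rope(k, 57)   # same gap, shifted by 50
```

No rotation matrix is ever materialized: each block needs only its cosine and sine, so the whole operation reduces to element-wise multiplies and adds. As in the 2D case, `score_a` and `score_b` agree up to floating-point error.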

| Property | Detail |
| --- | --- |
| Right behavior | Relative-distance-aware similarity, by construction |
| No learned parameters | Cannot overfit to training-set positional patterns |
| Cheap | Two element-wise multiplications per token per attention layer |
| Extends to any sequence length | Angles are deterministic functions of position |
| Long-term decay built in | Upper bound on attention weight shrinks with relative distance (with some oscillation) |

| Pitfall | Reality |
| --- | --- |
| Sinusoidal and RoPE are the same because both use sin/cos | No. Sinusoidal (Phase 1) adds vectors to the input; RoPE rotates vectors inside the attention computation. Same trig machinery, completely different architectural placement. |
| RoPE has learnable parameters | No. The rotation angles are deterministic functions of position and dimension index. |
| All modern LLMs use RoPE | Most do; not all. Some still use ALiBi or older schemes. Check the model card. |
| T5 and ALiBi failed | They worked. They just did not become dominant. RoPE won on the combination of right-behavior, no parameters, and cheap implementation. |

| Model-card phrase | What it means |
| --- | --- |
| “Uses RoPE” | Position info is rotated into attention via query/key rotation; relative-distance-aware similarity by construction |
| “ALiBi position embeddings” | Deterministic linear bias added inside the attention softmax; no learned parameters |
| “T5 relative position bias” | Learned scalar bias added inside the attention softmax, bucketized by relative distance |
| “Learned absolute position embeddings” | Older scheme (Phase 1); one trainable vector per position added to input; bounded by training-set max sequence length |

  • Position embedding (or position encoding): any scheme for telling the model where each token sits in the sequence.
  • Structural shift: the move from adding position to the input (Phase 1) to injecting it directly into the attention computation (this lesson).
  • T5 relative bias: learned scalar bias added inside the attention softmax, bucketized by relative distance.
  • ALiBi (Attention with Linear Biases): deterministic linear bias added inside the attention softmax, slope hard-coded per attention head.
  • RoPE (Rotary Position Embeddings): rotation of the query and key vectors by position-dependent angles inside the attention computation.
  • Rotation matrix: the 2D matrix [[cos θ, -sin θ], [sin θ, cos θ]] that rotates a 2D vector by angle θ while preserving its magnitude.
  • Relative position: the gap between two tokens (n - m), as opposed to their absolute positions in the sequence.
  • Long-term decay: the property that the attention weight upper bound shrinks (with some oscillation) as relative distance grows; built into RoPE by construction.

Sinusoidal embeddings add position to the input.
RoPE rotates position into the attention itself.
The second is what modern LLMs actually use.