Lesson: How modern models inject position into attention (RoPE)

In Phase 1 (How models know word order) we saw the original answer to the position problem: the model needs to know where each token sits in the sequence, and the 2017 transformer paper put that information in by adding a position vector (sinusoidal or learned) to the token embedding before the first attention layer. That answer worked. It is still in the textbooks. It is not what most modern LLMs do.

The reason is structural, not just empirical. This lesson is about the path from the 2017 answer to the modern answer, which centers on a scheme called RoPE (rotary position embeddings).

By the end you will know what the structural shift was, what the two intermediate schemes did (T5 relative bias and ALiBi), and what RoPE actually does in plain language. This is also the first of three lessons in this phase on the architectural changes that stuck after 2017. The next two cover normalization (LayerNorm to RMSNorm and the pre-norm shift) and attention efficiency tricks (sliding windows and the MQA/GQA progression). Most of the original transformer is intact in modern LLMs; these three are the parts that actually changed.

The structural shift: from the input to the attention layer

What we actually care about is making closer tokens more similar than distant tokens in the attention computation itself. (That goal, “closer tokens should be more similar than distant ones,” is the load-bearing intuition that runs through every position-embedding scheme; it is worth holding onto explicitly.) The sinusoidal scheme achieves the property indirectly: it adds a position vector to the input, that vector flows through the network, and somewhere downstream the attention computation ends up reflecting position. There is no guarantee the property survives every layer cleanly.

A cleaner approach is to put the position signal directly into the attention math, where the comparison between tokens happens. That is the move the field made, and the next two schemes are the ones that tried it.

T5 relative position bias. The T5 paper added a learned bias term inside the softmax of the attention computation. Instead of softmax(Q · K^T / sqrt(d)), you compute softmax(Q · K^T / sqrt(d) + b(m, n)), where b(m, n) is a learned scalar that depends only on the relative distance m - n. In practice the relative distances are bucketized (so all distances between 5 and 8 share one bias, all between 9 and 16 share the next, and so on), and the model learns each bucket’s value during training. The bias is just additive inside the softmax, so it does not need to satisfy any normalization constraint; the softmax handles that.

This works. It still has the learned-parameter issue: the bias values reflect what your training data looked like, with all the same overfitting concerns as learned input-position embeddings.
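The bucketed bias can be sketched in a few lines. This is an illustration under stated assumptions, not T5's actual code: the bucket boundaries and bias values below are made up (in T5 the values are learned), the names bucket_index, relative_bias, and attention_weights are hypothetical, and the sketch uses the absolute distance |m - n| for simplicity.

```python
import math

# Hypothetical bucket boundaries and bias values, for illustration only.
# In T5 the per-bucket values are learned during training.
BUCKET_UPPER_BOUNDS = [0, 1, 2, 4, 8, 16]  # distance <= bound falls in that bucket
BUCKET_BIAS = [0.5, 0.3, 0.1, -0.2, -0.6, -1.0, -1.5]  # last entry covers distances > 16


def bucket_index(distance):
    """Map an absolute relative distance to a bucket index."""
    for idx, bound in enumerate(BUCKET_UPPER_BOUNDS):
        if distance <= bound:
            return idx
    return len(BUCKET_UPPER_BOUNDS)


def relative_bias(m, n):
    """The scalar b(m, n): one value per bucket of |m - n|."""
    return BUCKET_BIAS[bucket_index(abs(m - n))]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, m):
    """Weights for a query at position m over keys at positions 0..len(keys)-1."""
    scale = math.sqrt(len(q))
    logits = [
        sum(qi * ki for qi, ki in zip(q, k)) / scale + relative_bias(m, n)
        for n, k in enumerate(keys)
    ]
    return softmax(logits)
```

With identical key vectors the only thing separating the weights is the bias, so the nearer keys end up with the larger share. No extra normalization is needed because the bias is added before the softmax.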

ALiBi (Attention with Linear Biases). Same idea, but deterministic. The bias inside the softmax is a simple linear function of the relative distance: the further apart m and n are, the more negative the bias, which pushes the softmax weight down. No learnable parameters. The slope of the linear function is hard-coded, with a different value for each attention head.
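A minimal sketch of that bias, assuming a causal setup (keys at or before the query) and the geometric slope sequence the ALiBi paper describes for a power-of-two head count. The function names alibi_slopes and alibi_bias are chosen here for illustration.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes num_heads is a power of two, the case the ALiBi paper spells out.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(m, n, slope):
    """Bias added to the logit for a query at position m and a key at position n <= m."""
    return -slope * (m - n)


slopes = alibi_slopes(8)
# Same query/key gap, different heads: steeper slopes penalize distance harder.
biases_at_gap_6 = [alibi_bias(10, 4, s) for s in slopes]
```

Heads with a large slope focus on nearby tokens, while heads with a small slope can still see far back. Nothing here is trained.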

ALiBi is cheap, has no overfitting risk, and was shown to extrapolate to sequences longer than the ones the model was trained on. It earned a real place in the literature but never became dominant. Most modern models use a different scheme, which is the rest of this lesson.

RoPE stands for Rotary Position Embeddings. It is the position-encoding scheme used by most modern LLMs you have heard of. The intuition is worth slowing down for, because the math is dense but the idea is not.

The core move: instead of adding a position vector to the input, or adding a bias to the softmax, rotate the query and key vectors by an angle that depends on their position.

To get the intuition, start in 2D. A vector v = (x, y) sits somewhere in the 2D plane. To rotate that vector counterclockwise about the origin by an angle θ, multiply it by the rotation matrix:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

The rotated vector R(θ) · v lands at the same distance from the origin (rotation preserves length) but pointing in a different direction. If you write v in polar form as length r and angle φ (so v = r · (cos φ, sin φ)), the rotated vector is r · (cos(φ + θ), sin(φ + θ)). The angle adds; the magnitude does not change.
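Both claims (length preserved, angle added) are easy to check numerically. A small sketch with an arbitrary example vector and angle; the helper name rotate_2d is chosen here for illustration.

```python
import math


def rotate_2d(v, theta):
    """Multiply the 2D vector v by the rotation matrix R(theta)."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


v = (3.0, 4.0)   # arbitrary example vector, length 5
theta = 0.7      # arbitrary rotation angle in radians
w = rotate_2d(v, theta)

length_before = math.hypot(*v)
length_after = math.hypot(*w)
angle_before = math.atan2(v[1], v[0])
angle_after = math.atan2(w[1], w[0])

print(length_before, length_after)             # the two lengths agree
print(angle_after - angle_before, theta)       # the angle grew by theta
```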

That is the entire mechanism RoPE leverages. Hold onto it.

Now consider the attention computation. We have a query vector q for the token at position m and a key vector k for the token at position n. Their similarity is the dot product q · k.

RoPE rotates q by an angle proportional to its position m. Call that angle θ_m. It rotates k by an angle proportional to its position n, with the same constant of proportionality. Call that angle θ_n. The new query is R(θ_m) · q and the new key is R(θ_n) · k.

Compute the dot product of the rotated vectors. The algebra (which the lecture leaves as an at-home derivation) shows that (R(θ_m) · q) · (R(θ_n) · k) contains a factor of R(θ_n - θ_m), the rotation matrix of the relative angle. The whole expression ends up being a function of θ_n - θ_m, and because each angle is proportional to position, that difference depends only on the relative position n - m.
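For readers who want the skipped step, here is one way to write it out. It uses two standard facts about 2D rotation matrices: the transpose of R(θ) is R(-θ), and rotations compose by adding angles. The last line assumes the angle is linear in position, θ_m = m · ω, where ω is a symbol introduced here for the per-step angle.

```latex
\begin{aligned}
\bigl(R(\theta_m)\,q\bigr)\cdot\bigl(R(\theta_n)\,k\bigr)
  &= q^{\top} R(\theta_m)^{\top} R(\theta_n)\, k \\
  &= q^{\top} R(-\theta_m)\, R(\theta_n)\, k \\
  &= q^{\top} R(\theta_n - \theta_m)\, k \\
  &= q^{\top} R\bigl((n - m)\,\omega\bigr)\, k
     \quad \text{when } \theta_m = m\,\omega .
\end{aligned}
```

The absolute positions m and n have dropped out. Only their difference survives.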

To see why this matters in concrete terms, picture the 2D case again. The original query q for the token at position 5 starts pointing in some direction. The original query for the same word at position 50 would start pointing in the same direction (it is the same word). After RoPE, those two queries have been rotated by very different angles (the position-5 rotation is small, the position-50 rotation is large). When either one then takes a dot product with a key vector that was rotated by its own position’s angle, the result depends on the gap between the rotations, not on their absolute values. For the same query and key content, two tokens five positions apart produce the same dot product whether they sit at positions 5 and 10 or at positions 500 and 505.
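The positions-5-and-10 versus positions-500-and-505 comparison can be checked directly in 2D. A sketch under assumptions: the per-step angle OMEGA and the example vectors are arbitrary values picked for illustration, and rope_score is a hypothetical helper name.

```python
import math

OMEGA = 0.1  # hypothetical rotation angle per position step, in radians


def rotate_2d(v, theta):
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def rope_score(q, k, m, n):
    """Dot product of q rotated for position m and k rotated for position n."""
    qm = rotate_2d(q, m * OMEGA)
    kn = rotate_2d(k, n * OMEGA)
    return qm[0] * kn[0] + qm[1] * kn[1]


q = (1.0, 2.0)    # arbitrary query content
k = (0.5, -1.5)   # arbitrary key content

near_start = rope_score(q, k, 5, 10)       # gap of 5, early in the sequence
far_along = rope_score(q, k, 500, 505)     # gap of 5, much later
different_gap = rope_score(q, k, 5, 25)    # gap of 20

print(near_start, far_along, different_gap)
```

The first two scores agree up to floating-point rounding, while the third differs because the gap changed.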

That is the result that matters. The dot product of two RoPE-rotated vectors depends on how far apart the tokens are, not on their absolute positions. Same property the sinusoidal embeddings had (Phase 1 covered this), but now baked directly into the attention dot product instead of injected at the input layer.

Real query and key vectors are not 2D; they are hundreds or thousands of dimensions. RoPE handles this by applying the rotation block by block: split the high-dimensional vector into 2D pieces (so a 768-dim vector becomes 384 separate 2D blocks) and apply a rotation to each block independently. Each block rotates by a different angle; the angles typically follow a frequency pattern related to the same omega_i = 10000^(-2i / d_model) formula that drove the original sinusoidal embeddings, so low-index blocks rotate quickly with position and high-index blocks rotate slowly.
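The block-by-block version can be sketched as follows. Assumptions to flag: this pairs adjacent coordinates (0,1), (2,3), and so on, whereas some implementations pair the first half of the vector with the second half instead; the base of 10000 follows the formula above; and the names rope_frequencies and apply_rope are chosen here for illustration.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """One angular frequency per 2D block: base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by position * omega_i."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i, omega in enumerate(rope_frequencies(dim, base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        angle = position * omega
        c, s = math.cos(angle), math.sin(angle)
        out.extend((c * x - s * y, s * x + c * y))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary 8-dimensional example vectors (4 blocks).
q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.4, -0.6, 0.9, 0.3, -0.8, 1.5, 0.1]

score_early = dot(apply_rope(q, 3), apply_rope(k, 7))
score_late = dot(apply_rope(q, 103), apply_rope(k, 107))
print(score_early, score_late)  # same gap of 4, so the scores agree up to rounding
```

Because every block is its own 2D rotation, the relative-position property from the 2D case holds block by block and therefore for the whole dot product.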

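The implementation is cheap. In the usual formulation it comes down to two element-wise multiplications and one addition for each query or key vector, using cosine and sine tables that depend only on position. No learned parameters, no extra matrices to multiply.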
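One common way to get that cost is to avoid building rotation matrices at all: multiply the vector by a cosine table, multiply a pair-swapped copy by a sine table, and add the two. A sketch assuming the adjacent-pair layout; rope_elementwise is a hypothetical name.

```python
import math


def rope_elementwise(vec, position, base=10000.0):
    """Apply RoPE as vec * cos + swapped * sin, with no matrix products."""
    dim = len(vec)
    # Each block's angle, repeated for both coordinates of the block.
    angles = [position * base ** (-2.0 * (j // 2) / dim) for j in range(dim)]
    cos_table = [math.cos(a) for a in angles]
    sin_table = [math.sin(a) for a in angles]
    # Swap each pair and negate the first entry: (x, y) -> (-y, x).
    swapped = [-vec[j + 1] if j % 2 == 0 else vec[j - 1] for j in range(dim)]
    return [v * c + w * s for v, c, w, s in zip(vec, cos_table, swapped, sin_table)]


rotated = rope_elementwise([1.0, 0.0], position=1)
print(rotated)  # the unit x-vector rotated by one radian: (cos 1, sin 1)
```

In practice the cosine and sine tables depend only on position and dimension, so they can be computed once and reused across layers and batches.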

Three reasons RoPE became the dominant choice.

The property is right. Relative-distance-aware similarity falls out of the math, not out of learned parameters that can overfit.

The math has a long-term decay. A formal upper bound on the attention weight as a function of m - n shows that the bound shrinks as the relative distance grows. The bound is not perfectly monotonic (the lecture notes some oscillations along the way), but the long-run trend is decay: closer tokens have higher maximum possible attention weight than distant tokens, by construction of the rotation scheme. This matches the intuition we wanted in the first place.
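A rough way to see the trend, not the formal bound itself: take a query and a key that are both all-ones vectors and tabulate their RoPE dot product as the gap between them grows. The dimension of 64 and the all-ones content are arbitrary choices for illustration, and ones_score is a hypothetical name.

```python
import math


def ones_score(distance, dim=64, base=10000.0):
    """RoPE score between two all-ones vectors whose positions differ by `distance`.

    Each 2D block contributes (1, 1) . R(distance * omega_i) (1, 1),
    which simplifies to 2 * cos(distance * omega_i).
    """
    return sum(
        2.0 * math.cos(distance * base ** (-2.0 * i / dim))
        for i in range(dim // 2)
    )


for distance in (0, 1, 4, 16, 64, 256, 1024):
    print(distance, round(ones_score(distance), 2))
```

At a gap of zero every block contributes its maximum. As the gap grows, the fast-rotating blocks fall out of phase first, which is where both the downward trend and the wiggles come from.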

It extends to any sequence length. Because the angles are deterministic functions of position, you can apply RoPE to positions you never saw during training. The model can technically attend to inputs longer than its training window. In practice the model’s downstream components were not trained on those longer sequences and may not handle them well, which is why RoPE scaling techniques like YaRN (Peng et al. 2023) and LongRoPE (Microsoft 2024) became standard for stretching pretrained 4K-32K models to 128K-2M context windows: they reshape RoPE’s frequency basis so the rotated query and key vectors stay in a regime the trained attention can handle. Most 2026 frontier models advertising 1M+ context route through YaRN or LongRoPE during a long-context fine-tuning stage. The lecture stops at the unscaled extension, but the production reality includes a scaling step.
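To make the scaling idea concrete, here is the simplest member of that family: plain linear position interpolation, which divides every position by a constant before computing the angle. This is not YaRN or LongRoPE (both treat different frequencies differently); it is a sketch of the underlying idea only. The window sizes and the function name rope_angle are hypothetical.

```python
def rope_angle(position, block_index, dim, base=10000.0, scale=1.0):
    """Rotation angle for one 2D block.

    scale > 1 compresses positions (linear position interpolation), so a
    longer sequence maps back into the angle range seen during training.
    """
    omega = base ** (-2.0 * block_index / dim)
    return (position / scale) * omega


TRAINED_WINDOW = 4096    # hypothetical training context length
TARGET_WINDOW = 32768    # hypothetical extended context length
scale = TARGET_WINDOW / TRAINED_WINDOW

trained_limit = rope_angle(TRAINED_WINDOW, 0, 64)
unscaled_last = rope_angle(TARGET_WINDOW - 1, 0, 64)
scaled_last = rope_angle(TARGET_WINDOW - 1, 0, 64, scale=scale)

print(trained_limit, unscaled_last, scaled_last)
```

Without scaling, the last position's angle in the fastest block lands far outside anything seen in training. With scaling it stays inside the trained range, at the cost of squeezing neighbouring positions closer together, which is one reason a fine-tuning stage usually follows.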

Two consequences worth holding onto when you read AI tooling docs or model cards.

  • “Uses RoPE” is not vendor jargon; it is a load-bearing architectural choice. When a model card mentions RoPE, you now know what that means: position information is rotated into the attention computation itself, not added to the input. That is one of the few clean structural improvements the field has shipped over the original 2017 transformer.
  • Position-embedding choice is one of the few places the field actually moved on. Most of the original transformer architecture is intact in modern LLMs. Position embeddings, normalization, and attention efficiency are the parts that genuinely changed. If you read about “modern transformer tweaks” elsewhere, those three are usually the list (and the next two lessons in this phase cover the other two).

A few mistakes are common enough to be worth naming.

Conflating sinusoidal and RoPE. Both use sines and cosines. They are not the same scheme. Sinusoidal embeddings (covered in the Phase 1 lesson) are vectors added to the input before the first attention layer. RoPE is a rotation applied to the query and key vectors inside every attention layer. The math overlaps; the architectural placement is completely different.

Thinking RoPE has learnable parameters. It does not. The rotation angles are deterministic functions of position and dimension index, set by the same frequency formula sinusoidal used. There is nothing to fit during training in the position-embedding component itself.

Assuming all modern LLMs use RoPE. Most do; not all. Some open-source models still ship with ALiBi (or an earlier learned scheme); a few experiment with new schemes that have not yet displaced RoPE. When you read a model card, the position-embedding choice is one of the things to actually look at instead of assume.

Forgetting that all of these schemes are trying to encode the same property. Relative-distance-aware similarity in the attention computation. Learned position embeddings, sinusoidal, T5 relative bias, ALiBi, RoPE: they all aim at the same property. The methods differ; the goal is identical.

  • The structural shift was from input-added to attention-injected position information. What we care about is similarity behavior in the attention layer; injecting position info there directly is cleaner than adding to the input and hoping the property propagates through the network.
  • Two intermediate schemes earned a place but did not win. T5 relative bias adds a learned bias inside the softmax; ALiBi adds a deterministic linear bias. Both worked; neither became dominant.
  • RoPE rotates the query and key vectors. The dot product of rotated vectors depends on relative position. Cheap to compute, no learned parameters, extends to any sequence length, has long-term decay built in. Most modern LLMs use it.
  • The 2D intuition generalizes. Real attention vectors are high-dimensional; RoPE applies the rotation block by block, in 2D pieces, with angles drawn from a frequency pattern.

Sinusoidal embeddings add position to the input.
RoPE rotates position into the attention itself.
The second is what modern LLMs actually use.