Cheatsheet: How models know word order
The one idea that matters
- Transformers process all tokens in parallel.
- They cannot tell which token came first.
- Position information has to be added back explicitly.

The two original schemes
| Scheme | How the vector is produced | Learnable? | Extrapolates? | Relative-distance property | Status |
|---|---|---|---|---|---|
| Learned position embeddings | One trainable vector per position; learned from data | Yes | No (bounded by training max) | Not by construction | Mostly historical |
| Sinusoidal position embeddings | Fixed sin/cos formula per position and dimension | No | Yes (formula works for any m) | Yes (dot product = function of m - n) | Still in textbooks |
The 2017 paper ("Attention Is All You Need") picked sinusoidal: nearly identical task performance to learned embeddings, plus the prospect of extrapolating to longer sequences.
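The "bounded by training max" entry in the table can be illustrated with a minimal sketch of a learned position table. The names `MAX_LEN`, `D_MODEL`, and `learned_position_vector` are hypothetical, the values are random stand-ins rather than trained weights, and rows are indexed from 0 purely for convenience.

```python
import numpy as np

MAX_LEN = 8   # hypothetical maximum sequence length seen in training
D_MODEL = 4   # hypothetical embedding width

rng = np.random.default_rng(0)
# One row per position. A real model would train these values;
# random numbers stand in for them here.
learned_table = rng.normal(size=(MAX_LEN, D_MODEL))

def learned_position_vector(m: int) -> np.ndarray:
    """Look up the vector for position m (0-based row index in this sketch)."""
    if not 0 <= m < MAX_LEN:
        raise IndexError(f"position {m} has no learned vector (table holds {MAX_LEN} rows)")
    return learned_table[m]

print(learned_position_vector(3).shape)  # (4,)

try:
    learned_position_vector(MAX_LEN)     # one past the last trained position
except IndexError as err:
    print("out of range:", err)
```

A lookup table simply has no row for an unseen position, whereas a fixed formula can be evaluated for any `m`.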
The sinusoidal formula
$$
PE(m, 2i) = \sin\left(\frac{m}{10000^{2i/d_{\text{model}}}}\right)
$$

$$
PE(m, 2i+1) = \cos\left(\frac{m}{10000^{2i/d_{\text{model}}}}\right)
$$

| Term | What it means |
|---|---|
| `m` | Position in the sequence (first token = 1) |
| `i` | Dimension index pair (one pair of sin/cos per pair of dimensions) |
| `d_model` | Total embedding dimension |
| `10000^(2i / d_model)` | Frequency denominator: low i gives fast oscillation (high freq), high i gives slow oscillation (low freq) |
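The formula translates almost line for line into NumPy. This is a minimal sketch: the function name `sinusoidal_pe` is an assumption, and `d_model` is assumed to be even so that every dimension belongs to a sin/cos pair.

```python
import numpy as np

def sinusoidal_pe(m: int, d_model: int) -> np.ndarray:
    """Return PE(m) as a vector of length d_model (d_model assumed even)."""
    i = np.arange(d_model // 2)                 # pair index i = 0, 1, ...
    angle = m / (10000.0 ** (2 * i / d_model))  # one angle per sin/cos pair
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angle)                    # even dimensions 2i
    pe[1::2] = np.cos(angle)                    # odd dimensions 2i + 1
    return pe

print(sinusoidal_pe(m=1, d_model=4))
# The same function works for a position far beyond any training length.
print(sinusoidal_pe(m=100_000, d_model=4))
```

Nothing in the function depends on a maximum length, which is what the "formula works for any m" entry refers to.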
Relative-distance property: cos(a - b) = cos a cos b + sin a sin b. Applying this identity to each sin/cos pair, the dot product of PE(m) and PE(n) collapses to a function of m - n only. Nearby positions tend to produce higher dot products; the value falls off with distance, though not strictly monotonically.
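Spelled out, with $\omega_i = 10000^{-2i/d_{\text{model}}}$ as shorthand (a symbol introduced here for brevity, not used elsewhere in this cheatsheet) and $d_{\text{model}}$ assumed even, the identity is applied once per sin/cos pair:

$$
\begin{aligned}
PE(m) \cdot PE(n) &= \sum_{i=0}^{d_{\text{model}}/2 - 1} \left[ \sin(\omega_i m)\sin(\omega_i n) + \cos(\omega_i m)\cos(\omega_i n) \right] \\
&= \sum_{i=0}^{d_{\text{model}}/2 - 1} \cos\big(\omega_i (m - n)\big)
\end{aligned}
$$

Every term equals 1 when m = n, so the dot product of a position with itself is the maximum possible value, $d_{\text{model}}/2$.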
What “added to the token embedding” means
Input to first attention layer = e_token + p_position

- e_token = embedding vector encoding WHAT the token is (meaning)
- p_position = position vector encoding WHERE the token sits (slot)

Same word at position 3 vs position 7 arrives with a different vector because p_3 ≠ p_7.
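The addition can be sketched in a few lines. The tables below are hypothetical toy stand-ins filled with random numbers; `token_table`, `position_table`, and `model_input` are names invented for this illustration.

```python
import numpy as np

D_MODEL = 4
rng = np.random.default_rng(0)

# Hypothetical toy tables; a real model learns the token table during training.
token_table = {"cat": rng.normal(size=D_MODEL)}
position_table = rng.normal(size=(10, D_MODEL))  # stand-in position vectors

def model_input(token: str, position: int) -> np.ndarray:
    """e_token + p_position: what the first attention layer receives."""
    return token_table[token] + position_table[position]

x3 = model_input("cat", 3)
x7 = model_input("cat", 7)

print(np.allclose(x3, x7))  # False: same word, different slot
# The entire difference comes from the position vectors.
print(np.allclose(x3 - x7, position_table[3] - position_table[7]))  # True
```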
Why the story is not over
| Phase | Scheme | Where position info lives |
|---|---|---|
| 2017 transformer (Phase 1 lesson) | Learned or sinusoidal | Added to the input embedding |
| Modern LLMs (Phase 2 lesson) | T5 relative bias, ALiBi, RoPE | Injected into the attention computation itself |
Understanding why injecting position into the attention computation is cleaner requires knowing what attention does. That is Phase 2.
Pitfalls to dodge
| Pitfall | Reality |
|---|---|
| “The model just figures out word order from training” | No. Without an explicit position signal, the model cannot distinguish “the cat sat on the mat” from “the mat sat on the cat.” Position embeddings are required, not optional. |
| “Position embeddings and token embeddings are the same thing” | No. Token embeddings encode what a token is (meaning). Position embeddings encode where it sits (slot). Both are vectors; both get added together; they carry different information. |
| “Sinusoidal and RoPE are the same because both use sin/cos” | No. Sinusoidal adds vectors to the input. RoPE rotates vectors inside the attention computation. Same trig machinery; different architectural placement. |
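To make the last row concrete, here is a rough sketch of the rotation idea behind RoPE, not a faithful copy of any library's implementation. It assumes the same frequency schedule as the sinusoidal formula and rotates each (even, odd) dimension pair of a query or key; `rotate_pairs` and `score` are hypothetical names.

```python
import numpy as np

def rotate_pairs(x: np.ndarray, m: int) -> np.ndarray:
    """Rotate each (even, odd) pair of x by the angle m * w_i (RoPE-style)."""
    d = x.shape[0]
    i = np.arange(d // 2)
    theta = m / (10000.0 ** (2 * i / d))   # same frequencies as the sinusoidal scheme
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)  # stand-in query and key vectors

def score(m: int, n: int) -> float:
    """Attention score between a query at position m and a key at position n."""
    return float(rotate_pairs(q, m) @ rotate_pairs(k, n))

# Both pairs of positions are 3 apart, so the scores match.
print(np.isclose(score(5, 2), score(13, 10)))  # True
# A rotation leaves vector length unchanged; adding a position vector would not.
print(np.isclose(np.linalg.norm(rotate_pairs(q, 5)), np.linalg.norm(q)))  # True
```

Nothing is added to the input here: position enters only when the query and key are compared.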
Glossary
- Position embedding (position encoding): any scheme that tells the model where each token sits in the sequence, in a form the model can use.
- Permutation invariance: the property that, with no position info attached, the attention mechanism computes the same output vector for each token regardless of input order (reordering the inputs only reorders the outputs). The reason position embeddings must exist.
- Learned position embeddings: one trainable vector per position, added to the input. Bounded by training-set max length.
- Sinusoidal position embeddings: fixed `sin/cos` formula per position and dimension, added to the input. Extrapolates to any length; relative-distance-aware by construction.
- Relative distance: `m - n`, the gap between two token positions, as opposed to their absolute positions in the sequence.
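The permutation-invariance entry can be checked numerically. The sketch below uses a deliberately simplified single-head self-attention with identity projections (an assumption made for brevity; real layers use learned query, key, and value matrices) and random stand-in token vectors with no position information added.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity projections (a simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over keys
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))  # 6 stand-in token vectors, no position added
perm = rng.permutation(6)         # a shuffled word order

out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# Shuffling the input only shuffles the output rows in the same way.
print(np.allclose(out[perm], out_shuffled))                       # True
# Any order-insensitive summary, such as the mean, is identical.
print(np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0)))   # True
```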
Without an explicit position signal, the model treats your sentence as a bag of vectors.
The 2017 transformer added a position vector to each token before the rest of the model saw it.
Sinusoidal embeddings work, extend to any sequence length, and naturally encode relative distance.