Cheatsheet: How models know word order

Transformers process all tokens in parallel.
They cannot tell which token came first.
Position information has to be added back explicitly.
| Scheme | How the vector is produced | Learnable? | Extrapolates? | Relative-distance property | Status |
| --- | --- | --- | --- | --- | --- |
| Learned position embeddings | One trainable vector per position; learned from data | Yes | No (bounded by training max) | Not by construction | Mostly historical |
| Sinusoidal position embeddings | Fixed sin/cos formula per position and dimension | No | Yes (formula works for any m) | Yes (dot product = function of m - n) | Still in textbooks |

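The 2017 paper ("Attention Is All You Need") tried both and picked sinusoidal: the two gave nearly identical task performance, and the sinusoidal version may let the model extrapolate to sequences longer than those seen in training.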
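To make "bounded by training max" concrete, here is a minimal sketch of a learned position table. The table size, the random stand-in values, and the helper name `learned_pe` are illustrative assumptions, not taken from any real model.

```python
import numpy as np

max_len, d_model = 4, 8            # hypothetical training-time maximum and width
rng = np.random.default_rng(0)
# Stand-in for trained vectors: one row per position seen in training.
learned_table = rng.normal(size=(max_len, d_model))

def learned_pe(m):
    """Look up the vector for position m (first token = 1)."""
    if not 1 <= m <= max_len:
        raise IndexError(f"no learned vector for position {m}; table stops at {max_len}")
    return learned_table[m - 1]

in_range = learned_pe(4)           # fine: position 4 has a row in the table

try:
    learned_pe(5)                  # one past the training max
    out_of_range_failed = False
except IndexError as err:
    out_of_range_failed = True
    print(err)
```

A lookup table simply has no row for a position it never trained on, which is the limitation a formula-based scheme avoids.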

PE(m, 2i) = sin( m / 10000^(2i / d_model) )
PE(m, 2i+1) = cos( m / 10000^(2i / d_model) )
| Term | What it means |
| --- | --- |
| m | Position in the sequence (first token = 1) |
| i | Dimension index pair (one pair of sin/cos per pair of dimensions) |
| d_model | Total embedding dimension |
| 10000^(2i / d_model) | Frequency denominator: low i gives fast oscillation (high freq), high i gives slow oscillation (low freq) |
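The two formulas above can be sketched in NumPy as follows. The function name `sinusoidal_pe` and the `base` parameter are illustrative choices, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_pe(positions, d_model, base=10000.0):
    """Return an array of shape (len(positions), d_model).

    Even columns hold sin terms (PE(m, 2i)), odd columns hold cos terms
    (PE(m, 2i+1)). Assumes d_model is even.
    """
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    m = np.asarray(positions, dtype=np.float64)[:, None]          # (n, 1)
    two_i = np.arange(0, d_model, 2, dtype=np.float64)[None, :]   # (1, d_model/2), values 2i
    angles = m / base ** (two_i / d_model)                        # (n, d_model/2)
    pe = np.empty((m.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Positions 1..50, following the "first token = 1" convention used here.
pe = sinusoidal_pe(range(1, 51), d_model=16)
print(pe.shape)
```

Because nothing is looked up from a trained table, the same call works for any position, including ones far beyond the example's 50.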

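Relative-distance property: cos(a - b) = cos a cos b + sin a sin b. Writing w_i = 1 / 10000^(2i / d_model), each sin/cos pair contributes sin(w_i m) sin(w_i n) + cos(w_i m) cos(w_i n) = cos(w_i (m - n)) to the dot product, so the dot product of PE(m) and PE(n) collapses to a function of m - n only. Closer positions generally produce higher dot products and distant positions lower ones, although the decay is not strictly monotonic.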
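A quick numerical check of this property, as a sketch: the helper names `sinusoidal_pe` and `dot` and the choice `d_model=64` are assumptions made for illustration.

```python
import numpy as np

def sinusoidal_pe(m, d_model=64, base=10000.0):
    """Sinusoidal vector for a single position m (sin in even slots, cos in odd)."""
    two_i = np.arange(0, d_model, 2, dtype=np.float64)
    angles = m / base ** (two_i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def dot(m, n):
    return float(sinusoidal_pe(m) @ sinusoidal_pe(n))

# Same gap (m - n = 3) at very different absolute positions.
same_gap = [dot(5, 2), dot(105, 102), dot(1005, 1002)]
print(same_gap)

# Growing gaps from a fixed position: 0, 1, and 10 steps away.
print(dot(10, 10), dot(10, 11), dot(10, 20))
```

The three values in `same_gap` agree up to floating-point error, and the dot product is largest at gap 0, where it equals d_model / 2 because each pair contributes sin² + cos² = 1.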

What “added to the token embedding” means

Input to first attention layer = e_token + p_position
e_token = embedding vector encoding WHAT the token is (meaning)
p_position = position vector encoding WHERE the token sits (slot)

Same word at position 3 vs position 7 arrives with a different vector because p_3 ≠ p_7.
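The addition can be sketched with toy tables. The vocabulary, the random stand-in vectors, and the helper `model_input` are all hypothetical; a real model would use trained or sinusoidal values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
vocab = {"the": 0, "cat": 1, "sat": 2}                   # hypothetical toy vocabulary
token_table = rng.normal(size=(len(vocab), d_model))     # stand-in for token embeddings
position_table = rng.normal(size=(10, d_model))          # stand-in for p_1 .. p_10

def model_input(word, position):
    """Input to the first attention layer: e_token + p_position."""
    e_token = token_table[vocab[word]]          # WHAT the token is
    p_position = position_table[position - 1]   # WHERE it sits (first token = 1)
    return e_token + p_position

x3 = model_input("cat", 3)
x7 = model_input("cat", 7)
print(np.allclose(x3, x7))   # False: same word, different slot, different vector
```

The difference `x3 - x7` equals `p_3 - p_7` exactly, since the token part cancels out.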

| Phase | Scheme | Where position info lives |
| --- | --- | --- |
| 2017 transformer (Phase 1 lesson) | Learned or sinusoidal | Added to the input embedding |
| Modern LLMs (Phase 2 lesson) | T5 relative bias, ALiBi, RoPE | Injected into the attention computation itself |

Understanding why injecting position into the attention computation is cleaner requires knowing what attention does. That is Phase 2.

| Pitfall | Reality |
| --- | --- |
| “The model just figures out word order from training” | No. Without an explicit position signal, the model cannot distinguish “the cat sat on the mat” from “the mat sat on the cat.” Position embeddings are required, not optional. |
| “Position embeddings and token embeddings are the same thing” | No. Token embeddings encode what a token is (meaning). Position embeddings encode where it sits (slot). Both are vectors; both get added together; they carry different information. |
| “Sinusoidal and RoPE are the same because both use sin/cos” | No. Sinusoidal adds vectors to the input. RoPE rotates vectors inside the attention computation. Same trig machinery; different architectural placement. |
  • Position embedding (position encoding): any scheme that tells the model where each token sits in the sequence, in a form the model can use.
  • Permutation invariance: the property that, with no position info attached, the attention mechanism gives each token the same output no matter how the input tokens are ordered (shuffling the input only shuffles the outputs the same way). The reason position embeddings must exist.
  • Learned position embeddings: one trainable vector per position, added to the input. Bounded by training-set max length.
  • Sinusoidal position embeddings: fixed sin/cos formula per position and dimension, added to the input. Extrapolates to any length; relative-distance-aware by construction.
  • Relative distance: m - n, the gap between two token positions, as opposed to their absolute positions in the sequence.
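The permutation-invariance entry above can be checked with a bare single-head self-attention layer. This is a minimal sketch with random, untrained weights and no position vectors; the function names and sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no position information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_tokens, d = 6, 8
X = rng.normal(size=(n_tokens, d))                      # token vectors only
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(n_tokens)                        # a random reordering

out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Each token gets the same output vector; only the row order changes.
print(np.allclose(out_shuffled, out[perm]))
# Anything pooled over the sequence is therefore identical.
print(np.allclose(out_shuffled.mean(axis=0), out.mean(axis=0)))
```

Both checks print True: without a position signal, the layer has no way to tell one ordering of the same tokens from another.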

Without an explicit position signal, the model treats your sentence as a bag of vectors.
The 2017 transformer added a position vector to each token before the rest of the model saw it.
Sinusoidal embeddings work, extend to any sequence length, and naturally encode relative distance.