How models know word order: cheatsheet

The one idea that matters

Transformers process all tokens in parallel.
They cannot tell which token came first.
Position information has to be added back explicitly.
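
A minimal NumPy sketch of why this holds, assuming a single unmasked attention head with random weights (all names here are illustrative, not from any library). With no position signal, shuffling the input tokens only shuffles the output rows the same way, so nothing in the result records the original order.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no mask and no position signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))  # toy token embeddings, no position added
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([4, 2, 0, 3, 1])   # an arbitrary reordering of the tokens
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Each token gets the same output vector whichever slot it sits in.
same_up_to_order = np.allclose(out[perm], out_shuffled)
```

Here `same_up_to_order` comes out true: the layer sees a set of vectors, not a sequence.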

The two original schemes

Scheme	How the vector is produced	Learnable?	Extrapolates?	Relative-distance property	Status
Learned position embeddings	One trainable vector per position; learned from data	Yes	No (bounded by training max)	Not by construction	Mostly historical
Sinusoidal position embeddings	Fixed `sin/cos` formula per position and dimension	No	In principle (formula is defined for any `m`)	Yes (dot product = function of `m - n`)	Still in textbooks
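
The "Extrapolates?" column can be illustrated with a small sketch. The table size `max_len = 16`, the random stand-in for trained weights, and the helper names are all assumptions made for illustration. A learned table has no row for an unseen position, while the fixed formula still returns a vector. Whether a trained model uses that vector well is a separate question this sketch does not test.

```python
import numpy as np

max_len, d_model = 16, 8  # hypothetical training-time limits
rng = np.random.default_rng(0)

# Learned scheme: one row per position (random here, trained in practice).
learned_table = rng.normal(size=(max_len, d_model))

def learned_pe(m):
    return learned_table[m]  # raises IndexError once m >= max_len

def sinusoidal_pe(m, d_model=d_model):
    i = np.arange(d_model // 2)              # one index per sin/cos pair
    angles = m / 10000 ** (2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)                # even dimensions
    pe[1::2] = np.cos(angles)                # odd dimensions
    return pe

def try_learned(m):
    """Return the learned vector, or None if the position has no row."""
    try:
        return learned_pe(m)
    except IndexError:
        return None

inside = try_learned(10)    # within the table
outside = try_learned(100)  # beyond the table, no vector available
far = sinusoidal_pe(100)    # the formula still produces a vector
```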

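The 2017 paper picked sinusoidal: the two schemes performed nearly identically, and the authors expected the fixed formula to extrapolate to sequences longer than those seen in training.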

The sinusoidal formula

PE(m, 2i)   = sin( m / 10000^(2i / d_model) )
PE(m, 2i+1) = cos( m / 10000^(2i / d_model) )

Term	What it means
`m`	Position in the sequence (first token = 0 in most implementations)
`i`	Index of the sin/cos pair, from 0 to `d_model/2 - 1`; pair `i` fills dimensions `2i` (sin) and `2i+1` (cos)
`d_model`	Total embedding dimension
`10000^(2i / d_model)`	Frequency denominator: low `i` gives fast oscillation (high freq), high `i` gives slow oscillation (low freq)
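
The formula and terms above can be sketched in NumPy as follows. This is one straightforward way to build the table, not a reference implementation. The function name is made up for illustration, positions are numbered from 0 in this sketch, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_position_embeddings(num_positions, d_model):
    """Return a (num_positions, d_model) matrix whose row m is PE(m)."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so dimensions pair up")
    m = np.arange(num_positions)[:, None]     # positions, shape (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]      # pair index, shape (1, d_model / 2)
    angles = m / 10000 ** (2 * i / d_model)   # shape (num_positions, d_model / 2)
    pe = np.empty((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)              # dimensions 2i
    pe[:, 1::2] = np.cos(angles)              # dimensions 2i+1
    return pe

pe = sinusoidal_position_embeddings(num_positions=50, d_model=16)
```

Reading down a low-`i` column of `pe` shows values changing quickly from row to row, while the highest-`i` columns barely move over 50 positions.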

Relative-distance property: by the identity cos(a - b) = cos a cos b + sin a sin b, the dot product of PE(m) and PE(n) collapses to a function of m - n only. Nearby positions generally produce higher dot products than distant ones, though the decay is not strictly monotonic.
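
Written out, the step goes as follows. The shorthand ω_i for the per-pair frequency is introduced here for convenience and is not part of the original formula.

```latex
\omega_i = 10000^{-2i / d_{\text{model}}}, \qquad i = 0, \dots, \tfrac{d_{\text{model}}}{2} - 1

\begin{aligned}
PE(m) \cdot PE(n)
  &= \sum_{i} \bigl[ \sin(\omega_i m)\,\sin(\omega_i n) + \cos(\omega_i m)\,\cos(\omega_i n) \bigr] \\
  &= \sum_{i} \cos\bigl(\omega_i (m - n)\bigr)
\end{aligned}
```

The sum reaches its maximum, d_model / 2, when m = n, and depends on m and n only through their difference.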

What “added to the token embedding” means

Input to first attention layer = e_token + p_position

e_token   = embedding vector encoding WHAT the token is (meaning)
p_position = position vector encoding WHERE the token sits (slot)

Same word at position 3 vs position 7 arrives with a different vector because p_3 ≠ p_7.
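
A toy sketch of the addition, assuming random stand-in tables, a five-word vocabulary, and 0-based slots (all invented for illustration). The word "the" appears twice in the sentence and arrives with two different vectors. The difference between them is exactly the difference between the two position vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}  # toy vocabulary
d_model, max_len = 8, 16

token_table = rng.normal(size=(len(vocab), d_model))   # rows play the role of e_token
position_table = rng.normal(size=(max_len, d_model))   # rows play the role of p_position

def embed(tokens):
    """Look up each token and each slot, then add the two vectors."""
    ids = np.array([vocab[t] for t in tokens])
    positions = np.arange(len(tokens))
    return token_table[ids] + position_table[positions]

x = embed("the cat sat on the mat".split())

# "the" sits in slots 0 and 4: same e_token, different p_position.
gap = x[4] - x[0]
position_gap = position_table[4] - position_table[0]
```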

Why the story is not over

Phase	Scheme	Where position info lives
2017 transformer (Phase 1 lesson)	Learned or sinusoidal	Added to the input embedding
Modern LLMs (Phase 2 lesson)	T5 relative bias, ALiBi, RoPE	Injected into the attention computation itself

Understanding why injecting position into attention is cleaner requires knowing what attention does. That is Phase 2.

Pitfalls to dodge

Pitfall	Reality
“The model just figures out word order from training”	No. Without an explicit position signal, the model cannot distinguish “the cat sat on the mat” from “the mat sat on the cat.” Position embeddings are required, not optional.
“Position embeddings and token embeddings are the same thing”	No. Token embeddings encode what a token is (meaning). Position embeddings encode where it sits (slot). Both are vectors; both get added together; they carry different information.
“Sinusoidal and RoPE are the same because both use sin/cos”	No. Sinusoidal adds vectors to the input. RoPE rotates vectors inside the attention computation. Same trig machinery; different architectural placement.
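
The last pitfall can be made concrete with a rough sketch. It assumes RoPE rotates each (even, odd) pair of a query or key vector and reuses the sinusoidal frequency schedule. The function names are illustrative, and real implementations differ in layout and batching.

```python
import numpy as np

def pair_angles(m, d):
    i = np.arange(d // 2)
    return m / 10000 ** (2 * i / d)

def add_sinusoidal(x, m):
    """2017 style: add a position vector to the input vector."""
    angles = pair_angles(m, x.shape[-1])
    pe = np.empty_like(x)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return x + pe

def rotate_rope(x, m):
    """RoPE style: rotate each (even, odd) pair of x by a position-dependent angle."""
    angles = pair_angles(m, x.shape[-1])
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)  # toy query and key vectors

# Same offset (3) at two different absolute positions.
score_near_start = rotate_rope(q, 5) @ rotate_rope(k, 2)
score_far_along = rotate_rope(q, 105) @ rotate_rope(k, 102)
```

The two scores agree up to floating-point error because rotating both vectors leaves only the relative angle in their dot product. The additive version changes the vector itself before attention ever runs.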

Glossary

Position embedding (position encoding): any scheme that tells the model where each token sits in the sequence, in a form the model can use.
Permutation invariance: the property that, with no position info attached, the attention mechanism is blind to token order. Shuffling the input tokens only shuffles the output vectors the same way (strictly speaking, permutation equivariance). The reason position embeddings must exist.
Learned position embeddings: one trainable vector per position, added to the input. Bounded by training-set max length.
Sinusoidal position embeddings: fixed sin/cos formula per position and dimension, added to the input. Extrapolates to any length; relative-distance-aware by construction.
Relative distance: m - n, the gap between two token positions, as opposed to their absolute positions in the sequence.

Without an explicit position signal, the model treats your sentence as a bag of vectors.
The 2017 transformer added a position vector to each token before the rest of the model saw it.
Sinusoidal embeddings work, are defined for any sequence length, and encode relative distance by construction.