How models know word order: brief

What you’ll learn

This lesson closes Phase 1 (How models read text). The previous two lessons walked through what happens to a sentence on its way into a model: it is split into tokens, each token becomes a dense vector, and the result is a sequence of vectors ready to be fed in. One piece is still missing. The vector for “cat” is the same whether “cat” appears at the start of the sentence or in the middle. If the model sees only a bag of vectors, it cannot tell “the cat sat on the mat” from “the mat sat on the cat”. Position information has to be added explicitly.

This lesson covers why and how. The 2017 transformer paper considered two schemes: a learned vector per position, or a fixed sin/cos formula. Both add the position vector to the token embedding before the rest of the model sees the input. The paper reported nearly identical results for the two and chose the sinusoidal version because it might let the model extrapolate to sequences longer than those seen in training; that is the answer that lives in textbooks. Modern LLMs have moved on, but the next step in that story (RoPE) requires understanding attention first, so it lives in Phase 2.
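As a rough preview of the sinusoidal scheme, the sketch below builds a position vector from the sin/cos formula in the 2017 paper (even dimensions use sine, odd dimensions use cosine, with a base of 10000) and adds it to a token embedding. The function names, the 4-dimensional size, and the numbers in the "cat" vector are made up for illustration; real models use hundreds or thousands of dimensions and learned token embeddings.

```python
import math


def sinusoidal_position(pos, d_model, base=10000.0):
    """Position vector for one position, following the 2017 paper's formula:
    PE[pos, 2i]   = sin(pos / base**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / base**(2i / d_model))
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos pair up")
    vec = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        vec.append(math.sin(angle))  # even dimension
        vec.append(math.cos(angle))  # odd dimension
    return vec


def add_position(token_vec, pos):
    """Element-wise sum of a token embedding and its position vector."""
    pe = sinusoidal_position(pos, len(token_vec))
    return [t + p for t, p in zip(token_vec, pe)]


# Hypothetical 4-dimensional embedding for "cat" (illustrative values only).
cat = [0.2, -0.1, 0.4, 0.3]

cat_at_0 = add_position(cat, 0)  # "cat" as the first token
cat_at_3 = add_position(cat, 3)  # "cat" as the fourth token

print(cat_at_0)
print(cat_at_3)
```

The same "cat" embedding produces two different input vectors depending on where the token sits, which is the information the model was missing. Note that even position 0 changes the vector, because sin(0) is 0 but cos(0) is 1.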

Where this fits

This is lesson 3 of 3 in Phase 1. The previous lesson was How words become vectors (embeddings). The next lesson, opening Phase 2, is How attention works. Once Phase 2 has built up attention, the modern position-embedding story (T5’s relative position bias, ALiBi, RoPE) becomes legible; it is covered later in Phase 2, in How modern models inject position into attention (RoPE).

Before you start

Prerequisites: the embeddings lesson. You should be comfortable with the idea that each token becomes a dense vector and that two vectors of the same size can be added element by element. No attention math is required (that is intentionally left for the next phase).

By the end, you’ll be able to

Explain why a model that processes all tokens in parallel cannot tell word order on its own
Distinguish learned and sinusoidal position embeddings, and explain why the 2017 transformer paper picked sinusoidal
Describe what “added to the token embedding” means concretely (positions ride along with token meaning in the same vector)
Recognize that the original schemes are not what most modern LLMs use, and know that the next phase covers what changed

Time and difficulty

Read time: about 12 minutes
Practice time: about 10 minutes (a small concrete sinusoidal-formula walk-through, plus flashcards)
Difficulty: standard