Embeddings: word vectors, in brief

What you’ll learn

This is lesson 2 of Phase 1 (How models read text) in Track 5 (AI Foundations). The Stanford CME 295 course materials (syllabus, schedule, the Amidi cheatsheets) are at cme295.stanford.edu. The previous lesson covered tokenization, the bridge from raw text to integer IDs. But a token ID is just a number with no meaning attached, so how does the model know that “king” is closer in meaning to “queen” than to “banana”?

The answer is embeddings: dense vectors, trained to capture meaning, that replace the token IDs before the rest of the model touches them. The lesson builds the intuition (words as points on a high-dimensional map), walks through the embedding lookup table that stores them (one row per token in the vocabulary, denoted W_E), pays off the king-queen demonstration as actual vector arithmetic (Mikolov et al.’s Word2Vec, 2013, four years before the transformer paper), and closes on why modern semantic-search and retrieval-augmented-generation systems are built on this idea.
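
As a rough illustration of the lookup table described above, here is a minimal sketch in plain Python. The four-word vocabulary, the 3-dimensional vectors, and the names `vocab` and `embed` are invented for this example; a real model learns the values in W_E during training and uses a far larger table.

```python
# Toy embedding lookup: a vocab_size x embedding_dim table, one row per token ID.
# The vocabulary and every number below are made up for illustration.
vocab = {"king": 0, "queen": 1, "banana": 2, "the": 3}

W_E = [
    [0.9, 0.8, 0.1],  # row 0: "king"
    [0.9, 0.1, 0.1],  # row 1: "queen"
    [0.0, 0.5, 0.9],  # row 2: "banana"
    [0.1, 0.1, 0.1],  # row 3: "the"
]

def embed(token_ids):
    """Replace each integer token ID with the matching row of W_E."""
    return [W_E[i] for i in token_ids]

token_ids = [vocab[w] for w in ["the", "king"]]
vectors = embed(token_ids)
print(token_ids)  # [3, 0]
print(vectors)    # [[0.1, 0.1, 0.1], [0.9, 0.8, 0.1]]
```

The lookup is plain row indexing: the ID carries no meaning of its own and only selects which row of the table to hand to the rest of the model.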

Where this fits

This is lesson 2 of 3 in Phase 1, How models read text. The previous lesson covered tokenization (how raw text becomes token IDs). This lesson covers the next step: how those token IDs become dense vectors that carry semantic meaning. The next lesson, How models know word order, adds positional information to the vectors this lesson produces. Together, the three Phase 1 lessons trace a sentence from raw text to the sequence of dense, position-aware vectors the model actually operates on. Phase 2 then builds attention on top of those vectors.

Before you start

Prerequisites: the tokenization lesson is the recommended starting point for Track 5. This lesson briefly recaps what a token ID is, but the tokenization lesson gives the fuller picture. Some basic intuition for what a vector is (a list of numbers, or a direction in space) also helps. If “vector” is unfamiliar, the first episode of 3Blue1Brown’s Essence of Linear Algebra (about 10 minutes) is the gentlest entry point.

By the end, you’ll be able to

Explain why integer token IDs alone are not enough for the model and why dense vectors are the next step
Describe the embedding lookup table as a vocab_size × embedding_dim matrix where each row is one token’s vector
Demonstrate how vector arithmetic captures semantic relationships, using the canonical “king − man + woman ≈ queen” example
Distinguish static token embeddings (the same vector for “bank” in “river bank” and “savings bank”) from the contextual meaning attention adds in later layers
Connect embeddings to real-world AI applications, including vector search and retrieval-augmented generation
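
To make the “king − man + woman ≈ queen” objective concrete, here is a minimal sketch with hand-made 2-dimensional vectors. The two axes (roughly “royalty” and “gender”), the numbers, and the names `emb` and `cosine` are assumptions chosen so the arithmetic is easy to follow on paper; trained embeddings have hundreds of dimensions with no such clean labels.

```python
import math

# Hand-made 2-D vectors: (royalty, gender). Invented for illustration only.
emb = {
    "king":   [0.90,  0.80],
    "queen":  [0.95, -0.75],
    "man":    [0.10,  0.90],
    "woman":  [0.05, -0.85],
    "banana": [-0.80, 0.05],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Rank the remaining words by cosine similarity to the target.
# The three query words are left out, as is common practice for this demo.
candidates = {w: v for w, v in emb.items() if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)  # queen
```

The result lands near “queen” rather than exactly on it, which is why the relation is written with ≈ instead of =.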

Time and difficulty

Read time: about 22 minutes
Practice time: about 15 minutes (vector arithmetic on paper, plus a quick embedding-search inspection)
Difficulty: standard