Practice: Embeddings: how words become vectors with meaning

Self-check

Seven short questions. Try to answer each one in your head (or on paper) before opening the collapsible. Active retrieval is where the learning sticks; rereading is comfortable but does much less.

1. In one sentence, what is the problem with using only token IDs as the model’s input?

Show answer

Token IDs are arbitrary numbers; they carry no information about meaning, similarity, or structure. The model would have to learn every relationship between every word pair from scratch with no help from the input format.

2. What is the shape of the embedding matrix W_E, and what does each row represent?

Show answer

Shape is vocab_size × embedding_dim, typically tens of thousands of rows by hundreds to thousands of columns (a common combination is 50,000 by 768). Each row is one token’s dense embedding vector, the position of that token on the model’s high-dimensional map of meaning.
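A minimal sketch of that lookup, assuming a hypothetical 5-token vocabulary and 4 dimensions so the matrix fits on screen. The names `vocab`, `W_E`, and `embed` are illustrative, and the random values stand in for trained weights.

```python
import random

random.seed(0)  # reproducible made-up values

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"the": 0, "king": 1, "queen": 2, "man": 3, "woman": 4}
vocab_size = len(vocab)
embedding_dim = 4

# W_E has shape vocab_size x embedding_dim: one row per token.
W_E = [[random.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
       for _ in range(vocab_size)]

def embed(token):
    """Return the row of W_E for this token (a plain index lookup)."""
    return W_E[vocab[token]]

print(len(W_E), len(W_E[0]))    # 5 4
print(embed("king") is W_E[1])  # True
```

The lookup itself does no arithmetic: the token ID is only used as a row index.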

3. The lesson says “meaning is geometry.” What two geometric operations make that statement actionable?

Show answer

Cosine similarity (the cosine of the angle between two vectors) measures “alike in meaning.” Vector subtraction captures “different in this specific way.” Together they let you ask similarity and analogy questions of an embedding space using arithmetic alone.

4. The “king minus man plus woman is approximately queen” example is famously cited. Name one thing it does demonstrate and one thing it does NOT demonstrate.

Show answer

Does demonstrate: that meaning has geometric structure in well-trained embedding spaces. The vector from “man” to “woman” encodes a specific kind of difference, and that same direction often separates other gendered pairs like “king” and “queen.”

Does not demonstrate: that AI does flawless analogy. The famous demos are cherry-picked from many candidates, the result is approximate (lands near, not exactly on, the target word), and a toy embedding trained on a small corpus may show none of these relationships.

5. In a static token embedding (one row of W_E), is the embedding for “bank” different in “river bank” versus “savings bank”?

Show answer

No. The static token embedding is the same row of W_E regardless of context; “bank” maps to one row of the matrix. The contextual meaning (river versus savings) emerges from how the model uses the embedding in the layers above, especially attention. Static embedding = topical seed; contextual meaning = what the layers do with it.
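A small sketch of why the answer is no, assuming a hypothetical three-word table with made-up 2D vectors (`table` and `embed_sentence` are illustrative names, and splitting on spaces stands in for a real tokenizer):

```python
# Hypothetical static embedding table; values are invented for illustration.
table = {
    "river":   [0.9, 0.1],
    "savings": [0.1, 0.9],
    "bank":    [0.5, 0.5],
}

def embed_sentence(text):
    """Look up each token's static vector; neighboring words are not consulted."""
    return [table[token] for token in text.split()]

river_bank = embed_sentence("river bank")
savings_bank = embed_sentence("savings bank")

# "bank" gets the identical vector in both phrases.
print(river_bank[1] == savings_bank[1])  # True
# Only the neighboring vectors differ; later layers use them to disambiguate.
print(river_bank[0] == savings_bank[0])  # False
```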

6. Name three production AI workflows that are built on embedding similarity.

Show answer

(1) Semantic search, where queries and documents are embedded and the closest matches by cosine similarity are returned. (2) Retrieval-augmented generation (RAG), where the same similarity step retrieves relevant documents that are then added to the prompt. (3) Embedding APIs, the raw “text in, vector out” service that AI providers ship for downstream similarity work. All three are the same idea, just exposed at different levels of the stack.
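The shared similarity step can be sketched in a few lines. Everything here is an assumption made for illustration: the three document titles, their 3D vectors, and the query vector are invented, where a real system would get the vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up document embeddings (title -> vector).
docs = {
    "How to reset your password":  [0.9, 0.1, 0.0],
    "Quarterly revenue report":    [0.1, 0.9, 0.1],
    "Recovering a locked account": [0.8, 0.2, 0.1],
}

query = "I forgot my login"
query_vec = [1.0, 0.0, 0.1]  # stand-in for the query's embedding

def search(query_vec, docs, k=2):
    """Semantic search: return the k titles closest to the query by cosine similarity."""
    ranked = sorted(docs, key=lambda title: cosine(query_vec, docs[title]), reverse=True)
    return ranked[:k]

top = search(query_vec, docs)
print(top)  # ['How to reset your password', 'Recovering a locked account']

# RAG reuses the same retrieval, then places the results in the prompt.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + query
print(prompt)
```

Note that neither retrieved title shares a word with the query; the match comes entirely from the vectors.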

7. Fill in the blank. “An ID is just a ______. An embedding is what that ______ means.”

Show answer

Number and number. The whole conceptual move of this lesson is replacing arbitrary integers with vectors that carry semantic structure.

Try it yourself: vector arithmetic on a toy embedding space

This is the king-minus-man-plus-woman trick, made small enough that you can do every step by hand. About 15 minutes with a pen.

Side effects: none. This is paper arithmetic on numbers we made up. No API calls, no tooling, no costs.

Setup: imagine a tiny embedding space with just 2 dimensions, populated with the following vectors. (Real embeddings have hundreds of dimensions; the geometry behaves the same way, just hard to draw.)

v(man)      = [1, 0]
v(woman)    = [1, 2]
v(king)     = [3, 1]

candidates we'll compare against:
v(queen)    = [3, 3]
v(princess) = [2, 2]
v(prince)   = [3, 1.5]
v(carrot)   = [-2, -1]

Steps:

1. Compute the analogy vector v(king) - v(man) + v(woman). Write down the resulting 2D vector.
2. Compute cosine similarity between your result and each of the four candidate vectors. The cosine similarity of two vectors a and b is (a · b) / (|a| × |b|), where a · b is the dot product and |a| is the length (the square root of the sum of squares of the components). Round each similarity to three decimals.
3. Rank the candidates from most similar to least similar. Which one is the analogy vector closest to?

Expected outcome:

Step 1: v(king) - v(man) + v(woman) = [3, 1] - [1, 0] + [1, 2] = [3, 3].
Step 2: cosine similarities, with the analogy vector [3, 3]:
- queen [3, 3]: dot product = 18, lengths = √18 × √18 = 18, similarity = 1.000
- princess [2, 2]: dot product = 12, lengths = √18 × √8 = √144 = 12, similarity = 1.000
- prince [3, 1.5]: dot product = 13.5, lengths = √18 × √11.25 ≈ 4.243 × 3.354 ≈ 14.230, similarity ≈ 0.949
- carrot [-2, -1]: dot product = -9, lengths = √18 × √5 ≈ 4.243 × 2.236 ≈ 9.487, similarity ≈ -0.949
Step 3: queen and princess tie for first (cosine similarity 1.000 each, since both lie on the same line through the origin), prince close behind, carrot far away on the opposite side.

If your numbers match, you have just done the same arithmetic that makes vector search work in production. Real embeddings have hundreds of dimensions, and real systems compare against millions of candidates rather than four; the math is identical.
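To check your pen-and-paper work, the same three steps can be run as a short script. This is a sketch using only the toy vectors from the setup; the helper name `cosine` is our own.

```python
import math

def cosine(a, b):
    """(a . b) / (|a| * |b|), as defined in step 2."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

man, woman, king = [1, 0], [1, 2], [3, 1]
candidates = {
    "queen":    [3, 3],
    "princess": [2, 2],
    "prince":   [3, 1.5],
    "carrot":   [-2, -1],
}

# Step 1: king - man + woman, component by component.
analogy = [k - m + w for k, m, w in zip(king, man, woman)]
print(analogy)  # [3, 3]

# Step 2: cosine similarity against each candidate, rounded to three decimals.
scores = {name: round(cosine(analogy, vec), 3) for name, vec in candidates.items()}

# Step 3: rank from most to least similar.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, score)
# queen 1.0
# princess 1.0
# prince 0.949
# carrot -0.949
```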

A thinking question: queen and princess both got cosine similarity 1.000. Is that a real result, or an artifact of our toy setup?

Show answer

It is an artifact of the toy setup. Cosine similarity measures only the angle between two vectors, in any number of dimensions; any two vectors that lie on the same ray from the origin (here, both pointing up-and-to-the-right at the same slope) get a similarity of 1.0 even if their lengths differ. With only 2 dimensions and round, hand-picked numbers, landing on the same ray is easy. In real embeddings with hundreds of dimensions, two different words whose vectors point in exactly the same direction are vanishingly rare, and the candidates separate cleanly. Real-world systems also often combine cosine similarity with other ranking signals (frequency, recency, document quality) to break the kinds of ties this toy demonstrates.
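A quick sketch of the tie. Cosine similarity ignores length, so any rescaled copy of a vector scores 1.0 against the original, whatever the number of dimensions. Euclidean distance appears below only as one illustrative length-sensitive measure that separates the two candidates; the lesson does not prescribe it, and the 5D vector is made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    """Straight-line distance; unlike cosine, it is sensitive to length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

analogy, queen, princess = [3, 3], [3, 3], [2, 2]

# Same ray from the origin, so cosine cannot tell them apart.
print(round(cosine(analogy, queen), 3), round(cosine(analogy, princess), 3))  # 1.0 1.0

# A length-sensitive measure separates them.
print(euclidean(analogy, queen), round(euclidean(analogy, princess), 3))  # 0.0 1.414

# The same effect in 5 dimensions: scaling a vector leaves its direction unchanged.
v = [0.2, -1.0, 0.5, 3.0, 0.7]
scaled = [2.5 * x for x in v]
print(round(cosine(v, scaled), 3))  # 1.0
```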

Inspect a real embedding space (optional)

If you want to see embeddings on a corpus rather than on toy vectors, the TensorFlow Embedding Projector loads several pre-trained embedding spaces into a 3D viewer with built-in nearest-neighbor search. Click a word and the panel on the right shows its closest neighbors by cosine similarity. The neighbors are usually unsurprising in a satisfying way: “doctor” pulls up “physician, hospital, surgeon”; “Paris” pulls up “France, London, Berlin.” The 3D view is a projection, not the real high-dimensional space, so do not over-read clusters; the nearest-neighbor list is the trustworthy signal.

This takes 5 minutes if you stop after a few searches. No account required, no costs.

Flashcards

Twelve cards. Click any card to reveal the answer. Use the Print flashcards button to lay out the full set as one card per page, ready to print or save as a PDF for offline review.

Q. What does an embedding do, in one sentence?

It replaces a token’s arbitrary integer ID with a dense vector that carries meaning, before the rest of the model touches the input.

Q. What is the shape of the embedding matrix?

vocab_size × embedding_dim. Vocab is in the tens to hundreds of thousands; dim is typically 512 to 4096. Each row is one token’s vector.

Q. How do the embeddings at initialization differ from the embeddings after training?

Random small numbers at initialization. After training, each row has been pushed toward a position where words with similar roles end up with similar vectors and consistent contrasts produce consistent direction-vectors.

Q. What is cosine similarity, in plain terms?

A measure of the angle between two vectors. Ranges from -1 (opposite) to +1 (same direction). Standard tool for “are these two embeddings alike in meaning?”

Q. What does "king minus man plus woman is approximately queen" demonstrate?

That meaning has geometric structure in a well-trained embedding space. The same direction-vector that separates “man” from “woman” often also separates “king” from “queen.” Vector arithmetic captures the relationship.

Q. What does that demonstration NOT demonstrate?

That AI does flawless analogy. The famous demos are cherry-picked, the result is approximate (lands near, not on), and a toy embedding trained on a small corpus may show none of these patterns. The pattern depends on the model and the corpus.

Q. Why is the embedding matrix one of the largest single objects in a modern model?

For a 50,000-token vocabulary at 4,096 embedding dimensions, the matrix has about 205 million parameters (50,000 × 4,096 = 204,800,000). It is doing the load-bearing work of turning discrete symbols into geometric meaning, so it earns the space.
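The count is a single multiplication. The memory figure in this sketch assumes 2 bytes per parameter (16-bit floats), which is our assumption and not something the lesson states.

```python
vocab_size = 50_000
embedding_dim = 4_096

# One parameter per cell of the vocab_size x embedding_dim matrix.
params = vocab_size * embedding_dim
print(f"{params:,}")  # 204,800,000

# Assumption: 2 bytes per parameter (16-bit floats).
megabytes = params * 2 / 1e6
print(megabytes, "MB")  # 409.6 MB
```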

Q. How do attention's Q, K, V vectors relate to embeddings?

They are produced from the embeddings, not from the token IDs directly. Each token’s embedding is multiplied by the trained W_Q, W_K, W_V matrices to get its query, key, and value vectors.
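A toy sketch of that multiplication, assuming a 2D embedding and hand-picked 2 × 2 matrices chosen so the results are easy to verify by eye. Real W_Q, W_K, and W_V are learned and much larger; `project` is an illustrative helper name.

```python
def project(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

embedding = [1.0, 2.0]  # made-up embedding for one token

# Made-up weight matrices, not trained values.
W_Q = [[1.0, 0.0], [0.0, 1.0]]  # identity: leaves the vector unchanged
W_K = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two components
W_V = [[2.0, 0.0], [0.0, 2.0]]  # doubles each component

q = project(embedding, W_Q)
k = project(embedding, W_K)
v = project(embedding, W_V)

print(q)  # [1.0, 2.0]
print(k)  # [2.0, 1.0]
print(v)  # [2.0, 4.0]
```

The token ID never appears in this step: all three vectors are functions of the embedding alone.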

Q. What is semantic search?

Search that finds the documents whose embeddings are closest to the query’s embedding by cosine similarity, rather than matching literal words. The retrieved documents do not have to share any words with the query, just the meaning.

Q. What is RAG, in one sentence?

Retrieval-augmented generation: a chatbot embeds the user’s question and the document corpus, retrieves the closest documents by embedding similarity, and includes them in the prompt before generating an answer. The retrieval step is pure embedding similarity.

Q. Is the static token embedding for "bank" the same in "river bank" and "savings bank"?

Yes, it is the same row of the embedding matrix in both cases. Contextual meaning emerges from the layers above (especially attention), not from the static embedding alone.

Q. What is the one-sentence takeaway from this lesson?

Words become vectors. Meaning becomes geometry.