Summary: Embeddings: how words become vectors with meaning

Embeddings are dense vectors that replace token IDs before the model does its math, so the model can work on meaning rather than on arbitrary integers. Without them, the input would give “king” no more in common with “queen” than with “octopus”, and every word relationship would have to be learned from scratch.

With embeddings, meaning becomes geometry: similar words cluster, consistent kinds of difference point along consistent directions, and “king minus man plus woman” lands near “queen” by actual arithmetic.
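
To make “meaning becomes geometry” concrete, here is a minimal sketch using hand-made 3-dimensional toy vectors. The numbers, the axis interpretation, and the helper names `cosine_similarity` and `analogy` are all assumptions chosen for illustration; nothing here comes from a trained model, where vectors are far wider and the result is only approximate.

```python
import math

# Hand-made toy vectors, NOT taken from any trained model.
# Illustrative axes only: (royalty, gender, sea-creature).
toy_vectors = {
    "king":    [0.9,  0.3, 0.0],
    "queen":   [0.9, -0.3, 0.0],
    "man":     [0.1,  0.3, 0.0],
    "woman":   [0.1, -0.3, 0.0],
    "octopus": [0.0,  0.0, 1.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word whose vector is nearest (by cosine) to a - b + c.

    The three input words are excluded from the candidates, which is
    common practice in analogy demos.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# Similar words sit closer together than unrelated ones.
print(round(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]), 2))    # 0.8
print(round(cosine_similarity(toy_vectors["king"], toy_vectors["octopus"]), 2))  # 0.0

# "king minus man plus woman" by plain arithmetic.
print(analogy("king", "man", "woman", toy_vectors))  # queen
```

In this toy space the arithmetic lands exactly on “queen” because the vectors were built that way. In a real embedding space the target only lands near the answer, and excluding the input words from the candidates is part of why published demos look so clean.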

This summary is the scan-it-in-five-minutes version. The full lesson covers why integer IDs aren’t enough, the words-on-a-map intuition, the embedding matrix as a lookup table, the king-queen demonstration paid off, and why many modern semantic-search and retrieval-augmented systems run on this idea.

  • Token IDs alone have no meaning. They are arbitrary positions in the vocabulary. The model would have to learn every word relationship from scratch with no help from the input format.
  • The naive fix fails. A one-hot vector for each token would be vocab_size wide (all zeros except a single 1), and the geometry would carry no information about meaning: every pair of distinct tokens is the same distance apart, so “king” would sit exactly as far from “queen” as it does from “octopus”.
  • The real fix is dense embeddings. Each token becomes a vector of real numbers, typically 512 to 4096 dimensions wide, where similar words land near each other and consistent contrasts produce consistent direction-vectors.
  • Meaning is geometry. Picture every word as a point on a high-dimensional map. Similar words sit close together; unrelated words sit far apart; topics form regions; certain consistent kinds of difference (gender, tense, country-and-capital) point along consistent directions across the map.
  • The mechanism is a lookup table. The embedding matrix W_E has shape vocab_size × embedding_dim, one row per token in the vocabulary. To embed token ID i, you go to row i of the matrix and read out the dense vector. Once the text has been tokenized, embedding lookup is the first operation a transformer language model performs.
  • The matrix is trained, not designed. Rows start as small random numbers when the model is created. Gradient descent during training pushes each row toward a position on the map of meaning that is useful for predicting the words that come next in real text.
  • Two operations make the geometry useful. Cosine similarity (the cosine of the angle between two vectors) captures “alike in meaning.” Vector subtraction captures “different in this specific way.” Together they let you ask similarity and analogy questions of an embedding space using arithmetic alone.
  • The king-and-queen demonstration, due to Mikolov et al. on Word2Vec in 2013, is the canonical example. In well-trained embedding spaces, king - man + woman lands near queen. Three caveats: (1) the famous demos are cherry-picked from many candidates, (2) the result depends on the model and the corpus, (3) the result is approximate (lands near, not on). The point is that meaning has geometric structure, not that AI does flawless analogy.
  • Embeddings connect to attention. The Q, K, V vectors from the attention lesson are produced from the embeddings, by multiplying each token’s embedding by the trained W_Q, W_K, W_V matrices. Embeddings are the first learned representation in the model; everything downstream operates on them.
  • Three real-world implications worth holding in your head. (1) Semantic search: when you search “how to fix a slow laptop”, the system finds documents in the same neighborhood on the map even when none use the words “slow” or “laptop”. This is the engine inside vector databases like Pinecone, Weaviate, Chroma, and pgvector. (2) Retrieval-augmented generation (RAG): the chatbot embeds your question and your documents, retrieves the closest documents by cosine similarity, and stuffs them into the prompt. The retrieval step is pure embedding similarity. (3) Embedding APIs: the “text in, vector out” service AI providers ship for downstream similarity work.
  • Pitfalls worth naming. Token embeddings are not the same as sentence or document embeddings (a token embedding represents one token; a sentence or document embedding summarizes a whole span of text in a single vector); embedding dimensions do not have human-readable meaning (the “gender direction” is a difference vector, not one component); a static embedding for “bank” is the same in “river bank” and “savings bank” (contextual meaning emerges from the layers above); embedding similarity captures topical relevance, not truth; embeddings reflect biases in their training data.
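
Two of the points above, the lookup table and similarity-based retrieval, can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the four-word vocabulary, the 3-dimensional width, the untrained random matrix, the hand-written document vectors standing in for an embedding model's output, and the helper names `embed`, `one_hot_times_matrix`, and `retrieve`.

```python
import math
import random

# Hypothetical tiny vocabulary; real ones hold tens of thousands of tokens.
vocab = ["king", "queen", "octopus", "laptop"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
vocab_size, embedding_dim = len(vocab), 3

# W_E: one row per token, starting as small random numbers (untrained).
rng = random.Random(0)
W_E = [[rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
       for _ in range(vocab_size)]

def embed(token_id):
    """Embedding lookup: read out row token_id of W_E."""
    return W_E[token_id]

def one_hot(token_id):
    """vocab_size-wide vector: all zeros except a single 1."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

def one_hot_times_matrix(token_id):
    """The same lookup written as a one-hot vector times W_E."""
    h = one_hot(token_id)
    return [sum(h[i] * W_E[i][d] for i in range(vocab_size))
            for d in range(embedding_dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hand-written stand-ins for what an embedding model would return.
doc_vectors = {
    "speed up an old computer": [0.9, 0.1, 0.0],
    "octopus anatomy":          [0.0, 0.1, 0.9],
    "bread recipes":            [0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # stand-in for "how to fix a slow laptop"

def retrieve(query, docs, k=1):
    """Return the k document names closest to the query by cosine similarity."""
    ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                    reverse=True)
    return ranked[:k]

king, queen, octopus = (token_to_id[t] for t in ("king", "queen", "octopus"))

# Row lookup and one-hot multiplication give the same vector.
print(embed(queen) == one_hot_times_matrix(queen))  # True

# One-hot geometry carries no meaning: every pair is equally far apart.
print(euclidean(one_hot(king), one_hot(queen)) ==
      euclidean(one_hot(king), one_hot(octopus)))   # True

# Retrieval picks the nearest document even with no shared words.
print(retrieve(query_vector, doc_vectors, k=1))     # ['speed up an old computer']
```

The row lookup is the cheap way to compute what the one-hot multiplication would produce, which is why the wide one-hot vector never has to be built. The retrieval function shows only the ranking step; a real system would get both the query and document vectors from the same embedding model and would usually use an index rather than sorting every document.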

Before this lesson, “embedding” was a piece of jargon you saw in tooling docs and AI announcements. Now it is a specific operation: one row of a learned matrix, replacing a token’s integer ID, carrying meaning into the model as geometry. When you read about a vector database, build a RAG pipeline, call an embeddings API directly, or wonder why semantic search returned a document that did not share any words with your query, you can reason about what the math is actually doing instead of taking it on faith. The next lesson, on multi-head attention, picks up where this and the attention lesson stopped: now that every token has a meaningful vector, how does the model run many parallel attention computations on those vectors per layer to capture different kinds of context at the same time?

Words become vectors.
Meaning becomes geometry.