Embeddings: cheatsheet

The one idea that matters

token ID  →  W_E[id]  →  dense vector  →  rest of the model

Token IDs are arbitrary numbers. Embeddings replace them with vectors that carry meaning. Everything downstream operates on the vectors, not on the IDs.

Why integer IDs aren’t enough

Scheme	Vector size	Carries meaning?	Verdict
Token ID alone	1 number	No	Arithmetic on arbitrary IDs is meaningless
One-hot encoding	`vocab_size` numbers	No	Wasteful; “king” and “octopus” equally far apart
Dense embedding	typically 512 to 4096 numbers	Yes	The standard answer
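
To make the one-hot row concrete, here is a minimal sketch in plain Python. The five-word vocabulary and its IDs are made up for illustration. It shows that every pair of distinct one-hot vectors sits at the same distance, so the encoding cannot say that "king" is closer to "queen" than to "octopus".

```python
import math

def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical vocabulary; the IDs are arbitrary labels.
vocab = {"king": 0, "queen": 1, "octopus": 2, "man": 3, "woman": 4}

king = one_hot(vocab["king"], len(vocab))
queen = one_hot(vocab["queen"], len(vocab))
octopus = one_hot(vocab["octopus"], len(vocab))

d_king_queen = euclidean(king, queen)
d_king_octopus = euclidean(king, octopus)

# Both distances are sqrt(2): one-hot vectors carry no notion of similarity.
print(d_king_queen, d_king_octopus)
```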

The embedding matrix `W_E`

Property	Value
Shape	`vocab_size × embedding_dim`
Each row	One token’s dense vector
Lookup	`embedding(id) = W_E[id]`
Initialization	Small random numbers
Training	Gradient descent moves rows toward useful positions
Place in pipeline	Very first operation, before attention
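
The lookup in the table above can be sketched as plain list indexing. The sizes, the seed, and the token IDs below are illustrative assumptions, and real frameworks store `W_E` as a tensor rather than nested lists. The second function shows why the lookup is equivalent to multiplying a one-hot vector by `W_E`, only without the wasted work.

```python
import random

random.seed(0)  # reproducible toy values

vocab_size, embedding_dim = 6, 4  # tiny, illustrative sizes

# W_E: one row per token, initialized with small random numbers.
W_E = [[random.gauss(0.0, 0.02) for _ in range(embedding_dim)]
       for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one row of W_E per token ID."""
    return [W_E[i] for i in token_ids]

def one_hot_times_matrix(token_id):
    """Same result as W_E[token_id], computed as one_hot @ W_E."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[r] * W_E[r][c] for r in range(vocab_size))
            for c in range(embedding_dim)]

token_ids = [3, 0, 5]       # hypothetical tokenizer output
vectors = embed(token_ids)  # three vectors, each of length embedding_dim
```

Training is not shown here; gradient descent would update the rows of `W_E` that were looked up.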

Meaning is geometry

Operation	What it captures
Cosine similarity (cosine of the angle between vectors)	“Alike in meaning”
Vector subtraction (one vector minus another)	“Different in this specific way”

The picture: every word is a point on a high-dimensional map. Similar words cluster; topics form regions; consistent kinds of difference (gender, tense, country-and-capital) point along consistent directions.
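
Both operations fit in a few lines. The sketch below uses hand-made 3-dimensional vectors, not output from a trained model, so the specific numbers mean nothing beyond illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def subtract(a, b):
    """Difference vector a - b: the direction that separates the two."""
    return [x - y for x, y in zip(a, b)]

# Hand-made toy vectors (assumption: not from any real model).
cat = [0.9, 0.8, 0.1]
kitten = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

sim_cat_kitten = cosine_similarity(cat, kitten)  # close to 1
sim_cat_car = cosine_similarity(cat, car)        # much lower

# Scaling a vector does not change its cosine similarity.
sim_scaled = cosine_similarity(cat, [2 * x for x in cat])

diff = subtract(kitten, cat)
```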

The king-queen demonstration

In a well-trained embedding space:

v(king) - v(man) + v(woman) ≈ v(queen)

Caveat	Reality
Cherry-picked	Not every word triple produces a clean parallelogram
Model-dependent	Toy embeddings trained on small data usually show little or none of this
Approximate	The result lands near v(queen), not on it; a nearest-neighbor lookup (usually excluding the input words) recovers the word
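
The arithmetic and the nearest-neighbor step can be sketched as follows. The vectors are hand-built so that the analogy works. That is an assumption made for legibility, and it is exactly the kind of cherry-picking the caveats warn about. The helper name `analogy` is also made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-built toy vectors (assumption: NOT from a trained model).
vectors = {
    "man":     [0.1, 0.9, 0.3],
    "woman":   [0.1, 0.1, 0.3],
    "king":    [0.9, 0.9, 0.4],
    "queen":   [0.9, 0.1, 0.5],
    "octopus": [0.2, 0.5, 0.9],
}

def analogy(a, b, c, vectors):
    """Return the word nearest to v(a) - v(b) + v(c), excluding a, b, c."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    best = max(candidates,
               key=lambda w: cosine_similarity(target, candidates[w]))
    return best, target

best, target = analogy("king", "man", "woman", vectors)
print(best)  # queen
```

Note that `target` is not equal to the stored vector for "queen"; the lookup only finds the closest word among the candidates.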

Why this matters in production

Use case	What it does	Where you see it
Semantic search	Embed query and docs; return closest by cosine similarity	Pinecone, Weaviate, Chroma, pgvector
Retrieval-augmented generation	Same retrieval, then insert the retrieved docs into the prompt	“Chatbot over your documents” features
Embedding APIs	“Text in, vector out” service for downstream similarity	e.g. `embeddings.create()` in the OpenAI SDK; other SDKs offer equivalents
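
The retrieval step shared by semantic search and RAG can be sketched as a brute-force ranking. The document IDs and vectors below are hand-written stand-ins for what an embedding model would return, and `top_k` is a hypothetical helper. Vector databases do the same ranking with approximate indexes so it scales past a handful of documents.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Stand-in vectors (assumption: a real system gets these from a model).
doc_vecs = {
    "refund-policy":   [0.9, 0.1, 0.2],
    "shipping-times":  [0.2, 0.9, 0.1],
    "api-rate-limits": [0.1, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.1]  # pretend embedding of a refund question

results = top_k(query_vec, doc_vecs, k=2)
```

For RAG, the text of the returned documents would then be placed in the prompt ahead of the user's question.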

Practical numbers

Embedding dimension: typically 512 to 4096. Larger models tend to use larger dimensions.
Matrix size: vocabulary count times embedding dimension. A 50k vocab at 4096 dim is about 205 million parameters (50,000 × 4096 = 204.8 million), a sizable share of a small model's parameters.
Cosine similarity range: -1 to +1. In many embedding spaces, above 0.7 reads as “very similar”; below 0 is “actively different in direction.”
Static vs contextual: the embedding matrix gives a static vector per token. Contextual meaning (the difference between “river bank” and “savings bank”) emerges from attention and the layers above, not from W_E.
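
The matrix-size arithmetic above, written out. The 16-bit storage figure is an added assumption (2 bytes per parameter) to give a sense of memory cost.

```python
def embedding_params(vocab_size, embedding_dim):
    """Parameter count of W_E: one value per (token, dimension) pair."""
    return vocab_size * embedding_dim

params = embedding_params(50_000, 4096)

# Assumption: 16-bit weights, i.e. 2 bytes per parameter.
bytes_16bit = params * 2

print(f"{params:,} parameters, {bytes_16bit / 1e6:.1f} MB at 16-bit")
# 204,800,000 parameters, 409.6 MB at 16-bit
```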

Pitfalls to dodge

Pitfall	The reality
Token equals sentence equals document embedding	Different scales; a “paragraph embedding” is not a row of `W_E`
Each dimension has a meaning	Dimensions are entangled; “gender direction” is a difference vector, not one column
Similarity equals semantic equivalence	Captures topical relevance, not truth
Embeddings are neutral	They mirror the training corpus, biases included
Static equals contextual	“Bank” gets the same row in both senses; context comes from attention
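
Two of these pitfalls are easy to see in code. In the sketch below, the three-word vocabulary and the random `W_E` are illustrative assumptions. Mean pooling is shown as one simple way to build a sentence-level vector, not as what any particular embedding API does.

```python
import random

random.seed(0)  # reproducible toy values

vocab = {"river": 0, "bank": 1, "savings": 2}  # hypothetical vocabulary
embedding_dim = 4

# Untrained toy matrix: one row of small random numbers per token.
W_E = [[random.gauss(0.0, 0.02) for _ in range(embedding_dim)]
       for _ in vocab]

def embed(tokens):
    """Static lookup: each token maps to its row of W_E, context ignored."""
    return [W_E[vocab[t]] for t in tokens]

def mean_pool(vectors):
    """Average token vectors into a single vector of the same dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

river_bank = embed(["river", "bank"])
savings_bank = embed(["savings", "bank"])

# "bank" gets the identical row in both phrases.
same_row = river_bank[1] == savings_bank[1]

# A phrase-level vector is derived from rows; it is not itself a row.
phrase_a = mean_pool(river_bank)
phrase_b = mean_pool(savings_bank)
```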

Glossary

Embedding: a dense vector that represents a token (or text) for a model.
Embedding matrix (W_E): the trained lookup table; shape vocab_size × embedding_dim.
Embedding dimension: the vector length per token, typically 512 to 4096.
Cosine similarity: the cosine of the angle between two vectors; how alike they are in direction, regardless of length.
Vector arithmetic: adding and subtracting embeddings to capture relationships (“king - man + woman ≈ queen”).
Static vs contextual embedding: static = W_E[id], same in every context; contextual = what the model produces after attention layers operate on the static input.

Words become vectors.
Meaning becomes geometry.