Skip to content

Cheatsheet: Embeddings: how words become vectors with meaning

token ID → W_E[id] → dense vector → rest of the model

Token IDs are arbitrary numbers. Embeddings replace them with vectors that carry meaning. Everything downstream operates on the vectors, not on the IDs.

SchemeVector sizeCarries meaning?Verdict
Token ID alone1 numberNoCannot do matrix math
One-hot encodingvocab_size numbersNoWasteful; “king” and “octopus” equally far apart
Dense embeddingtypically 512 to 4096 numbersYesThe standard answer
PropertyValue
Shapevocab_size × embedding_dim
Each rowOne token’s dense vector
Lookupembedding(id) = W_E[id]
InitializationSmall random numbers
TrainingGradient descent moves rows toward useful positions
Place in pipelineVery first operation, before attention
OperationWhat it captures
Cosine similarity (angle between vectors)“Alike in meaning”
Vector subtraction (one vector minus another)“Different in this specific way”

The picture: every word is a point on a high-dimensional map. Similar words cluster; topics form regions; consistent kinds of difference (gender, tense, country-and-capital) point along consistent directions.

In a well-trained embedding space:

v(king) - v(man) + v(woman) ≈ v(queen)
CaveatReality
Cherry-pickedNot every word triple produces a clean parallelogram
Model-dependentToy embeddings on small data show none of this
ApproximateLands near, not on; nearest-neighbor lookup makes it legible
Use caseWhat it doesWhere you see it
Semantic searchEmbed query and docs; return closest by cosine similarityPinecone, Weaviate, Chroma, pgvector
Retrieval-augmented generationSame retrieval, plus stuffing docs into the prompt”Chatbot over your documents” features
Embedding APIs”Text in, vector out” service for downstream similarityembeddings.create() in any AI SDK
  • Embedding dimension: typically 512 to 4096. Larger models use larger dimensions.
  • Matrix size: vocabulary count times embedding dimension. A 50k vocab at 4096 dim is 200 million parameters, a meaningful chunk of any model.
  • Cosine similarity range: -1 to +1. In many embedding spaces, above 0.7 reads as “very similar”; below 0 is “actively different in direction.”
  • Static vs contextual: the embedding matrix gives a static vector per token. Contextual meaning (the difference between “river bank” and “savings bank”) emerges from attention and the layers above, not from W_E.
PitfallThe reality
Token equals sentence equals document embeddingDifferent scales; a “paragraph embedding” is not a row of W_E
Each dimension has a meaningDimensions are entangled; “gender direction” is a difference vector, not one column
Similarity equals semantic equivalenceCaptures topical relevance, not truth
Embeddings are neutralThey mirror the training corpus, biases included
Static equals contextual”Bank” gets the same row in both senses; context comes from attention
  • Embedding: a dense vector that represents a token (or text) for a model.
  • Embedding matrix (W_E): the trained lookup table; shape vocab_size × embedding_dim.
  • Embedding dimension: the vector length per token, typically 512 to 4096.
  • Cosine similarity: the angle measure; how alike two vectors are in direction.
  • Vector arithmetic: adding and subtracting embeddings to capture relationships (“king - man + woman ≈ queen”).
  • Static vs contextual embedding: static = W_E[id], same in every context; contextual = what the model produces after attention layers operate on the static input.

Words become vectors.
Meaning becomes geometry.