token ID → W_E[id] → dense vector → rest of the model
Token IDs are arbitrary numbers. Embeddings replace them with vectors that carry meaning. Everything downstream operates on the vectors, not on the IDs.
The picture: every word is a point on a high-dimensional map. Similar words cluster; topics form regions; consistent kinds of difference (gender, tense, country-and-capital) point along consistent directions.
Embedding dimension: typically 512 to 4096. Larger models use larger dimensions.
Matrix size: vocabulary count times embedding dimension. A 50k vocab at 4096 dim is 200 million parameters, a meaningful chunk of any model.
Cosine similarity range: -1 to +1. In many embedding spaces, above 0.7 reads as “very similar”; below 0 is “actively different in direction.”
Static vs contextual: the embedding matrix gives a static vector per token. Contextual meaning (the difference between “river bank” and “savings bank”) emerges from attention and the layers above, not from W_E.
Embedding dimension: the vector length per token, typically 512 to 4096.
Cosine similarity: the angle measure; how alike two vectors are in direction.
Vector arithmetic: adding and subtracting embeddings to capture relationships (“king - man + woman ≈ queen”).
Static vs contextual embedding: static = W_E[id], same in every context; contextual = what the model produces after attention layers operate on the static input.