References: Embeddings: how words become vectors with meaning

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025, Lecture 1
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  YouTube: https://www.youtube.com/watch?v=Ub3GoFaUcds
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  License (lecture video): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
Clawdemy provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lecture remain with Stanford and
the instructors.

Going deeper

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more” pile.

The Illustrated Word2Vec by Jay Alammar. One of the most widely shared visual walkthroughs of word embeddings on the public web. Where this lesson keeps the math minimal, Alammar draws the matrices in full color and shows the training loop step by step. Best companion if you want to see embeddings rendered with diagrams.
TensorFlow Embedding Projector. The interactive 3D viewer we used in the practice section. Loads several pre-trained embedding spaces with built-in nearest-neighbor search. Click any word and see its closest neighbors by cosine similarity. The 3D view is only a projection of the real high-dimensional space, so do not over-read the clusters you see in it. The nearest-neighbor list is computed in the original space and is the more trustworthy signal.
Pinecone’s “What are Vector Embeddings”. Practitioner-level introduction to embeddings as the foundation of vector search. Light on theory, heavy on “here is how to use this in production.” Useful if you are about to build a semantic-search or RAG feature and want a largely tool-agnostic overview (written by a vector-database vendor) before reaching for any specific tool.
Sentence-Transformers library by Nils Reimers. The standard open-source library for sentence-level (rather than token-level) embeddings. Ships pre-trained models for many tasks (semantic similarity, clustering, retrieval) and is the go-to if you want to embed paragraphs or documents for production search.
Stanford CME 295 cheatsheet by the Amidi twins. Their MIT-licensed cheatsheet covers embeddings on its NLP page (alongside RNNs and the attention sequence). The two-column layout puts the formal vector definitions next to the geometric pictures the lesson body builds; useful as a single-page contrast against this lesson’s prose.
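
The nearest-neighbor search mentioned in the Embedding Projector entry above can be sketched in a few lines. This is a minimal illustration, not the Projector's actual implementation: the function names and the 3-dimensional toy vectors are invented here, and real embedding spaces have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, embeddings, k=2):
    """Return the k words closest to `query`, best match first."""
    scores = [
        (word, cosine_similarity(embeddings[query], vec))
        for word, vec in embeddings.items()
        if word != query  # a word is trivially its own nearest neighbor
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Hypothetical toy vectors, made up for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}

neighbors = nearest_neighbors("cat", toy_embeddings, k=2)
print(neighbors)  # "dog" ranks first, well ahead of "car"
```

Cosine similarity compares direction rather than length, which is why it is a common default for ranking neighbors in embedding spaces.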

Adjacent topics

Topics that build on or sit beside this one. Some are upcoming Clawdemy lessons; some are pointers outside the course.

Attention, multi-head attention, the full transformer block. The remaining lessons in our Lecture 1 adaptation. Each one builds on embeddings: attention reads the dense vectors and produces context-mixed versions of them; multi-head attention runs that mechanism in parallel several times per layer; the transformer block wraps attention with feed-forward networks and normalization, all operating on the same dense vectors that embeddings produce.
Contextual embeddings (BERT and after). Static token embeddings (the kind this lesson covers) give the same vector to “bank” in “river bank” and “savings bank”. Contextual embeddings, brought into the mainstream by ELMo and the BERT line of work in 2018 and 2019, produce a different vector for the same word in different contexts; in BERT's case, by passing static embeddings through transformer layers. The intuition shift: “the embedding of a word” became “the embedding of a word in this specific context.” Worth understanding if you call embedding APIs, since modern ones typically expose contextual representations.
Bias in word embeddings. Word2Vec’s “computer programmer minus man plus woman is approximately homemaker” is the canonical example: the embedding matrix is a faithful mirror of the training corpus, biases included. Bolukbasi et al. 2016, “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”, is the foundational paper on debiasing. Worth reading if you build production systems where such bias has consequences.
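
The vector arithmetic behind analogies like the one in the bias entry above can be sketched as follows. This is an illustrative toy, not Word2Vec itself: the `analogy` helper and the hand-built 3-dimensional vectors are assumptions made for the example, chosen so that the arithmetic works out exactly. Trained embeddings only satisfy such relations approximately, and the same arithmetic is what surfaces corpus biases.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three input words, as analogy evaluations usually do.
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Hypothetical vectors, invented for illustration only.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, -1.0],
}

result = analogy("man", "king", "woman", toy)
print(result)  # queen
```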

Original sources

The primary sources this lesson draws from.

“Efficient Estimation of Word Representations in Vector Space”, Mikolov et al., 2013. The original Word2Vec paper. Introduces the skip-gram and continuous-bag-of-words training objectives that produce the embedding matrix. If you read only one paper from this lesson, read this one.
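“Distributed Representations of Words and Phrases and their Compositionality”, Mikolov et al., 2013 (NIPS). The companion paper. It introduces negative sampling and subsampling of frequent words, extends the method to phrases, and reports further vector-arithmetic examples. The king-minus-man-plus-woman demonstration itself is reported in the first Word2Vec paper above and in Mikolov, Yih, and Zweig’s “Linguistic Regularities in Continuous Space Word Representations” (2013); together these papers are the source of the “geometric structure of meaning” claim this lesson rests on.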
“GloVe: Global Vectors for Word Representation”, Pennington, Socher, Manning, 2014. The other major static-embedding paper. Where Word2Vec trains on local context windows, GloVe trains on global word co-occurrence statistics. The resulting embeddings have similar geometric properties, but the optimization that produces them is different.
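
To make “global word co-occurrence statistics” in the GloVe entry above concrete, here is a rough sketch of the kind of table such a method starts from. It is a simplification under stated assumptions: the `cooccurrence_counts` function, the window size, and the one-sentence corpus are invented for illustration, and the counts are unweighted, whereas the GloVe paper down-weights pairs that are farther apart.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each unordered word pair appears within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look back only, so each pair of positions is counted exactly once.
        for j in range(max(0, i - window), i):
            pair = tuple(sorted((tokens[j], word)))
            counts[pair] += 1
    return counts

# Hypothetical one-sentence corpus; a real corpus has billions of tokens.
tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[("sat", "the")])  # 2: "sat" is within two words of both "the" tokens
print(counts[("cat", "mat")])  # 0: never within the window
```

A Word2Vec-style trainer would instead visit each window as a separate training example, without first aggregating the counts over the whole corpus.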

Community discussion

None selected for this lesson. The public discussion of word embeddings has consolidated into the Alammar visual post and the academic papers above; the marginal Reddit or Hacker News thread does not add durable value. If a canonical thread surfaces, it will be added at the next quarterly review.