References: the MLP language model
Source material
Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 3: "Building makemore Part 2: MLP"
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=TCH_1BHY58I
  Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
  Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: makemore and the series code are MIT-licensed; the video is under the standard YouTube license.

This lesson covers Lecture 3, where Karpathy rebuilds makemore as a multilayer perceptron with learned character embeddings, following Bengio et al. (2003). Clawdemy's lessons are original prose following the pedagogical arc of this series; we do not reproduce or transcribe the video or code. The parameter count and the embed-and-concatenate example here are ours, built to be checkable by hand. All rights to the original video and code remain with the creator.
Watch this next
- Building makemore Part 2: MLP by Andrej Karpathy. The lecture this lesson mirrors. Karpathy builds the embedding table, the hidden layer, and the output layer; trains with minibatches; finds a good learning rate by sweeping over candidate values; and adds the train/dev/test split. The standout moment is plotting the learned 2-dimensional embeddings and seeing the vowels cluster together: the clearest possible picture of what “the model learned a useful representation” means.
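The three pieces named above (embedding table, hidden layer, output layer) can be sketched as a single forward pass in plain Python. This is a minimal illustration and not the lecture's code: the sizes (27 characters, 2-dimensional embeddings, 3 characters of context, 100 hidden units), the tanh activation, and the helper names `init_params`, `count_params`, `affine`, and `forward` are all assumptions chosen so the parameter count can be checked by hand.

```python
import math
import random

# Illustrative sizes (assumptions, not taken from the lesson or the lecture code).
VOCAB = 27      # 26 letters plus one boundary token
EMB_DIM = 2     # size of each character embedding
CONTEXT = 3     # number of previous characters fed to the model
HIDDEN = 100    # hidden-layer width


def rand_matrix(rows, cols, rng):
    """Return a rows x cols list of lists filled with small random weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(seed=0):
    """Build the untrained parameters of the sketch model."""
    rng = random.Random(seed)
    return {
        "C": rand_matrix(VOCAB, EMB_DIM, rng),              # embedding table
        "W1": rand_matrix(CONTEXT * EMB_DIM, HIDDEN, rng),  # hidden-layer weights
        "b1": [0.0] * HIDDEN,                               # hidden-layer bias
        "W2": rand_matrix(HIDDEN, VOCAB, rng),              # output-layer weights
        "b2": [0.0] * VOCAB,                                # output-layer bias
    }


def count_params(params):
    """Count every scalar in the matrices and bias vectors."""
    total = 0
    for value in params.values():
        if isinstance(value[0], list):
            total += sum(len(row) for row in value)
        else:
            total += len(value)
    return total


def affine(x, W, b):
    """Compute x @ W + b for a single input vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def softmax(logits):
    """Turn logits into probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def forward(params, context):
    """Map CONTEXT character indices to probabilities over the next character."""
    # Look up each context character's embedding and concatenate them.
    x = [v for idx in context for v in params["C"][idx]]
    # Hidden layer with tanh (an assumed choice of nonlinearity).
    h = [math.tanh(a) for a in affine(x, params["W1"], params["b1"])]
    # Output layer followed by softmax.
    return softmax(affine(h, params["W2"], params["b2"]))


params = init_params()
probs = forward(params, [0, 5, 13])  # three arbitrary character indices
print(count_params(params))          # 54 + 600 + 100 + 2700 + 27 = 3481
print(len(probs))                    # one probability per character: 27
```

With these assumed sizes, the concatenated input has 3 × 2 = 6 numbers, so the hidden layer's weight matrix is 6 × 100. Because the weights are random and untrained, the output probabilities carry no meaning yet; training would adjust all 3481 numbers together, including the embedding table.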
Going deeper
- A Neural Probabilistic Language Model (Bengio, Ducharme, Vincent, Jauvin, 2003) (PDF). The original paper this architecture comes from. It introduced the idea of learning distributed word representations (embeddings) jointly with the language model, two decades before the current wave. Worth a skim to see how early the core idea is.
- makemore on GitHub (MIT License). The project as it grows across the lectures. The MLP version is the part to read after this lesson.
- Neural Networks: Zero to Hero (full series) and its code repo by Andrej Karpathy. The next lecture looks inside this MLP while it trains and fixes what makes deeper networks hard to train.
Adjacent topics
Where this sits in the curriculum.
- The previous lesson (the bigram model). This lesson directly answers the bigram model’s weakness (one character of context) and reuses its core: characters, a probability over the next one, softmax, and negative log likelihood. If the “predict the next character” framing feels fast, that lesson is the grounding.
- Embeddings (AI Foundations track). The learned per-character vectors here are the same idea as the word and token embeddings the AI Foundations track covers: similar items get nearby vectors, and arithmetic on those vectors can capture relationships. That track gives the intuition; this lesson shows where the vectors come from (they are trained, jointly with the rest of the model).