References: the MLP language model

Source material

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 3:
  "Building makemore Part 2: MLP"
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=TCH_1BHY58I
  Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
  Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: makemore and the series code are MIT-licensed; the video is under the Standard YouTube License.

This lesson covers Lecture 3, where Karpathy rebuilds makemore as a multilayer
perceptron with learned character embeddings, following Bengio et al. (2003).
Clawdemy's lessons are original prose following the pedagogical arc of this
series; we do not reproduce or transcribe the video or code. The parameter
count and the embed-and-concatenate example here are ours, built to be checkable
by hand. All rights to the original video and code remain with the creator.

Watch this next

Building makemore Part 2: MLP, by Andrej Karpathy. The lecture this lesson mirrors. Karpathy builds the embedding table, the hidden layer, and the output layer; trains with minibatches; finds a good learning rate by sweeping over candidate values; and adds the train/dev/test split. The standout moment is plotting the learned 2-dimensional embeddings and seeing the vowels cluster together, the clearest possible picture of what “the model learned a useful representation” means.
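
To make the three pieces named above concrete before watching, here is a minimal sketch of one forward pass: embedding lookup, concatenation, a hidden layer, and a softmax output. It is ours, not the lecture's code. The sizes (a 27-symbol vocabulary, 3 characters of context, 4 hidden units), the tanh nonlinearity, and the target index are all assumptions chosen to keep the numbers small, and the weights are random rather than trained.

```python
import math
import random

random.seed(0)

# All sizes below are illustrative assumptions, not the lesson's or the lecture's.
VOCAB = 27     # assumed: 26 letters plus one boundary symbol
CONTEXT = 3    # assumed number of previous characters fed to the model
EMB_DIM = 2    # 2-dimensional embeddings, small enough to plot
HIDDEN = 4     # assumed hidden-layer width

def rand_matrix(rows, cols):
    """Small random weights; a trained model would learn these."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(VOCAB, EMB_DIM)               # embedding table, one row per character
W1 = rand_matrix(CONTEXT * EMB_DIM, HIDDEN)   # hidden layer weights
b1 = [0.0] * HIDDEN
W2 = rand_matrix(HIDDEN, VOCAB)               # output layer weights
b2 = [0.0] * VOCAB

def forward(context):
    """Return a probability for each possible next character."""
    # Embed: look up one row of C per context character, then concatenate.
    x = [value for idx in context for value in C[idx]]
    # Hidden layer with a tanh nonlinearity (an assumption of this sketch).
    h = [math.tanh(sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
         for j in range(HIDDEN)]
    # Output layer: one logit per character in the vocabulary.
    logits = [sum(h[j] * W2[j][k] for j in range(HIDDEN)) + b2[k]
              for k in range(VOCAB)]
    # Softmax, shifted by the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward([0, 5, 13])      # three hypothetical character indices
loss = -math.log(probs[13])      # negative log likelihood for a hypothetical target

# Parameter count for these sizes: 54 + 24 + 4 + 108 + 27 = 217.
n_params = (VOCAB * EMB_DIM + CONTEXT * EMB_DIM * HIDDEN + HIDDEN
            + HIDDEN * VOCAB + VOCAB)
```

Training would then adjust C, W1, b1, W2, and b2 together to lower the loss, which is how the embedding rows end up meaningful.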

Going deeper

A Neural Probabilistic Language Model (Bengio, Ducharme, Vincent, Jauvin, 2003) (PDF). The original paper this architecture comes from. It introduced the idea of learning distributed word representations (embeddings) jointly with the language model, roughly two decades before the current wave of language models. Worth a skim to see how early the core idea appeared.
makemore on GitHub (MIT License). The project as it grows across the lectures. The MLP version is the part to read after this lesson.
Neural Networks: Zero to Hero (full series) and its code repo by Andrej Karpathy. The next lecture looks inside this MLP while it trains and fixes what makes deeper networks hard to train.

Adjacent topics

Where this sits in the curriculum.

The previous lesson (the bigram model). This lesson directly answers the bigram model’s weakness (only one character of context) and reuses its core: characters, a probability distribution over the next one, softmax, and negative log likelihood. If the “predict the next character” framing feels rushed, that lesson provides the grounding.
Embeddings (AI Foundations track). The learned per-character vectors here are the same idea as the word and token embeddings the AI Foundations track covers: similar items get nearby vectors, and arithmetic on those vectors can capture relationships. That track gives the intuition; this lesson shows where the vectors come from (they are trained, jointly with the rest of the model).