
References: the MLP language model

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 3:
"Building makemore Part 2: MLP"
Creator: Andrej Karpathy
Video: https://www.youtube.com/watch?v=TCH_1BHY58I
Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
Series page: https://karpathy.ai/zero-to-hero.html
License: makemore and the series code are MIT-licensed; the video is under the standard YouTube license.
This lesson covers Lecture 3, where Karpathy rebuilds makemore as a multilayer
perceptron with learned character embeddings, following Bengio et al. (2003).
Clawdemy's lessons are original prose following the pedagogical arc of this
series; we do not reproduce or transcribe the video or code. The parameter
count and the embed-and-concatenate example here are ours, built to be checkable
by hand. All rights to the original video and code remain with the creator.
  • “Building makemore Part 2: MLP” by Andrej Karpathy. The lecture this lesson mirrors. Karpathy builds the embedding table, the hidden layer, and the output layer; trains with minibatches; finds a good learning rate by sweeping over candidate values; and adds the train/dev/test split. The standout moment is plotting the learned 2-dimensional embeddings and seeing the vowels cluster together, the clearest possible picture of what “the model learned a useful representation” means.
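
The pieces named above (embedding table, hidden layer, output layer) can be sketched as a single forward pass. This is a minimal illustration in plain Python, not the lecture's code. The 2-dimensional embeddings come from the description above; the other sizes (27 characters, 3 characters of context, 100 hidden units), the tanh nonlinearity, and every name in the block are assumptions chosen so the parameter count can be checked by hand.

```python
import math
import random

# Assumed sizes for illustration only.
VOCAB_SIZE = 27    # 26 letters plus one boundary token (assumption)
CONTEXT_LEN = 3    # characters of context fed to the model (assumption)
EMBED_DIM = 2      # size of each learned character vector
HIDDEN = 100       # hidden-layer width (assumption)

rng = random.Random(0)


def matrix(rows, cols):
    """Return a rows x cols nested list of small random weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


C = matrix(VOCAB_SIZE, EMBED_DIM)             # embedding table, one row per character
W1 = matrix(CONTEXT_LEN * EMBED_DIM, HIDDEN)  # hidden-layer weights
b1 = [0.0] * HIDDEN                           # hidden-layer biases
W2 = matrix(HIDDEN, VOCAB_SIZE)               # output-layer weights
b2 = [0.0] * VOCAB_SIZE                       # output-layer biases


def affine(x, W, b):
    """Compute x @ W + b for a single input vector x."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def forward(context):
    """Map CONTEXT_LEN character indices to next-character probabilities."""
    # Embed and concatenate: look up each index, then join the vectors end to end.
    x = [value for idx in context for value in C[idx]]
    h = [math.tanh(a) for a in affine(x, W1, b1)]  # tanh is an assumed choice
    logits = affine(h, W2, b2)
    # Softmax, shifted by the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def count_parameters():
    """Sum the entries of every weight matrix and bias vector."""
    return (VOCAB_SIZE * EMBED_DIM
            + CONTEXT_LEN * EMBED_DIM * HIDDEN + HIDDEN
            + HIDDEN * VOCAB_SIZE + VOCAB_SIZE)


probs = forward([0, 5, 13])      # three arbitrary character indices
n_params = count_parameters()
print(len(probs), n_params)
```

With these assumed sizes the count works out by hand as 27·2 + 6·100 + 100 + 100·27 + 27 = 3,481.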

Where this sits in the curriculum.

  • The previous lesson (the bigram model). This lesson directly answers the bigram model’s weakness (one character of context) and reuses its core: characters, a probability distribution over the next one, softmax, and negative log likelihood. If the “predict the next character” framing feels rushed here, that lesson provides the grounding.

  • Embeddings (AI Foundations track). The learned per-character vectors here are the same idea as the word and token embeddings the AI Foundations track covers: similar items get nearby vectors, and arithmetic on those vectors can capture relationships. That track gives the intuition; this lesson shows where the vectors come from (they are trained, jointly with the rest of the model).
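
The core carried over from the bigram lesson, softmax followed by negative log likelihood, can be written out in a few lines. The sketch below is an illustration under assumptions, not code from either lesson: the vocabulary size of 27 and the names `softmax` and `mean_nll` are hypothetical choices made for this example.

```python
import math

VOCAB_SIZE = 27  # assumed: 26 letters plus one boundary token


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(batch_logits, targets):
    """Average negative log likelihood of the correct next characters."""
    losses = []
    for logits, target in zip(batch_logits, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# A model that knows nothing assigns equal scores to every character,
# so its loss is log(VOCAB_SIZE) regardless of the targets.
uniform_loss = mean_nll([[0.0] * VOCAB_SIZE] * 2, [3, 10])

# A model that scores the correct character higher gets a lower loss.
confident = [0.0] * VOCAB_SIZE
confident[3] = 5.0
confident_loss = mean_nll([confident], [3])

print(uniform_loss, confident_loss)
```

The uniform case is a useful baseline: under the assumed 27-character vocabulary, any trained model should beat a loss of ln 27, which is about 3.30.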