
References: makemore, the bigram model

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 2:
"The spelled-out intro to language modeling: building makemore"
Creator: Andrej Karpathy
Video: https://www.youtube.com/watch?v=PaCmpygFfXo
Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
Series page: https://karpathy.ai/zero-to-hero.html
License: makemore and the series code are MIT-licensed; the video is under the Standard YouTube License.
This lesson covers Lecture 2, where Karpathy builds the bigram model both by
counting and as a one-layer neural network, and shows the two agree. Clawdemy's
lessons are original prose following the pedagogical arc of this series; we do
not reproduce or transcribe the video or code. The worked count/NLL example
here is ours, built to be checkable by hand. All rights to the original video
and code remain with the creator.
  • The spelled-out intro to language modeling: building makemore, by Andrej Karpathy. The lecture this lesson mirrors. Karpathy builds the count table, generates the first (bad) names, defines the negative log likelihood loss, then rebuilds the same model as a one-layer network and trains it until the two agree. Watching the generated names go from gibberish to roughly name-like, and seeing the trained network’s probabilities match the counts, is the clearest way to make the two-routes-one-answer idea concrete.
  • makemore on GitHub (MIT License). The full project, which grows over the next several lectures from this bigram model up to a transformer. The names.txt dataset and the bigram code are the parts to read after this lesson.

  • Neural Networks: Zero to Hero (full series) and its code repo by Andrej Karpathy. The series this track follows. The next lecture extends makemore from one character of context to several, through a multilayer perceptron with learned character embeddings.
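
For readers who want the counting route in runnable form before opening the repo, here is a minimal sketch of a bigram count table and its average negative log likelihood. It is not the lecture's code: the three-name dataset, the "." boundary marker, and the helper names prob and average_nll are assumptions chosen so the numbers can be checked by hand.

```python
import math
from collections import Counter

# Hypothetical toy dataset (an assumption for illustration);
# makemore's names.txt is far larger.
names = ["ab", "ba", "aa"]

# Wrap each name in a boundary marker "." and count adjacent character pairs.
counts = Counter()
for name in names:
    chars = ["."] + list(name) + ["."]
    for first, second in zip(chars, chars[1:]):
        counts[(first, second)] += 1

# Row totals: how often each character appears as the first of a pair.
row_totals = Counter()
for (first, _), n in counts.items():
    row_totals[first] += n


def prob(first, second):
    """P(second | first) estimated from raw counts, with no smoothing."""
    return counts[(first, second)] / row_totals[first]


def average_nll(words):
    """Average negative log likelihood per bigram over the given words."""
    total, n = 0.0, 0
    for word in words:
        chars = ["."] + list(word) + ["."]
        for first, second in zip(chars, chars[1:]):
            total -= math.log(prob(first, second))
            n += 1
    return total / n


print(prob(".", "a"))      # share of names that start with "a"
print(average_nll(names))  # lower is better; 0 would mean perfect prediction
```

Because this sketch applies no smoothing, a pair that never occurs in the data has probability zero, and calling average_nll on a word containing such a pair raises an error from math.log. Adding a small constant to every count is one common way around that.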

Where this sits in the curriculum.

  • The previous lessons (the autograd engine, building and training a net). The neural-network version of the bigram model is trained with exactly the engine and the gradient-descent loop from Phase 1: one-hot input, a single linear layer, a loss, backward(), and a downhill step. If the training half felt fast, revisit those two lessons for the grounding.

  • How AI reads tokens (AI Foundations track). This lesson works at the character level for clarity; real language models work with tokens (chunks of text). The AI Foundations treatment of tokenization is the bridge from “one character at a time” to “one token at a time,” and the final lesson of this track builds a tokenizer from scratch.
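
The training loop named in the first item above (one-hot input, a single linear layer, a loss, a downhill step) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the lecture's implementation: the gradient is written out by hand instead of calling backward(), and the nine (input, target) pairs, the 3-symbol vocabulary, the learning rate, and the step count are all made up for the example.

```python
import math

# Hypothetical toy data: (input index, target index) bigram pairs over a
# 3-symbol vocabulary, 0 = ".", 1 = "a", 2 = "b". They are the bigrams of
# the made-up names "ab", "ba", "aa".
pairs = [(0, 1), (1, 2), (2, 0),
         (0, 2), (2, 1), (1, 0),
         (0, 1), (1, 1), (1, 0)]
V = 3

# One linear layer with no bias: W[i][j] is the logit for "j follows i".
W = [[0.0] * V for _ in range(V)]


def softmax(row):
    """Turn a row of logits into probabilities (shifted for stability)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def loss_and_grad(W):
    """Average NLL and its gradient with respect to W, derived by hand."""
    grad = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for i, t in pairs:
        # Multiplying a one-hot input by W just selects row i.
        p = softmax(W[i])
        loss -= math.log(p[t])
        for j in range(V):
            # d(NLL)/d(logit j) = p[j] - 1 if j is the target, else p[j].
            grad[i][j] += p[j] - (1.0 if j == t else 0.0)
    n = len(pairs)
    return loss / n, [[g / n for g in row] for row in grad]


lr = 2.0  # assumed learning rate for this toy problem
for step in range(5000):
    loss, grad = loss_and_grad(W)
    for i in range(V):
        for j in range(V):
            W[i][j] -= lr * grad[i][j]  # the downhill step

learned = [softmax(row) for row in W]
for row in learned:
    print([round(p, 3) for p in row])
```

With no smoothing or regularization, the trained rows should approach the normalized counts for the same pairs. Pairs that never occur only approach probability zero, since a softmax cannot output exactly zero, so the match is approximate.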