Cheatsheet: Building an autograd engine: micrograd
The Value object
Every number is wrapped in a Value that carries:
| Field | Holds |
|---|---|
| data | the actual number (computed forward) |
| op | the operation that produced it (+, *, tanh, or none for a leaf) |
| children | the input Values it was produced from |
| grad | the derivative of the loss with respect to this value (computed backward; starts at 0) |
Running an expression forward builds a computational graph of these nodes. That is the forward pass.
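
As a rough illustration of the table, here is a minimal Python sketch of such a node. The attribute names simply mirror the table's field names (micrograd itself may name them differently), and only `+` and `*` are wired up, so treat it as an assumption-laden sketch, not the reference implementation.

```python
class Value:
    """One node in the computational graph (fields follow the table above)."""

    def __init__(self, data, children=(), op=None):
        self.data = data          # the actual number, computed forward
        self.op = op              # operation that produced it; None for a leaf
        self.children = children  # input Values this node was produced from
        self.grad = 0.0           # dL/d(this value); filled in by the backward pass

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")

    def __repr__(self):
        return f"Value(data={self.data}, op={self.op})"


# Forward pass: evaluating the expression builds the graph as a side effect.
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
e = a * b
d = e + c
print(d)                   # Value(data=4.0, op=+)
print(d.children[0] is e)  # True
```

Every `grad` is still 0 at this point; nothing has been propagated backward yet.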
Local derivatives (known per operation)
| Operation | Local derivative | In words |
|---|---|---|
| d = e + c | 1 for each input | addition passes the gradient through unchanged |
| e = a * b | b to a, a to b | each input gets the other input’s value |
| t = tanh(x) | 1 - tanh(x)^2 | the squashing nonlinearity that makes a neuron |
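
The three rules in the table can be sanity-checked against a finite-difference estimate. The function names below (`local_grads_add`, `local_grads_mul`, `local_grad_tanh`, `numeric`) are hypothetical helpers invented for this sketch, and the inputs are arbitrary.

```python
import math

def local_grads_add(e, c):
    # d = e + c: the derivative is 1 with respect to each input
    return 1.0, 1.0

def local_grads_mul(a, b):
    # e = a * b: each input's derivative is the other input's value
    return b, a

def local_grad_tanh(x):
    # t = tanh(x): the derivative is 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

def numeric(f, x, h=1e-6):
    # central finite difference, used only as a sanity check
    return (f(x + h) - f(x - h)) / (2 * h)

a, b, x = 2.0, -3.0, 0.5
print(local_grads_mul(a, b))                                   # (-3.0, 2.0)
print(abs(local_grad_tanh(x) - numeric(math.tanh, x)) < 1e-6)  # True
```

An autograd engine only ever needs these per-operation rules; everything else is bookkeeping over the graph.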
Backpropagation recipe
- Seed the output: `grad of L = 1` (dL/dL = 1).
- Walk the graph backward in reverse topological order.
- At each node, `grad of child = (grad of node) × (local derivative)`.
- A node feeding two places sums the gradients arriving from both.
This is the chain rule (rates multiply through a composition) applied node by node. When done, every node holds dL/d(itself).
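
The recipe can be sketched in a few dozen lines of Python. This is one possible arrangement, not necessarily how micrograd lays it out: each operation stores a hypothetical `_backward` closure that applies its local derivative, and `backward()` builds a topological order with a depth-first search before walking it in reverse.

```python
import math

class Value:
    def __init__(self, data, children=(), op=None):
        self.data = data
        self.op = op
        self.children = children
        self.grad = 0.0
        self._backward = lambda: None  # leaves have nothing to push down

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), "+")
        def _backward():
            # local derivative of + is 1 for each input
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), "*")
        def _backward():
            # each input gets the other input's value
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")
        def _backward():
            self.grad += out.grad * (1.0 - t * t)
        out._backward = _backward
        return out

    def backward(self):
        # depth-first search: a node is appended only after all its children
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for child in node.children:
                    visit(child)
                order.append(node)
        visit(self)
        self.grad = 1.0                # seed: dL/dL = 1
        for node in reversed(order):   # reverse topological order
            node._backward()


x = Value(3.0)
y = x * x + x       # dy/dx = 2x + 1
y.backward()
print(x.grad)       # 7.0
```

Because `x` feeds both the product and the sum, its gradient is accumulated with `+=`: 3 + 3 from `x * x`, plus 1 from the addition.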
Worked backward pass
Leaves a=2, b=-3, c=10, f=-2. Forward:

    e = a*b = -6    d = e+c = 4    L = d*f = -8

Backward (seed grad L = 1):

    grad d = 1·f = -2          grad f = 1·d = 4
    grad e = grad d · 1 = -2   grad c = grad d · 1 = -2
    grad a = grad e · b = (-2)(-3) = 6
    grad b = grad e · a = (-2)(2) = -4

So dL/da = 6, dL/db = -4, dL/dc = -2: nudge a up a hair, the loss rises six times as fast.
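
One way to double-check these numbers without any autograd machinery is to nudge each leaf by a small step and watch how L moves. The helper names `loss` and `numeric_grad` and the step size `h` are assumptions made for this sketch.

```python
def loss(a, b, c, f):
    # same forward pass as the worked example
    e = a * b
    d = e + c
    return d * f

leaves = {"a": 2.0, "b": -3.0, "c": 10.0, "f": -2.0}
h = 1e-6  # small nudge; the exact size is an arbitrary choice

def numeric_grad(name):
    # forward difference: (L(x + h) - L(x)) / h for one leaf at a time
    bumped = dict(leaves)
    bumped[name] += h
    return (loss(**bumped) - loss(**leaves)) / h

for name in leaves:
    # expect values close to 6, -4, -2 and 4 for a, b, c and f
    print(name, round(numeric_grad(name), 4))
```

Backpropagation yields all four gradients in a single backward walk, whereas this check needs one extra forward pass per leaf.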
Why it matters for AI
`loss.backward()` in PyTorch follows this same procedure, and `jax.grad` in JAX and `tf.GradientTape` in TensorFlow rest on the same idea: a recorded computation, a local derivative per op, and the chain rule walked backward until every parameter has its gradient. micrograd is ~150 lines operating on single numbers; real frameworks are millions of lines operating on tensors. Same idea. Nothing inside is a mystery.
Pitfalls to dodge
- Autograd is symbolic calculus. No, it propagates numbers backward through the graph; no formula is derived.
- Multiply derivative uses the same input. No, grad of `a` for `e = a*b` uses `b`, the other factor.
- Skipping reverse topological order. A node must receive all its incoming gradient before passing any down.
- Confusing data with grad. A node holds both: data (forward) and grad (backward). `a` has data 2, grad 6.
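
The ordering pitfall is easy to reproduce on a small graph where one intermediate node feeds two others. The sketch below uses a stripped-down, hypothetical `Value` class plus two made-up helpers (`build`, `run`) and compares a valid reverse topological order with an order that lets `u` push its gradient down too early.

```python
class Value:
    def __init__(self, data, children=()):
        self.data, self.children, self.grad = data, children, 0.0
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward
        return out


def build():
    a = Value(3.0)
    u = a * Value(2.0)   # u = 6, feeds both p and q
    p = u * Value(5.0)   # p = 30
    q = u + Value(1.0)   # q = 7
    L = p + q            # L = 6u + 1 = 12a + 1, so dL/da = 12
    return a, u, p, q, L

def run(order, root):
    root.grad = 1.0      # seed the output
    for node in order:
        node._backward()

a, u, p, q, L = build()
run([L, p, q, u], L)     # u runs after both of its consumers
good = a.grad

a, u, p, q, L = build()
run([L, p, u, q], L)     # u runs before q has contributed
bad = a.grad

print(good, bad)         # 12.0 10.0
```

In the second run `u` passes on only the 5 it received from `p`; the 1 arriving later from `q` never reaches `a`.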
The one-line version
An autograd engine records the computation as a graph, knows each operation’s local derivative, and walks the chain rule backward through the graph so every input ends up holding its gradient.