Cheatsheet: Temporal-difference learning
The one idea
TD(0) updates V each step with a one-step bootstrapped target: reward + gamma * V(next state). Biased but low-variance; works online and on continuing tasks; the foundation of Q-learning and modern deep RL.
The TD(0) update
TD(0): V(s_t) <- V(s_t) + alpha * [ r_(t+1) + gamma * V(s_(t+1)) - V(s_t) ]
TD ERROR (delta_t): r_(t+1) + gamma * V(s_(t+1)) - V(s_t)
TARGET: r_(t+1) + gamma * V(s_(t+1)) (Bellman expectation, sample-estimated)
        = real reward + bootstrapped future value
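The "Bellman expectation, sample-estimated" remark can be spelled out. The following is a short sketch, assuming a fixed policy pi with true value function V^pi; the conditional-expectation notation is an addition, not part of the cheatsheet's own formulas:

```latex
\begin{align*}
V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\, r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \;\middle|\; s_t = s \,\right]
  && \text{(Bellman expectation equation)} \\
\text{TD target} &= r_{t+1} + \gamma\, V(s_{t+1})
  && \text{(one sample, with the estimate } V \text{ in place of } V^{\pi}\text{)} \\
\delta_t &= r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)
  && \text{(target minus current estimate)}
\end{align*}
```

Replacing the expectation with a single sample adds noise; replacing V^pi with the current estimate V is what introduces bias. If V already equals V^pi, the expected TD error in every state is zero.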
Compare with MC: V(s_t) <- V(s_t) + alpha * [ G_t - V(s_t) ] (full return, no bootstrap)
Worked example (deterministic chain, gamma = 1, alpha = 0.5)
A -> B (R=1) -> C (R=1, terminal). V_0 = (0, 0, 0). True V = (2, 1, 0).
Ep 1: A->B: delta=1, V(A) = 0.5; B->C: delta=1, V(B) = 0.5 -> V = (0.5, 0.5, 0)
Ep 2: A->B: delta=1, V(A) = 1.0; B->C: delta=0.5, V(B) = 0.75 -> V = (1.0, 0.75, 0)
Ep 3: A->B: delta=0.75, V(A) = 1.375; B->C: delta=0.25, V(B) = 0.875 -> V = (1.375, 0.875, 0)
Ep 4: A->B: delta=0.5, V(A) = 1.625; B->C: delta=0.125, V(B) = 0.9375 -> V = (1.625, 0.9375, 0)
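The episode traces above can be reproduced with a few lines of tabular TD(0). This is a minimal sketch; the function name `run_td0_chain` and the dictionary layout are illustrative choices, not part of the cheatsheet:

```python
def run_td0_chain(num_episodes, alpha=0.5, gamma=1.0):
    """Tabular TD(0) on the chain A -> B -> C with reward 1 on each step."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}  # C is terminal, so V(C) is never updated
    episode = [("A", 1.0, "B"), ("B", 1.0, "C")]  # (state, reward, next state)
    history = []
    for _ in range(num_episodes):
        for s, r, s_next in episode:
            delta = r + gamma * V[s_next] - V[s]  # TD error
            V[s] += alpha * delta                 # move V(s) toward the TD target
        history.append((V["A"], V["B"], V["C"]))
    return history


for ep, values in enumerate(run_td0_chain(4), start=1):
    print(f"Ep {ep}: V = {values}")
```

Within an episode, V(A) is updated before V(B), so A bootstraps from the value B had at the end of the previous episode.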
Value propagates BACKWARD from terminal: B catches up first; A is one bootstrap behind.
MC vs TD (the bias-variance spectrum)
| Aspect | MC | TD(0) |
|---|---|---|
| Target | Full return G_t | r_(t+1) + gamma * V(s_(t+1)) (one step + bootstrap) |
| Bias | Zero | Some (bootstrap from estimate) |
| Variance | High (long random sum) | Low (one reward + one estimate) |
| Needs termination | YES | NO |
| Update timing | End of episode | Every transition (online) |
| Continuing tasks | NO | YES |
n-step returns and TD(lambda) interpolate between MC and TD(0) on the same axis.
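To make the interpolation concrete, here is a hedged sketch of an n-step return. The function `n_step_return` and its list-based indexing are assumptions for illustration; TD(lambda), which averages over n, is not shown:

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return starting at time t.

    rewards[k] is the reward on the transition out of step k (r_(k+1) in the
    notation above). values[k] is the current estimate V(s_k); values must have
    one more entry than rewards, and the last entry (terminal state) must be 0.
    """
    T = len(rewards)
    end = min(t + n, T)  # stop early if the episode ends first
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    G += gamma ** (end - t) * values[end]  # bootstrap term (0 at the terminal state)
    return G


# Chain A -> B -> C from the worked example, using V after episode 1.
rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]
print(n_step_return(rewards, values, t=0, n=1))  # one-step TD target for A
print(n_step_return(rewards, values, t=0, n=2))  # reaches the terminal state: the MC return
```

With n = 1 this is the TD(0) target; once n reaches the end of the episode, the bootstrap term vanishes and the result is the full Monte Carlo return.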
In modern RL
SARSA       : TD on Q, on-policy target r + gamma * Q(s', a')
Q-learning  : TD on Q, off-policy target r + gamma * max_a' Q(s', a') (next lesson)
DQN         : Q-learning with NN approximator + experience replay + target net
Actor-critic: policy network + TD-trained critic
The deadly triad
TD bootstrap + OFF-POLICY learning + FUNCTION APPROXIMATION => CAN DIVERGE in naive implementations.
Mitigations (DQN): experience replay buffer + slowly-updated target network. These stabilize training in practice; they do not guarantee convergence.
Naive deep TD/Q-learning without these tricks is fragile (lesson 9).
Pitfalls to dodge
- Confusing the TD target with the MC return (one step + bootstrap vs full return).
- Reading “bootstrap” as “no real data” (the reward is real; only the future-value part is bootstrapped).
- Choosing alpha too large (estimates oscillate or diverge when rewards or transitions are stochastic).
- Reading TD’s bias as “wrong” (tabular TD(0) still converges to V^pi under standard step-size conditions; the bias-variance trade usually favors TD).
- Ignoring the deadly triad with function approximation.
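The step-size pitfall in the list above can be illustrated with a small sketch. It assumes a single state whose successor is terminal, and uses a deterministic alternating reward sequence as a stand-in for reward noise; the helper `track_value` is hypothetical. It shows oscillation only, not divergence:

```python
def track_value(rewards, alpha, v0=0.0):
    """Apply V <- V + alpha * (r - V) for one state whose successor is terminal."""
    v, trace = v0, []
    for r in rewards:
        v += alpha * (r - v)  # TD(0) update with V(terminal) = 0
        trace.append(v)
    return trace


rewards = [0.0, 2.0] * 200  # alternates around a mean of 1.0
for alpha in (0.9, 0.1):
    tail = track_value(rewards, alpha)[-50:]
    worst = max(abs(v - 1.0) for v in tail)
    print(f"alpha={alpha}: largest late deviation from 1.0 = {worst:.3f}")
```

With the large step size the estimate keeps chasing the latest reward; with the small one it settles close to the mean, at the cost of slower initial progress.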
Words to use precisely
- TD target: r_(t+1) + gamma * V(s_(t+1)); one-step Bellman estimate.
- TD error delta_t: target minus current V(s_t); drives the update.
- Bootstrap: using a current estimate to update another estimate.
- n-step / TD(lambda): targets between TD(0) and full MC on the bias-variance axis.
- Deadly triad: bootstrap + off-policy + function approximation; potential divergence.
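The divergence named in the last entry can be seen in a well-known two-state construction. This is a sketch under stated assumptions: linear values with a single shared weight w, V(s1) = w and V(s2) = 2w, reward 0, and semi-gradient TD(0) applied only to the transition s1 -> s2 (the off-policy part). The function name is illustrative:

```python
def semi_gradient_td_weight(w0, alpha, gamma, num_updates):
    """Repeatedly apply semi-gradient TD(0) to the single transition s1 -> s2.

    Features: x(s1) = 1, x(s2) = 2, so V(s1) = w and V(s2) = 2 * w.
    All rewards are 0, so the true values are 0 and w = 0 would be correct.
    """
    w = w0
    for _ in range(num_updates):
        delta = 0.0 + gamma * (2.0 * w) - (1.0 * w)  # TD error with bootstrapped target
        w += alpha * delta * 1.0                      # gradient of V(s1) w.r.t. w is 1
    return w


print(semi_gradient_td_weight(1.0, alpha=0.1, gamma=0.9, num_updates=100))  # grows
print(semi_gradient_td_weight(1.0, alpha=0.1, gamma=0.4, num_updates=100))  # shrinks
```

Each update multiplies w by 1 + alpha * (2 * gamma - 1), so the weight grows without bound whenever gamma > 0.5, even though every reward is zero. Raising V(s1) also raises the bootstrapped target 2w, and nothing ever updates s2 to correct it.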