Cheatsheet: Temporal-difference learning

TD(0) updates V each step with a one-step bootstrapped target: reward + gamma * V(next state). Biased but low-variance; works online and on continuing tasks; the foundation of Q-learning and modern deep RL.

TD(0): V(s_t) <- V(s_t) + alpha * [ r_(t+1) + gamma * V(s_(t+1)) - V(s_t) ]
TD ERROR (delta_t): r_(t+1) + gamma * V(s_(t+1)) - V(s_t)
TARGET: r_(t+1) + gamma * V(s_(t+1)) = real reward + bootstrapped future value
(a sampled one-step estimate of the Bellman expectation backup)
Compare with MC: V(s_t) <- V(s_t) + alpha * [ G_t - V(s_t) ] (full return, no bootstrap)
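
The TD(0) rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the dictionary-based value table, the function name `td0_update`, the `terminal` flag, and the state labels `s0`/`s1` are all assumptions introduced here.

```python
def td0_update(V, s, r, s_next, alpha=0.5, gamma=1.0, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    # A terminal next state has value 0, so the bootstrap term vanishes.
    bootstrap = 0.0 if terminal else V[s_next]
    target = r + gamma * bootstrap   # real reward + bootstrapped future value
    delta = target - V[s]            # TD error
    V[s] += alpha * delta
    return delta


# Hypothetical two-state table, all values initialised to 0.
V = {"s0": 0.0, "s1": 0.0}
delta = td0_update(V, "s0", r=1.0, s_next="s1")
print(delta, V["s0"])  # 1.0 0.5
```

The update touches only the state that was just left, which is why it can run after every transition instead of waiting for the episode to end.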

Worked example (deterministic chain, gamma = 1, alpha = 0.5)

Chain A -> B -> C with reward 1 on each transition; C is terminal. V_0 = (V(A), V(B), V(C)) = (0, 0, 0). True V = (2, 1, 0).
Ep 1: A->B: delta = 1,    V(A) = 0.5   | B->C: delta = 1,     V(B) = 0.5    | V = (0.5, 0.5, 0)
Ep 2: A->B: delta = 1,    V(A) = 1.0   | B->C: delta = 0.5,   V(B) = 0.75   | V = (1.0, 0.75, 0)
Ep 3: A->B: delta = 0.75, V(A) = 1.375 | B->C: delta = 0.25,  V(B) = 0.875  | V = (1.375, 0.875, 0)
Ep 4: A->B: delta = 0.5,  V(A) = 1.625 | B->C: delta = 0.125, V(B) = 0.9375 | V = (1.625, 0.9375, 0)
Value propagates BACKWARD from terminal: B catches up first; A is one bootstrap behind.
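
The four episodes above can be reproduced with a short loop. This is a sketch under the example's own settings (gamma = 1, alpha = 0.5, reward 1 per transition); the function name `run_chain` and the tuple layout of `transitions` are illustrative choices, not part of the lesson.

```python
def run_chain(episodes=4, alpha=0.5, gamma=1.0):
    """Run TD(0) on the chain A -> B -> C and record V after each episode."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}   # C is terminal, so V["C"] stays 0
    transitions = [("A", 1.0, "B"), ("B", 1.0, "C")]  # (state, reward, next state)
    history = []
    for _ in range(episodes):
        for s, r, s_next in transitions:
            delta = r + gamma * V[s_next] - V[s]
            V[s] += alpha * delta
        history.append((V["A"], V["B"], V["C"]))
    return history


history = run_chain()
for ep, values in enumerate(history, start=1):
    print(ep, values)
# 1 (0.5, 0.5, 0.0)
# 2 (1.0, 0.75, 0.0)
# 3 (1.375, 0.875, 0.0)
# 4 (1.625, 0.9375, 0.0)
```

Because A is updated before B within each episode, A always bootstraps from the value B had at the end of the previous episode, which is the one-step lag noted above.
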

| Aspect | MC | TD(0) |
| --- | --- | --- |
| Target | Full return G_t | r_(t+1) + gamma * V(s_(t+1)) (one step + bootstrap) |
| Bias | Zero | Some (bootstrap from estimate) |
| Variance | High (long random sum) | Low (one reward + one estimate) |
| Needs termination | YES | NO |
| Online updates | Episode-end | EVERY transition |
| Continuing tasks | NO | YES |
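
For contrast with TD(0), here is a minimal sketch of a constant-alpha Monte Carlo update on the same A -> B -> C chain. The function name `mc_episode_update` and the `(state, reward)` episode format are assumptions made for this illustration. Note that nothing can be updated until the episode has finished, because the full return G_t is needed.

```python
def mc_episode_update(V, episode, alpha=0.5, gamma=1.0):
    """Constant-alpha MC update applied once a complete episode is available.

    episode: list of (state, reward) pairs, where reward is received
             on leaving that state.
    """
    G = 0.0
    # Walk the episode backward so each return builds on the next one.
    for s, r in reversed(episode):
        G = r + gamma * G             # full return from state s
        V[s] += alpha * (G - V[s])    # no bootstrap: the target is G itself
    return V


V = {"A": 0.0, "B": 0.0, "C": 0.0}
mc_episode_update(V, [("A", 1.0), ("B", 1.0)])
print(V)  # {'A': 1.0, 'B': 0.5, 'C': 0.0}
```

After one episode MC has moved V(A) halfway to its true value of 2, while TD(0) with the same alpha has only reached 0.5. In this deterministic chain the return has no variance, so the comparison shows only the bootstrap lag, not the variance cost MC pays on stochastic tasks.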

n-step returns and TD(lambda) interpolate between TD(0) (n = 1) and MC (n = full episode) on the same bias-variance axis.
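
A rough sketch of the n-step target makes the interpolation concrete: n real rewards followed by one bootstrapped value. The function name `n_step_target` and its list-based arguments are hypothetical, chosen only for this example.

```python
def n_step_target(rewards, next_values, n, gamma=1.0):
    """n-step target starting at time t.

    rewards[k]     = r_(t+k+1)
    next_values[k] = V(s_(t+k+1)), with 0.0 for a terminal state
    If the episode ends before n steps, the target is the full MC return.
    """
    if not rewards:
        raise ValueError("need at least one observed reward")
    steps = min(n, len(rewards))
    G = sum(gamma**k * rewards[k] for k in range(steps))
    return G + gamma**steps * next_values[steps - 1]


# Chain A -> B -> C from state A, using V(B) = 0.5 and V(C) = 0 (terminal).
rewards = [1.0, 1.0]
next_values = [0.5, 0.0]
print(n_step_target(rewards, next_values, n=1))  # 1.5  (TD(0) target)
print(n_step_target(rewards, next_values, n=2))  # 2.0  (equals the MC return)
```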

SARSA : TD on Q, on-policy target r + gamma * Q(s', a')
Q-learning : TD on Q, off-policy target r + gamma * max_a' Q(s', a') (next lesson)
DQN : Q-learning with NN approximator + experience replay + target net
Actor-critic: policy network + TD-trained critic
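
The SARSA and Q-learning targets differ only in which next-state action value they bootstrap from. The sketch below assumes a nested-dictionary Q-table and a non-terminal next state; the function names, the state label `s1`, and the action labels are illustrative.

```python
def sarsa_target(Q, r, s_next, a_next, gamma=1.0):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma=1.0):
    """Off-policy: bootstrap from the greedy action, whatever is taken next."""
    return r + gamma * max(Q[s_next].values())


# Hypothetical Q-values for a single next state.
Q = {"s1": {"left": 0.25, "right": 0.75}}
print(sarsa_target(Q, 1.0, "s1", a_next="left"))  # 1.25
print(q_learning_target(Q, 1.0, "s1"))            # 1.75
```

The two targets coincide whenever the next action happens to be the greedy one.
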
Deadly triad: TD bootstrapping + OFF-POLICY learning + FUNCTION APPROXIMATION
=> CAN DIVERGE in naive implementations.
Mitigations (DQN): experience replay buffer + slowly updated target network.
Naive deep TD/Q-learning without these tricks is fragile (lesson 9).
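
To see how the three ingredients can interact, here is a small sketch of a standard two-state illustration. Both states share one weight w, with V(first) = w and V(second) = 2w, and only the transition first -> second (reward 0) is ever updated, which stands in for an off-policy update distribution. The setup, the function name `repeated_update`, and the parameter values are assumptions for illustration and are not taken from the lesson.

```python
def repeated_update(steps=100, alpha=0.1, gamma=0.99, w=1.0):
    """Semi-gradient TD(0) applied repeatedly to a single transition."""
    x, x_next, r = 1.0, 2.0, 0.0   # features of the two states, zero reward
    for _ in range(steps):
        delta = r + gamma * w * x_next - w * x   # TD error under V(s) = w * x(s)
        w += alpha * delta * x                   # gradient of V(first) w.r.t. w is x
    return w


print(repeated_update())           # grows without bound as steps increase
print(repeated_update(gamma=0.4))  # shrinks toward 0, the true value
```

Each step multiplies w by 1 + alpha * (2 * gamma - 1), so the weight grows whenever gamma > 0.5. The true values are all 0, yet the estimate runs away because raising V(first) also raises the bootstrapped target 2w.
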
Common mistakes

  • Confusing the TD target with the MC return (one step + bootstrap vs full return).
  • Reading “bootstrap” as “no real data” (the reward is real; only the future-value part is bootstrapped).
  • Choosing alpha too large (oscillation/divergence under stochasticity).
  • Reading TD’s bias as “wrong” (tabular TD(0) still converges to V^pi under standard step-size conditions; the bias-variance trade-off usually favors TD).
  • Ignoring the deadly triad with function approximation.

Key terms

  • TD target: r_(t+1) + gamma * V(s_(t+1)); one-step Bellman estimate.
  • TD error delta_t: target minus current V(s_t); drives the update.
  • Bootstrap: using a current estimate to update another estimate.
  • n-step / TD(lambda): targets between TD(0) and full MC on the bias-variance axis.
  • Deadly triad: bootstrap + off-policy + function approximation; potential divergence.