Temporal-difference learning: cheatsheet

The one idea

TD(0) updates V after every transition toward a one-step bootstrapped target: reward + gamma * V(next state). The target is biased (it leans on the current estimate) but low-variance. TD works online and on continuing tasks, and it is the foundation of Q-learning and much of modern deep RL.

The TD(0) update

TD(0):  V(s_t)  <-  V(s_t)  +  alpha * [ r_(t+1) + gamma * V(s_(t+1)) - V(s_t) ]

TD ERROR (delta_t):  r_(t+1) + gamma * V(s_(t+1)) - V(s_t)
TARGET:              r_(t+1) + gamma * V(s_(t+1))      (Bellman expectation, sample-estimated)
                     real reward + bootstrapped future value

Compare with MC: V(s_t) <- V(s_t) + alpha * [ G_t - V(s_t) ]   (full return, no bootstrap)
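
The two update rules above can be sketched in a few lines. This is a minimal tabular illustration, not a reference implementation: the function names, the dict-based value table, and the `terminal` flag are assumptions introduced here.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # A terminal next state contributes no future value.
    bootstrap = 0.0 if terminal else V[s_next]
    target = r + gamma * bootstrap   # TD target: real reward + bootstrapped value
    delta = target - V[s]            # TD error
    V[s] += alpha * delta
    return delta


def mc_update(V, s, G, alpha):
    """Apply one constant-alpha Monte Carlo update toward the full return G."""
    V[s] += alpha * (G - V[s])


# Hypothetical two-state table, all values starting at 0.
V = {"A": 0.0, "B": 0.0}
delta = td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=1.0)
```

The only difference between the two functions is the target: `td0_update` needs one transition, `mc_update` needs the whole return G and so has to wait for the episode to end.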

Worked example (deterministic chain, gamma = 1, alpha = 0.5)

A -> B (R=1) -> C (R=1, terminal): each transition pays reward 1 and V(C) stays 0.   V_0 = (0, 0, 0) for (A, B, C).   True V = (2, 1, 0).

Ep 1:  A->B: delta=1   V(A) = 0.5    B->C: delta=1   V(B) = 0.5    -> V = (0.5, 0.5, 0)
Ep 2:  A->B: delta=1   V(A) = 1.0    B->C: delta=0.5 V(B) = 0.75   -> V = (1.0, 0.75, 0)
Ep 3:  A->B: delta=0.75 V(A) = 1.375 B->C: delta=0.25 V(B) = 0.875  -> V = (1.375, 0.875, 0)
Ep 4:  A->B: delta=0.5  V(A) = 1.625 B->C: delta=0.125 V(B) = 0.9375 -> V = (1.625, 0.9375, 0)

Value propagates BACKWARD from terminal: B catches up first because its target is the real reward alone; A is one bootstrap behind because its target uses the still-low V(B).
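
The four episodes above can be replayed with a short script. It is a sketch under the example's own settings (gamma = 1, alpha = 0.5, updates applied in order A->B then B->C); the name `run_chain` and the dict layout are assumptions made for illustration.

```python
def run_chain(episodes, alpha=0.5, gamma=1.0):
    """Run TD(0) on the chain A -> B -> C and record V after each episode."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}   # C is terminal and is never updated
    # Each entry is (state, reward, next_state).
    transitions = [("A", 1.0, "B"), ("B", 1.0, "C")]
    history = []
    for _ in range(episodes):
        for s, r, s_next in transitions:
            delta = r + gamma * V[s_next] - V[s]   # TD error
            V[s] += alpha * delta
        history.append((V["A"], V["B"], V["C"]))
    return history


history = run_chain(4)
for i, v in enumerate(history, start=1):
    print(f"Ep {i}: V = {v}")
```

The recorded values match the rows of the table; running more episodes moves V toward the true values (2, 1, 0).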

MC vs TD (the bias-variance spectrum)

Aspect	MC	TD(0)
Target	Full return G_t	r_(t+1) + gamma * V(s_(t+1)) (one step + bootstrap)
Bias	Zero	Some (bootstrap from estimate)
Variance	High (long random sum)	Low (one reward + one estimate)
Needs termination	YES	NO
Update timing	End of episode	After EVERY transition
Continuing tasks	NO	YES

n-step returns and TD(lambda) interpolate between MC and TD(0) on the same axis.
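
To make the interpolation concrete, here is a sketch of an n-step return. The indexing convention (rewards[k] holds r_(k+1), values has one extra entry for the terminal state) and the function name are assumptions chosen for this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t for one recorded episode.

    rewards[k] is r_(k+1), the reward received on leaving s_k.
    values[k] is the current estimate V(s_k); values has one more entry
    than rewards, and the last entry (terminal state) should be 0.
    """
    T = len(rewards)          # episode length
    end = min(t + n, T)       # stop at the end of the episode
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    G += gamma ** (end - t) * values[end]   # bootstrap; adds 0 at terminal
    return G


# Chain A -> B -> C with the estimates after episode 2 of the worked example.
rewards = [1.0, 1.0]
values = [1.0, 0.75, 0.0]
one_step = n_step_return(rewards, values, t=0, n=1, gamma=1.0)   # TD(0) target
two_step = n_step_return(rewards, values, t=0, n=2, gamma=1.0)   # full MC return
```

With n = 1 the function returns the TD(0) target; once n reaches the remaining episode length it returns the Monte Carlo return, with no bootstrap left.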

In modern RL

SARSA       : TD on Q, on-policy target  r + gamma * Q(s', a')
Q-learning  : TD on Q, off-policy target r + gamma * max_a' Q(s', a')   (next lesson)
DQN         : Q-learning with NN approximator + experience replay + target net
Actor-critic: policy network + TD-trained critic
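
The SARSA and Q-learning targets differ only in which next action they bootstrap from. A minimal tabular sketch, assuming Q is a nested dict Q[state][action] (an illustrative layout, with made-up state and action names):

```python
def sarsa_target(Q, r, s_next, a_next, gamma, terminal=False):
    """On-policy target: bootstrap from the action actually taken next."""
    return r if terminal else r + gamma * Q[s_next][a_next]


def q_learning_target(Q, r, s_next, gamma, terminal=False):
    """Off-policy target: bootstrap from the greedy action in s_next."""
    return r if terminal else r + gamma * max(Q[s_next].values())


# Hypothetical action values for a single next state.
Q = {"s1": {"left": 0.2, "right": 1.0}}
sarsa = sarsa_target(Q, r=0.5, s_next="s1", a_next="left", gamma=0.9)
qlearn = q_learning_target(Q, r=0.5, s_next="s1", gamma=0.9)
```

Here the behavior policy happened to pick "left", so the SARSA target uses Q(s1, left), while the Q-learning target uses the larger Q(s1, right) regardless of what was actually done.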

The deadly triad

TD bootstrap + OFF-POLICY learning + FUNCTION APPROXIMATION
=> CAN DIVERGE in naive implementations.
Mitigations (DQN): experience replay buffer + slowly-updated target network. These stabilize training in practice; they do not guarantee convergence.
Naive deep TD/Q-learning without these tricks is fragile (lesson 9).
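
A tiny illustration of how the three ingredients can combine into divergence. This is a commonly used two-state toy setup, stated here as assumptions: one shared weight w with V(s1) = w and V(s2) = 2w (linear function approximation), reward 0, and updates applied only to the transition s1 -> s2 (standing in for an off-policy update distribution). The true values are all 0.

```python
def td_on_one_transition(w, alpha, gamma, steps):
    """Repeatedly apply semi-gradient TD(0) to the single transition s1 -> s2.

    V(s1) = 1 * w and V(s2) = 2 * w share one weight; the reward is 0.
    """
    for _ in range(steps):
        delta = 0.0 + gamma * (2.0 * w) - (1.0 * w)   # TD error
        w += alpha * delta * 1.0   # gradient of V(s1) with respect to w is 1
    return w


w_diverging = td_on_one_transition(w=1.0, alpha=0.1, gamma=0.9, steps=100)
w_stable = td_on_one_transition(w=1.0, alpha=0.1, gamma=0.4, steps=100)
```

Each step multiplies w by 1 + alpha * (2 * gamma - 1), so in this setup w grows without bound whenever gamma > 0.5 and shrinks toward the correct value 0 otherwise. The step size only changes how fast this happens.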

Pitfalls to dodge

Confusing the TD target with the MC return (one step + bootstrap vs full return).
Reading “bootstrap” as “no real data” (the reward is real; only the future-value part is bootstrapped).
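Choosing alpha too large (estimates oscillate or diverge when rewards or transitions are noisy).
Reading TD’s bias as “wrong” (the bias shrinks as V improves; tabular TD(0) converges to V^pi under standard step-size conditions, and the bias-variance trade usually favors TD).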
Ignoring the deadly triad with function approximation.

Words to use precisely

TD target: r_(t+1) + gamma * V(s_(t+1)); a one-step sample estimate of the Bellman expectation backup.
TD error delta_t: target minus current V(s_t); drives the update.
Bootstrap: using a current estimate to update another estimate.
n-step / TD(lambda): targets between TD(0) and full MC on the bias-variance axis.
Deadly triad: bootstrap + off-policy + function approximation; potential divergence.