Summary: Temporal-difference learning

TD(0) updates V at every step using a bootstrapped one-step target: the observed reward plus the current estimate of the next state’s value. That single change from MC trades a small bias for a large variance reduction, gives you online learning, and works on continuing tasks where MC cannot. TD is the foundation under Q-learning, SARSA, DQN, and actor-critic. This summary is the scan-in-five-minutes version of the full lesson.

  • The TD(0) update. V(s_t) <- V(s_t) + alpha * [ r_(t+1) + gamma * V(s_(t+1)) - V(s_t) ]. The bracketed expression is the TD error delta_t. The target r_(t+1) + gamma * V(s_(t+1)) is a one-sample estimate of the right-hand side of the Bellman expectation equation, built from a single observed transition.
  • Bootstrapping. TD uses V(s_(t+1)), an estimate, to update V(s_t), another estimate. The reward r_(t+1) is real data; the future-value piece is a current estimate being learned. That is what “bootstrap” means here.
  • Worked example on a deterministic A->B->C chain. C is terminal, reward is 1 on each step, gamma = 1, alpha = 0.5, V_0 = 0. V(A) climbs 0 -> 0.5 -> 1.0 -> 1.375 -> 1.625 across four episodes; V(B) climbs 0 -> 0.5 -> 0.75 -> 0.875 -> 0.9375; both creep toward V^pi = (2, 1, 0). Value propagates backward from the terminal state, one bootstrap per episode.
  • MC vs TD on the bias-variance axis. MC is the unbiased, high-variance extreme (target = full return). TD(0) is the biased, low-variance extreme (target = one reward + bootstrap). n-step returns and TD(lambda) interpolate between them. In the tabular setting with suitably decaying step sizes, both converge to V^pi.
  • Online + continuing tasks. TD needs only one transition, so updates happen as data arrives and the algorithm runs on non-terminating problems. MC, which waits for a full return, does neither.
  • Foundation under modern model-free RL. SARSA and Q-learning are TD on Q (lesson 8). DQN trains a neural network to match a TD target (lesson 9). Actor-critic uses a TD-trained critic to guide a policy network. Combining the deadly triad (TD bootstrapping + off-policy learning + function approximation) can make learning diverge; DQN’s experience replay and target networks tame it in practice.
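
The worked chain example in the list above can be reproduced with a few lines of tabular TD(0). This is a minimal sketch, not code from the lesson: the names `td0_episode`, `run`, and `EPISODE` are illustrative, and the chain is hard-coded as two transitions with C treated as a terminal state whose value stays at 0.

```python
# Tabular TD(0) on the deterministic A -> B -> C chain from the summary.
# Assumptions (illustrative, not from the lesson): C is terminal and is never
# updated, and each episode visits A then B in order.
GAMMA = 1.0
ALPHA = 0.5
EPISODE = [("A", 1.0, "B"), ("B", 1.0, "C")]  # (state, reward, next_state)


def td0_episode(V, transitions, alpha=ALPHA, gamma=GAMMA):
    """Apply one TD(0) update per transition, in the order they occur."""
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]  # delta_t
        V[s] += alpha * td_error
    return V


def run(n_episodes=4):
    """Run n_episodes passes over the chain and record V after each one."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}
    history = [dict(V)]
    for _ in range(n_episodes):
        td0_episode(V, EPISODE)
        history.append(dict(V))
    return history


history = run()
for episode, V in enumerate(history):
    print(episode, V["A"], V["B"])
# Episode 4 prints: 4 1.625 0.9375
```

Because A is updated before B within each episode, V(A) bootstraps from the value of B left over from the previous episode, which is why V(A) lags behind its true value of 2 even as V(B) closes in on 1.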

You now have, in its simplest form, the algorithm that sits under the hood of almost every deployed model-free RL system. The most useful mental model is the bias-variance lever: MC at one end, TD(0) at the other, n-step and TD(lambda) in between. Choosing an algorithm is really choosing where on that lever you want to sit for your problem. The second most useful idea is that value propagates backward: on the deterministic chain, the bootstrap visibly carries terminal information back one state per pass. That is also one reason deep RL tends to need many environment steps to learn good values on long-horizon problems. The next lesson takes the V-bootstrap and applies it to Q with a max over actions baked in, producing Q-learning, the canonical model-free control algorithm.
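
One way to make the bias-variance lever concrete is to write the n-step target as a single function whose parameter `n` slides between TD(0) and MC. The sketch below is an illustration under assumptions of our own: `n_step_target` is a hypothetical helper, `rewards[k]` holds the reward received after leaving `states[k]`, and the last entry of `states` is terminal.

```python
def n_step_target(rewards, states, V, t, n, gamma=1.0):
    """n-step return from time t: n real rewards, then a bootstrap from V.

    n = 1 gives the TD(0) target; n >= remaining steps gives the MC return.
    """
    T = len(rewards)          # number of transitions in the episode
    end = min(t + n, T)       # stop early if the episode terminates first
    G = 0.0
    for k in range(t, end):
        G += gamma ** (k - t) * rewards[k]
    if end < T:               # bootstrap only when we stopped before terminal
        G += gamma ** (end - t) * V[states[end]]
    return G


# The A -> B -> C chain, with value estimates as they stand after one episode.
states = ["A", "B", "C"]
rewards = [1.0, 1.0]
V = {"A": 0.5, "B": 0.5, "C": 0.0}

td_target = n_step_target(rewards, states, V, t=0, n=1)  # 1 + V(B) = 1.5
mc_target = n_step_target(rewards, states, V, t=0, n=2)  # 1 + 1 = 2.0
print(td_target, mc_target)
```

With `n=1` the target leans on the current estimate of V(B) and inherits its error; with `n=2` it uses only observed rewards and matches the true return of 2 on this deterministic chain.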