Temporal-difference learning

This is lesson 7 of Track 17 (Reinforcement Learning Foundations) and the second lesson of Phase 3 (Model-free learning). The previous lesson covered Monte Carlo prediction; this lesson develops its complement, temporal-difference learning. TD(0) makes one small change to the MC update: it replaces the full return with a one-step bootstrapped target. That single change buys online learning, support for continuing tasks, and much lower variance, at the cost of some bias, because the target leans on the current value estimate. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), which is freely available and cited in each lesson as further study.
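
For orientation, the change from MC to TD(0) can be written side by side. This is the standard textbook form rather than a quotation from the lesson body; the symbol G_t for the full return from time t is an assumption of this sketch, with alpha the step size and gamma the discount factor.

```latex
\begin{align*}
\text{MC:}\quad    & V(s_t) \leftarrow V(s_t) + \alpha \bigl( G_t - V(s_t) \bigr) \\
\text{TD(0):}\quad & V(s_t) \leftarrow V(s_t) + \alpha \bigl( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr)
\end{align*}
```

The MC target G_t is known only once the episode ends. The TD(0) target needs just the next reward and the next state, which is why the update can run after every step and in tasks that never terminate.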

The lesson writes the TD(0) update and the TD error, then walks four episodes of a deterministic A->B->C chain to show value propagating backward from the terminal state, one bootstrap per pass. It sets the MC vs TD bias-variance table side by side, names the practical wins (online learning and continuing-task support), and places TD as the foundation under Q-learning (next lesson), SARSA, DQN, and actor-critic. It also names the deadly triad (bootstrapping, off-policy learning, and function approximation) as the reason naive deep value learning is fragile; the DQN fixes are deferred to lesson 9.
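
The backward propagation described above can be previewed with a minimal Python sketch. The chain layout, rewards, and hyperparameters here are assumptions chosen for illustration and may differ from the lesson's worked example: A -> B -> C -> terminal, reward 0 on the first two transitions and +1 on leaving C, alpha = 0.5, gamma = 1, and all values initialized to 0. The names `TRANSITIONS` and `run_td0_chain` are hypothetical.

```python
# Assumed chain (not the lesson's exact numbers): A -> B -> C -> terminal,
# reward 0 on the first two transitions and +1 on leaving C.
TRANSITIONS = {"A": ("B", 0.0), "B": ("C", 0.0), "C": (None, 1.0)}


def run_td0_chain(num_episodes=4, alpha=0.5, gamma=1.0):
    """Run tabular TD(0) on the chain; return the value table after each episode."""
    values = {state: 0.0 for state in TRANSITIONS}
    history = []
    for _ in range(num_episodes):
        state = "A"
        while state is not None:
            next_state, reward = TRANSITIONS[state]
            # The terminal state has value 0 by definition.
            next_value = 0.0 if next_state is None else values[next_state]
            # TD error: one-step bootstrapped target minus current estimate.
            td_error = reward + gamma * next_value - values[state]
            # Online update: applied immediately, before the next transition.
            values[state] += alpha * td_error
            state = next_state
        history.append(dict(values))
    return history


if __name__ == "__main__":
    for episode, snapshot in enumerate(run_td0_chain(), start=1):
        print(episode, snapshot)
```

Under these assumed settings only V(C) moves in episode 1, V(B) first moves in episode 2, and V(A) first moves in episode 3: value reaches one state further back per pass, because each state can only learn from what its successor already knows.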

This is lesson 7 of 10, in the middle of Phase 3. It is the second of the two prediction lessons in this phase (MC was the first); together they sit at the two ends of the bias-variance spectrum. The next lesson, Q-learning, lifts this same one-step TD update from V to Q with a max-over-actions in the target, giving the canonical model-free control algorithm. Lesson 9 replaces the tabular V (or Q) with a function approximator, which is where the deadly triad named here becomes the engineering problem DQN solves.
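
As a preview of that lift from V to Q, the standard tabular Q-learning update is sketched below. It is the textbook form, not taken from the next lesson's text, and a' (the action maximized over in the next state) is notation assumed here.

```latex
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Bigl( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Bigr)
\]
```

The shape is the same "estimate plus alpha times (target minus estimate)" step; only the bootstrapped part of the target changes, from the value of the next state to the value of the best action available in it.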

Prerequisites: the previous lesson (Monte Carlo prediction) for the prediction problem and the MC update form to compare against; lesson 3 (Value functions and the Bellman equations) for the Bellman expectation equation the TD target estimates. Comfort with a single-step update of the form V <- V + alpha * (target - V) is the only computational background needed.
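
As a reminder of the link to lesson 3, the Bellman expectation equation for a policy pi can be written as below. This is the standard form; the symbols v_pi for the true value function and E_pi for expectation under the policy are assumed notation.

```latex
\[
v_\pi(s) = \mathbb{E}_\pi \bigl[ \, r_{t+1} + \gamma \, v_\pi(s_{t+1}) \;\big|\; s_t = s \, \bigr]
\]
```

The TD target r_(t+1) + gamma * V(s_(t+1)) makes two substitutions into the right-hand side: a single sampled transition stands in for the expectation, and the current estimate V stands in for the true v_pi. The second substitution is the bootstrap, and it is where the bias comes from.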

The arithmetic is hand-sized: at each transition, compute the TD error (one reward + gamma times one V-estimate minus one V-estimate), then nudge V by alpha times the TD error. The 4-episode A->B->C worked example shows every delta and every update; the practice runs a 4-state chain through three episodes. No proofs; the contraction-style convergence argument that backs TD is left to the textbook.
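
The per-transition arithmetic described above fits in two small functions. This is a minimal sketch; the function names `td_error` and `td0_step` and the example numbers are illustrative assumptions, not taken from the lesson.

```python
def td_error(reward, next_value, current_value, gamma):
    """One reward + gamma times one V-estimate, minus one V-estimate."""
    return reward + gamma * next_value - current_value


def td0_step(current_value, reward, next_value, alpha, gamma):
    """Nudge V(s) by alpha times the TD error; return (new V(s), TD error)."""
    delta = td_error(reward, next_value, current_value, gamma)
    return current_value + alpha * delta, delta


if __name__ == "__main__":
    # Illustrative numbers: V(s) = 2.0, r = 1.0, V(s') = 4.0, gamma = 0.5, alpha = 0.5.
    # Target = 1.0 + 0.5 * 4.0 = 3.0, so delta = 1.0 and V(s) moves to 2.5.
    new_value, delta = td0_step(2.0, reward=1.0, next_value=4.0, alpha=0.5, gamma=0.5)
    print(new_value, delta)
```

A positive TD error means the transition turned out better than V(s) predicted, so V(s) moves up; a negative one moves it down. For a transition into a terminal state, pass `next_value=0.0`.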

  • Write the TD(0) update for V using the one-step bootstrapped target r_(t+1) + gamma * V(s_(t+1))
  • Define the TD error and explain it as the difference between target and current estimate
  • Run TD(0) through several episodes on a small deterministic MDP and observe monotonic convergence
  • Compare MC and TD on the bias-variance axis (MC unbiased / high-variance, TD biased / low-variance)
  • Recognize that TD supports continuing tasks and online learning, and that it is the foundation of Q-learning and modern model-free RL
  • Read time: about 13 minutes
  • Practice time: about 16 minutes (a self-check, a TD(0) trace on a fresh 4-state chain, an MC vs TD comparison drill on the same setting, and flashcards)
  • Difficulty: standard (small arithmetic per step; conceptual challenge is internalizing the bootstrap as “estimate updates estimate” and the bias-variance trade with MC)
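
The bias-variance objective above can be made concrete with a small simulation. This is a toy setup assumed for illustration, not the lesson's example: a chain of 10 steps where every step pays +1 or -1 with equal probability, gamma = 1, so the true value of the start state is 0. The TD target deliberately bootstraps from a wrong estimate V(s') = 0.5 to make the bias visible. The names `sample_targets` and `summarize` are hypothetical.

```python
import random
import statistics


def sample_targets(num_samples=5000, chain_length=10,
                   next_value_estimate=0.5, seed=0):
    """Sample MC and TD(0) targets for the start state of a noisy chain."""
    rng = random.Random(seed)
    mc_targets, td_targets = [], []
    for _ in range(num_samples):
        # Each step pays +1 or -1 with equal probability (assumed toy rewards).
        rewards = [rng.choice((-1.0, 1.0)) for _ in range(chain_length)]
        # MC target: the full return (gamma = 1), summing every noisy reward.
        mc_targets.append(sum(rewards))
        # TD target: one noisy reward plus a fixed (here, wrong) estimate.
        td_targets.append(rewards[0] + next_value_estimate)
    return mc_targets, td_targets


def summarize(targets):
    """Return (mean, population variance) of a list of targets."""
    return statistics.mean(targets), statistics.pvariance(targets)


if __name__ == "__main__":
    mc, td = sample_targets()
    print("MC  mean, variance:", summarize(mc))
    print("TD  mean, variance:", summarize(td))
```

In this setup the MC targets should center near the true value 0 with variance close to the chain length, since every reward adds its own noise. The TD targets should center near 0.5 with variance of at most 1: they inherit the error in the estimate they bootstrap from (bias) but carry the noise of only one reward (low variance).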