
References: Temporal-difference learning

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 4:
Model-Free Prediction (temporal-difference learning)
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the TD portion of Silver's Lecture 4
and restates it in Clawdemy's voice with an original deterministic A->B->C
chain worked example, designed so the bootstrap-propagates-backward mechanism
is visible without the additional noise of stochastic transitions. Three
elements are Clawdemy framing: the explicit MC-vs-TD bias-variance table, the
value-propagates-backward observation, and the early naming of the deadly
triad (bootstrapping + off-policy learning + function approximation) as the
reason naive deep RL is fragile, with the DQN fixes deferred to lesson 9.
n-step returns and TD(lambda) are named as interpolation points on the
spectrum but not developed in detail. Exact per-lecture URLs are verified at promotion.
  • David Silver, UCL RL course, Lecture 4: Model-Free Prediction. The lecture that both this lesson and the previous one draw from. It develops MC and TD in tandem, so the bias-variance trade-off can be compared directly, and it also covers the n-step and TD(lambda) material. CC BY-NC 4.0, freely available.

A short, durable list. Both are free.

  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 6 (Temporal-Difference Learning) and Chapter 7 (n-step Bootstrapping). The standard textbook treatment, with convergence proofs for tabular TD, the SARSA control algorithm, and the n-step interpolation between TD(0) and Monte Carlo.
  • David Silver, UCL RL course, Lecture 5: Model-Free Control (within the course above). The natural continuation: wrapping TD-style prediction into a GPI control loop. SARSA and Q-learning come out of this material; Track 17 lesson 8 develops Q-learning from it.
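
The n-step interpolation mentioned in the Sutton and Barto entry can be made concrete with a small sketch. The function name, the reward list, and the value estimates below are hypothetical and chosen only for illustration; the point is that n = 1 gives the TD(0) target and a large enough n gives the Monte Carlo return.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return from time t (illustrative sketch, not from the source).

    rewards[k] is the reward on the transition out of step k.
    values[k] is the current estimate V(s_k) for the state at step k.
    The episode ends after len(rewards) transitions; the terminal state
    is assumed to have value 0.
    """
    T = len(rewards)
    end = min(t + n, T)
    # Sum the real, discounted rewards for up to n steps.
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from the current estimate only if the episode has not ended.
    if end < T:
        g += gamma ** (end - t) * values[end]
    return g


rewards = [1.0, 2.0, 3.0]   # hypothetical rewards for a 3-step episode
values = [0.5, 0.5, 0.5]    # hypothetical current estimates V(s_0), V(s_1), V(s_2)

td_target = n_step_return(rewards, values, t=0, n=1, gamma=0.9)   # r_0 + gamma * V(s_1)
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.9)    # two real rewards, then bootstrap
mc_return = n_step_return(rewards, values, t=0, n=len(rewards), gamma=0.9)  # no bootstrap at all
```

With these made-up numbers the one-step target leans almost entirely on the estimate V(s_1), while the full-length return ignores the estimates altogether; intermediate n trades one for the other.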

Where this leads inside this track.

  • Monte Carlo prediction. The previous lesson. Monte Carlo sits at the unbiased, high-variance end of the spectrum; TD sits at the opposite end.
  • Q-learning: model-free control. The next lesson. Q-learning is the TD update applied to action-values Q, with a max-over-actions in the target (off-policy). The same bootstrap mechanism with one extra ingredient.
  • Function approximation and deep RL. Lesson 9. When V (or Q) is approximated by a neural network rather than stored in a table, the TD update becomes a regression to a Bellman target, and the deadly triad named here becomes the engineering problem that DQN’s tricks solve.
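
As a rough sketch of how the TD update relates to the Q-learning update named above, the two tabular rules can be written side by side. Everything here is an illustrative assumption rather than the lesson's own code: the function names, the step size alpha = 0.5, the action labels, and the reward of 1 on the final B->C transition of a deterministic A->B->C chain.

```python
from collections import defaultdict


def td0_update(V, s, r, s_next, done, alpha, gamma=1.0):
    """Tabular TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])


def q_learning_update(Q, s, a, r, s_next, actions, done, alpha, gamma=1.0):
    """Tabular Q-learning: the same update on Q(s, a), with a max over next actions."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical deterministic chain A -> B -> C, reward 1 on the last step.
episode = [("A", 0.0, "B", False), ("B", 1.0, "C", True)]

V = defaultdict(float)
history = []
for _ in range(2):
    for s, r, s_next, done in episode:
        td0_update(V, s, r, s_next, done, alpha=0.5)
    history.append((V["A"], V["B"]))  # snapshot after each episode

# One Q-learning step from A, given a hypothetical existing estimate at B.
Q = defaultdict(float)
Q[("B", "right")] = 0.5
q_learning_update(Q, "A", "right", 0.0, "B", ["left", "right"], False, alpha=0.5)
```

In this sketch V(A) stays at 0 after the first episode, because V(B) was still 0 when A was updated, and only moves in the second episode once V(B) has learned something. That is the value-propagates-backward behaviour in miniature. The Q-learning step differs only in taking the max over next actions when forming the target.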