References: Temporal-difference learning
Source material
Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 4: Model-Free Prediction (temporal-difference learning)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this course. We do not embed, reproduce, or transcribe Silver's slides or video lectures; we link out to the relevant lecture as recommended further study. The non-commercial clause aligns with Clawdemy's free, zero-revenue posture. All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the TD portion of Silver's Lecture 4 and restates it in Clawdemy's voice with an original deterministic A->B->C chain worked example, designed so the bootstrap-propagates-backward mechanism is visible without the additional noise of stochastic transitions. The explicit MC-vs-TD bias-variance table, the value-propagates-backward observation, and the early naming of the deadly triad (bootstrap + off-policy + function approximation) as the reason naive deep RL is fragile (with the DQN fixes forward-pointed to lesson 9) are Clawdemy framing. n-step returns and TD(lambda) are named as interpolation points on the spectrum but not developed in detail. Exact per-lecture URLs are verified at promotion.
Read this next
- David Silver, UCL RL course, Lecture 4: Model-Free Prediction. The lecture that both this lesson and the previous one draw from, with MC and TD developed in tandem so the bias-variance trade-off can be compared directly. The n-step and TD(lambda) material is also developed there. CC BY-NC 4.0, freely available.
Going deeper
A short, durable list. Both are free.
- Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 6 (Temporal-Difference Learning) and Chapter 7 (n-step Bootstrapping). The standard textbook treatment, with convergence proofs for tabular TD, the SARSA control algorithm, and the n-step interpolation between TD(0) and Monte Carlo.
- David Silver, UCL RL course, Lecture 5: Model-Free Control (within the course above). The natural continuation: wrapping TD-style prediction into a GPI control loop. SARSA and Q-learning come out of this material; Track 17 lesson 8 develops Q-learning from it.
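The n-step interpolation mentioned in the Sutton and Barto entry can be made concrete with a small sketch. This is an illustration under stated assumptions, not code from either source: the function name `n_step_return`, the list-based episode layout, and the reward and value numbers are all hypothetical. With n = 1 the function returns the TD(0) target; once n reaches the episode length it returns the plain Monte Carlo return with no bootstrapping.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return from time step t (hypothetical helper for illustration).

    rewards[k] is the reward received on leaving state k;
    values[k] is the current estimate V(S_k);
    the episode ends after len(rewards) steps, and the terminal value is 0.
    """
    T = len(rewards)
    end = min(t + n, T)
    # Sum the discounted real rewards for up to n steps.
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from the value estimate only if the episode has not ended.
    if end < T:
        g += gamma ** (end - t) * values[end]
    return g


# Assumed toy episode: three steps, reward only on the last one.
rewards = [0.0, 0.0, 1.0]
values = [0.1, 0.2, 0.3]

td_target = n_step_return(rewards, values, t=0, n=1, gamma=0.9)  # one reward, then bootstrap
two_step = n_step_return(rewards, values, t=0, n=2, gamma=0.9)   # two rewards, then bootstrap
mc_return = n_step_return(rewards, values, t=0, n=3, gamma=0.9)  # full return, no bootstrap
```

Increasing n replaces more of the estimated value with observed reward, which is the sense in which n-step methods sit between TD(0) and Monte Carlo.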
Adjacent topics
Where this leads inside this track.
- Monte Carlo prediction. The previous lesson. Monte Carlo sits at the unbiased, high-variance end of the spectrum; TD sits at the other end.
- Q-learning: model-free control. The next lesson. Q-learning is the TD update applied to action-values Q, with a max-over-actions in the target (off-policy). The same bootstrap mechanism with one extra ingredient.
- Function approximation and deep RL. Lesson 9. When V (or Q) is approximated by a neural network rather than stored in a table, the TD update becomes a regression to a Bellman target — and the deadly triad named here becomes the engineering problem DQN’s tricks solve.
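The relationship between the TD update and the Q-learning update described in the list above can be sketched side by side. This is a minimal tabular illustration, not the lesson's worked example: the function names `td0_update` and `q_learning_update`, the dictionary tables, the state and action labels, and the step size, discount, and reward values are all assumptions chosen for the sketch.

```python
from collections import defaultdict


def td0_update(V, s, r, s_next, done, alpha=0.5, gamma=1.0):
    """One tabular TD(0) step: move V[s] toward r + gamma * V[s_next]."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V[s]


def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.5, gamma=1.0):
    """Same bootstrap on action-values, with a max over next actions in the target."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Assumed transitions on a chain A -> B -> C, with C terminal and reward 1 on arrival.
V = defaultdict(float)
td0_update(V, "B", 1.0, "C", done=True)    # B learns from the real reward
td0_update(V, "A", 0.0, "B", done=False)   # A learns from B's estimate (bootstrap)

# Assumed action-value table with two hypothetical actions.
Q = defaultdict(float)
Q[("B", "left")] = 0.2
Q[("B", "right")] = 0.8
q_learning_update(Q, "A", "right", 0.0, "B", ["left", "right"], done=False)
```

The two functions differ only in what they bootstrap from: `td0_update` uses the value of the next state, while `q_learning_update` uses the largest action-value available in the next state, regardless of which action the behaviour policy takes there.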