References: Value iteration

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 3:
Planning by Dynamic Programming (value iteration section + contraction
mapping theorem)
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the value-iteration material in
Silver's Lecture 3 and restates it in Clawdemy's voice. Its worked A/B
example deliberately reuses the previous lesson's MDP so that the PI-vs-VI
comparison is direct. Three pieces are Clawdemy framing rather than Silver's:
the explicit observation that the greedy policy stabilizes early (with the
computation showing that V(B) - V(A) is constant, which proves the policy is
fixed from iteration 1); the placement of VI on the GPI spectrum; and the
prefiguring of Q-learning and DQN as VI with samples / function
approximation. Exact per-lecture URLs are verified at promotion.

A short, durable list. Both are free.

  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 4 (Dynamic Programming), Sections 4.4-4.5. Standard textbook treatment of value iteration (Section 4.4, with the Gambler's Problem as its worked example) and asynchronous DP (Section 4.5).
  • David Silver, UCL RL course, Lecture 4: Model-Free Prediction (within the course above). The opening of Phase 3 in this track: estimating V^pi (and later Q^pi) from samples using Monte Carlo and TD methods. This is where the backup form used in value iteration starts being applied to sampled data instead of the model.

Where this leads inside this track.

  • Policy iteration. The previous lesson, and the other planning algorithm. VI is essentially PI with policy evaluation truncated to a single Bellman sweep before each “improvement,” and the GPI lens ties the two together.
  • Monte Carlo prediction. The next lesson and the start of Phase 3 (Model-free learning). The first lesson where you cannot use P and R; you estimate V from complete episode returns.
  • Q-learning. Lesson 8. Q-learning is the value-iteration update on Q with the expectation over P replaced by a single sampled transition, blended into the current estimate with a step size. Same recursion, sample-based.
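
The relationship in the Q-learning item above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the two-state MDP (states 0 and 1, actions "stay" and "move") is invented and is not the lesson's A/B example, and the discount, step size, step count, and function names are arbitrary choices. The first function backs up Q with the full expectation over P; the second builds the same target from one sampled transition and moves Q a small step toward it.

```python
import random

# Hypothetical 2-state, 2-action MDP, invented for this sketch only
# (it is NOT the lesson's A/B example).
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "move"]
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}


def q_value_iteration(sweeps=200):
    """Planning: back up every (s, a) with the full expectation over P."""
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(sweeps):
        # Synchronous sweep: the new table is built entirely from the old one.
        Q = {
            s: {
                a: sum(p * (r + GAMMA * max(Q[s2].values()))
                       for p, s2, r in P[s][a])
                for a in ACTIONS
            }
            for s in STATES
        }
    return Q


def q_learning(steps=100_000, alpha=0.05, seed=0):
    """Learning: same target, built from one sampled transition per update."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)  # uniformly random behaviour policy
        # P is used only to simulate the environment; the update never reads it.
        outcomes = P[s][a]
        weights = [p for p, _, _ in outcomes]
        _, s2, r = rng.choices(outcomes, weights=weights, k=1)[0]
        target = r + GAMMA * max(Q[s2].values())  # sampled Bellman target
        Q[s][a] += alpha * (target - Q[s][a])     # step toward the target
        s = s2
    return Q


def greedy(Q):
    """Greedy policy read off a Q table."""
    return {s: max(Q[s], key=Q[s].get) for s in STATES}


Q_plan = q_value_iteration()
Q_learn = q_learning()
print("planned :", greedy(Q_plan), Q_plan)
print("learned :", greedy(Q_learn), Q_learn)
```

The sketch uses a constant step size for simplicity, so the learned values keep fluctuating around the planned ones instead of settling exactly. A decaying step size is the usual choice when exact convergence matters.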