References: Value iteration
Source material
Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 3: Planning by Dynamic Programming (value iteration section + contraction mapping theorem)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this course. We do not embed, reproduce, or transcribe Silver's slides or video lectures; we link out to the relevant lecture as recommended further study. The non-commercial clause aligns with Clawdemy's free, zero-revenue posture. All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the value-iteration material in Silver's Lecture 3 and restates it in Clawdemy's voice, with a worked A/B example that deliberately re-uses the previous lesson's MDP so the PI-vs-VI comparison is direct. The explicit greedy-policy-stabilizes-early observation (with the V(B) - V(A) constant computation that proves the policy is fixed from iteration 1), the GPI-spectrum placement, and the prefiguring of Q-learning and DQN as VI with samples / function approximation are Clawdemy framing. Exact per-lecture URLs are verified at promotion.
Read this next
- David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming. The lecture this lesson and the previous one both draw from, with policy iteration and value iteration developed alongside an explicit treatment of the contraction-mapping theorem. CC BY-NC 4.0, freely available.
Going deeper
A short, durable list. Both are free.
- Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 4 (Dynamic Programming), Sections 4.4-4.5. Standard textbook treatment of value iteration, asynchronous DP, and the contraction-based convergence proof, with worked gridworld examples.
- David Silver, UCL RL course, Lecture 4: Model-Free Prediction (within the course above). The opening of Phase 3 in this track: estimating V^pi (and later Q^pi) from samples using Monte Carlo and TD methods. Where the value-iteration update form starts being applied to data instead of the model.
Adjacent topics
Where this leads inside this track.
- Policy iteration. The previous lesson. The other planning algorithm; VI is essentially PI with one Bellman sweep per “improvement,” and the GPI lens ties them together.
- Monte Carlo prediction. The next lesson and the start of Phase 3 (Model-free learning). The first lesson where you cannot use P and R; you estimate V from complete episode returns.
- Q-learning. Lesson 8. Q-learning is exactly the value-iteration update on Q with the expectation over P replaced by a single sampled transition. Same recursion, sample-based.
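The relationship between value iteration and Q-learning noted above can be sketched in a few lines. This is a minimal illustration, not the lesson's own code: the two-state MDP, its rewards, and the names `P`, `q_value_iteration`, `sample_transition`, and `q_learning` are all hypothetical and chosen only for this example (it is not the lesson's A/B MDP). The step size, step count, and uniform random behaviour policy are likewise assumptions.

```python
import random

# Hypothetical 2-state, 2-action MDP (NOT the lesson's A/B example).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)


def q_value_iteration(sweeps=500):
    """Model-based backup: full expectation over P for every (s, a)."""
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(sweeps):
        # Synchronous sweep: the new table is built entirely from the old one.
        Q = {
            s: {
                a: sum(p * (r + GAMMA * max(Q[s2].values()))
                       for p, s2, r in P[s][a])
                for a in ACTIONS
            }
            for s in STATES
        }
    return Q


def sample_transition(s, a, rng):
    """Draw one (next_state, reward) from P; stands in for the environment."""
    u, acc = rng.random(), 0.0
    for p, s2, r in P[s][a]:
        acc += p
        if u <= acc:
            return s2, r
    return s2, r  # guard against floating-point round-off


def q_learning(steps=300_000, alpha=0.005, seed=0):
    """Sample-based update: same target, one sampled transition per step."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)  # assumed uniform random behaviour policy
        s2, r = sample_transition(s, a, rng)
        target = r + GAMMA * max(Q[s2].values())  # no sum over P here
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
    return Q


if __name__ == "__main__":
    print("value iteration:", q_value_iteration())
    print("Q-learning:     ", q_learning())
```

The only structural difference between the two functions is the line that forms the target: `q_value_iteration` averages `r + GAMMA * max(...)` over every outcome in `P`, while `q_learning` uses a single sampled outcome and moves the estimate a small step toward it. With a constant step size the Q-learning table hovers near the value-iteration fixed point rather than matching it exactly.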