References: Value iteration

Source material

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 3:
  Planning by Dynamic Programming (value iteration section + contraction
  mapping theorem)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause is consistent with Clawdemy's own CC BY-NC-SA 4.0 license; both forbid commercial use without permission. Commercial use is licensed separately at [/legal/licensing](/legal/licensing/).
All rights to the original materials remain with the author and UCL.

Source-scope note: this lesson mirrors the value-iteration material in
Silver's Lecture 3 and restates it in Clawdemy's voice with a worked A/B
example deliberately re-using the previous lesson's MDP so the PI-vs-VI
comparison is direct. The explicit greedy-policy-stabilizes-early observation
(with the V(B) - V(A) constant computation that proves the policy is fixed
from iteration 1), the GPI-spectrum placement, and the pre-figure of
Q-learning and DQN as VI with samples / function approximation are Clawdemy
framing. Exact per-lecture URLs are verified at promotion.

Read this next

David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming. The lecture that both this lesson and the previous one draw from, with policy iteration and value iteration developed alongside an explicit treatment of the contraction-mapping theorem. CC BY-NC 4.0, freely available.

Going deeper

A short, durable list. Both are free.

Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 4 (Dynamic Programming), Sections 4.4-4.5. Standard textbook treatment of value iteration, asynchronous DP, and the contraction-based convergence proof, with worked gridworld examples.
David Silver, UCL RL course, Lecture 4: Model-Free Prediction (within the course above). The opening of Phase 3 in this track: estimating V^pi (and later Q^pi) from samples using Monte Carlo and TD methods. This is where the value-iteration update form starts being applied to sampled data instead of the model.

Adjacent topics

Where this leads inside this track.

Policy iteration. The previous lesson. The other planning algorithm; VI is essentially PI with policy evaluation cut to a single Bellman sweep before each “improvement,” and the GPI lens ties them together.
Monte Carlo prediction. The next lesson and the start of Phase 3 (Model-free learning). The first lesson where you cannot use P and R; you estimate V from complete episode returns.
Q-learning. Lesson 8. Q-learning is the value-iteration update on Q with the expectation over P replaced by a single sampled transition, blended into the current estimate with a step size. Same recursion, sample-based.
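
To make the link between value iteration and Q-learning concrete, here is a minimal sketch that runs both on the same toy problem. Everything in it is an assumption made for illustration: the two-state MDP in `P` is hypothetical (it is not the lesson's A/B example), and the discount `GAMMA`, the uniform-random behaviour policy, and the decaying step-size schedule are arbitrary choices rather than anything prescribed by the lecture.

```python
import random

# Hypothetical two-state, two-action MDP, invented for this sketch.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # assumed discount factor
STATES, ACTIONS = (0, 1), (0, 1)


def q_value_iteration(sweeps=500):
    """Model-based: back up every (s, a) with the full expectation over P."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        # Synchronous sweep: the new table is built entirely from the old one.
        q = {
            (s, a): sum(
                p * (r + GAMMA * max(q[(s2, b)] for b in ACTIONS))
                for p, s2, r in P[s][a]
            )
            for s in STATES
            for a in ACTIONS
        }
    return q


def q_learning(steps=200_000, seed=0):
    """Sample-based: same target, built from one sampled transition."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    visits = {(s, a): 0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)  # uniform-random behaviour policy
        outcomes = P[s][a]
        # P is used here only as a simulator that emits one transition.
        _, s2, r = rng.choices(outcomes, weights=[o[0] for o in outcomes])[0]
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)] ** 0.7  # assumed decaying step size
        target = r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])  # move toward sampled target
        s = s2
    return q
```

Calling `q_value_iteration()` and `q_learning()` and comparing the two tables entry by entry shows the relationship: the target `r + GAMMA * max(...)` is the same in both functions, and the only differences are whether it is averaged over all outcomes in `P` or taken from one sample, and whether it overwrites the entry or nudges it by `alpha`.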