Value iteration: brief

What you’ll learn

This is lesson 5 of Track 17 (Reinforcement Learning Foundations) and the close of Phase 2 (Planning with a known model). The previous lesson built policy iteration, which alternated full evaluation with greedy improvement. Value iteration is the simpler sibling: it interleaves the two completely by applying the Bellman optimality equation as a direct update, sweep after sweep, with no explicit policy until you ask for one. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.

The lesson writes the value-iteration update, explains why it converges (the Bellman optimality operator is a contraction in the sup-norm with constant gamma, giving geometric convergence at rate gamma), runs four iterations on the same two-state MDP as the previous lesson so the comparison is direct, shows that the greedy policy is already pi^* from the first iteration even though V is still creeping toward V^*, places value iteration at the extreme of the generalized-policy-iteration spectrum, and previews how the same update reappears in Q-learning (with samples) and DQN (with function approximation).

Where this fits

This is lesson 5 of 10, the final lesson of Phase 2. It uses everything from the Bellman-equation lesson (lesson 3) and the policy-iteration lesson (lesson 4), and it closes the planning phase. Phase 3 (lessons 6-8) re-does the same evaluate-and-improve idea from samples, with Monte Carlo and TD methods replacing the model-based evaluation step. Q-learning specifically is value iteration’s update form applied sample-by-sample.

Before you start

Prerequisites: the previous lesson (Policy iteration) and lesson 3 (Value functions and the Bellman equations). Comfort with the Bellman optimality equation, sums over next states, and max over actions. Small arithmetic and the contraction-mapping intuition (a fact, not a proof in this lesson) are all that is needed.

About the math

The arithmetic is hand-sized: at each iteration you compute, for each state, the Q values for every action (immediate reward plus discounted expected next-state value) and take the maximum. The A/B worked example runs four iterations explicitly; the practice runs a fresh MDP for three iterations. The contraction-mapping fact (||T^* U - T^* V||_inf <= gamma * ||U - V||_inf) is stated and used as the reason convergence is geometric; the proof is left to the textbook.

By the end, you’ll be able to

Write the value-iteration update as a direct sweep of the Bellman optimality equation
Run value iteration through several iterations on a small MDP
Explain why the Bellman optimality operator is a contraction with rate gamma and why that guarantees convergence to V^*
Recognize that the greedy policy often stabilizes well before V converges, and use that as a stopping rule
Place value iteration as the extreme case of generalized policy iteration (one Bellman sweep per “improvement”)

Time and difficulty

Read time: about 13 minutes
Practice time: about 16 minutes (a self-check, a run-value-iteration exercise on a fresh small MDP, a stopping-rule and PI-vs-VI comparison drill, and flashcards)
Difficulty: standard (concrete arithmetic, every iteration shown; the contraction property is stated as a fact, not derived)