Summary: Value iteration
Value iteration is the simpler sibling of policy iteration: just apply the Bellman optimality equation as a direct update, sweep after sweep, until V (or the greedy policy) stops moving. No explicit policy maintained, no inner evaluation loop. It is the GPI extreme (one Bellman sweep per “improvement”) and the cleanest pre-figure of the model-free methods coming in Phase 3. This summary is the scan-in-five-minutes version of the full lesson.
Core ideas
Section titled “Core ideas”- The update. V_(k+1)(s) = max over a of [ R(s, a) + gamma * sum over s’ of P(s’ | s, a) V_k(s’) ]. The Bellman optimality equation as an assignment, swept over all states, repeated.
- Why it converges. The Bellman optimality operator T^* is a contraction in the sup-norm with constant gamma: || T^* U - T^* V ||_inf
<=gamma * || U - V ||_inf. So V_k -> V^* geometrically at rate gamma, from any starting V_0. Each iteration cuts the worst-case error by a factor of gamma. - The greedy policy stabilizes early. The implicit policy at iteration k is greedy w.r.t. V_k. It typically settles to pi^* long before V finishes converging numerically. Stopping VI when the greedy policy stops changing is a standard, often-large speedup.
- Worked on A/B at gamma = 0.9. Starting V_0 = (0, 0) the sweeps give V_1 = (1, 2), V_2 = (1.9, 2.9), V_3 = (2.71, 3.71), V_4 = (3.439, 4.439), creeping toward V^* = (10, 11). The greedy policy is (stay, switch) = pi^ from iteration 1*.
- VI as the GPI extreme. Three points on the same evaluate-then-improve spectrum: full PI (many evaluation sweeps per improvement), truncated PI (a few), and VI (exactly one). All converge to the same V^* and pi^*; the trade-off is cost-per-step against number-of-steps.
- Pre-figures the rest of the track. Q-learning is VI on Q with the expectation replaced by a sample. DQN is the same update with the table replaced by a neural network. The recursion you iterate here is the one those methods estimate.
What changes for you
Section titled “What changes for you”You now have the second (and simpler) of the two planning algorithms, and a habit you will keep using: iterate the Bellman update, and let the policy fall out as the greedy action. The early-stopping trick is the most practically useful piece, the greedy policy often settles in a handful of sweeps even when V’s last few digits keep crawling, so you do not need to wait for V to converge to act optimally. The bigger payoff is structural: the VI update is the shape of every later algorithm in the track, with the expectation either summed exactly (planning, Phase 2) or estimated from samples (model-free learning, Phase 3) and the value either tabular (here) or function-approximated (Phase 4). One recursion, three settings.