Skip to content

Summary: Value iteration

Value iteration is the simpler sibling of policy iteration: just apply the Bellman optimality equation as a direct update, sweep after sweep, until V (or the greedy policy) stops moving. No explicit policy maintained, no inner evaluation loop. It is the GPI extreme (one Bellman sweep per “improvement”) and the cleanest pre-figure of the model-free methods coming in Phase 3. This summary is the scan-in-five-minutes version of the full lesson.

  • The update. V_(k+1)(s) = max over a of [ R(s, a) + gamma * sum over s’ of P(s’ | s, a) V_k(s’) ]. The Bellman optimality equation as an assignment, swept over all states, repeated.
  • Why it converges. The Bellman optimality operator T^* is a contraction in the sup-norm with constant gamma: || T^* U - T^* V ||_inf <= gamma * || U - V ||_inf. So V_k -> V^* geometrically at rate gamma, from any starting V_0. Each iteration cuts the worst-case error by a factor of gamma.
  • The greedy policy stabilizes early. The implicit policy at iteration k is greedy w.r.t. V_k. It typically settles to pi^* long before V finishes converging numerically. Stopping VI when the greedy policy stops changing is a standard, often-large speedup.
  • Worked on A/B at gamma = 0.9. Starting V_0 = (0, 0) the sweeps give V_1 = (1, 2), V_2 = (1.9, 2.9), V_3 = (2.71, 3.71), V_4 = (3.439, 4.439), creeping toward V^* = (10, 11). The greedy policy is (stay, switch) = pi^ from iteration 1*.
  • VI as the GPI extreme. Three points on the same evaluate-then-improve spectrum: full PI (many evaluation sweeps per improvement), truncated PI (a few), and VI (exactly one). All converge to the same V^* and pi^*; the trade-off is cost-per-step against number-of-steps.
  • Pre-figures the rest of the track. Q-learning is VI on Q with the expectation replaced by a sample. DQN is the same update with the table replaced by a neural network. The recursion you iterate here is the one those methods estimate.

You now have the second (and simpler) of the two planning algorithms, and a habit you will keep using: iterate the Bellman update, and let the policy fall out as the greedy action. The early-stopping trick is the most practically useful piece, the greedy policy often settles in a handful of sweeps even when V’s last few digits keep crawling, so you do not need to wait for V to converge to act optimally. The bigger payoff is structural: the VI update is the shape of every later algorithm in the track, with the expectation either summed exactly (planning, Phase 2) or estimated from samples (model-free learning, Phase 3) and the value either tabular (here) or function-approximated (Phase 4). One recursion, three settings.