Skip to content

Cheatsheet: Value iteration

Value iteration applies the Bellman optimality equation as a direct update. The Bellman optimality operator is a contraction with rate gamma, so V converges geometrically; the greedy policy stabilizes long before V does.

V_(k+1)(s) = max_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]

Sweep all states; repeat until V (or the greedy policy) stops moving.

The Bellman OPTIMALITY operator T^* is a contraction in sup-norm with constant gamma:
|| T^* U - T^* V ||_inf <= gamma * || U - V ||_inf
=> V_k -> V^* geometrically at RATE gamma, from any starting V_0.
Error after k iterations <= constant * gamma^k.
Evaluation: V_(k+1)(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]
-> computes V^pi for a given policy pi (Bellman EXPECTATION).
VI: V_(k+1)(s) = max_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]
-> computes V^* directly (Bellman OPTIMALITY).

Worked example (A/B, gamma = 0.9, V_0 = (0, 0))

Section titled “Worked example (A/B, gamma = 0.9, V_0 = (0, 0))”
V_1 = (1, 2) Greedy: (stay, switch) = pi^* (already!)
V_2 = (1.9, 2.9) Greedy: (stay, switch) (unchanged)
V_3 = (2.71, 3.71)
V_4 = (3.439, 4.439) creeping toward V^* = (10, 11)

Greedy policy stabilizes at iteration 1; V is still far from V^* after four sweeps.

Full PI : many evaluation sweeps per improvement.
Truncated PI : a few sweeps per improvement.
Value iteration : EXACTLY ONE Bellman sweep per "improvement" (improvement baked into max).
All converge to the same V^* / pi^*; trade cost-per-step vs steps-to-converge.
Q-learning : VI update on Q with the expectation replaced by a SAMPLE
(from (s, a, r, s'): nudge Q(s, a) toward r + gamma * max_a' Q(s', a')).
DQN : same update with the table replaced by a neural-network APPROXIMATOR.
Practical: stop when the GREEDY POLICY stops changing (often early).
Theoretical: stop when || V_(k+1) - V_k ||_inf < epsilon.
  • Confusing VI’s max-over-actions with policy evaluation’s sum-under-pi.
  • Waiting for V to fully converge instead of acting on the stabilized greedy policy.
  • Forgetting bounded rewards (the contraction needs them).
  • Treating VI as automatically fast (geometric rate gamma means slow when gamma is near 1).
  • Maintaining the policy explicitly during VI (no need; it is implicit).
  • Bellman optimality operator T^*: maps V to max_a [ R(s, a) + gamma * sum_(s’) P(s’|s, a) V(s’) ].
  • Contraction: a map that shrinks distances by a fixed factor (here gamma in sup-norm).
  • Fixed point: V^; the unique V with T^ V = V.
  • Generalized policy iteration (GPI): VI / truncated PI / full PI are points on this spectrum.
  • Early stopping by policy stabilization: practical VI stop rule.