Value iteration: cheatsheet

The one idea

Value iteration applies the Bellman optimality equation as a direct update. The Bellman optimality operator is a contraction with rate gamma, so V converges geometrically; the greedy policy stabilizes long before V does.

The update

V_(k+1)(s) = max_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]

Sweep all states; repeat until V (or the greedy policy) stops moving.

Why it converges

The Bellman OPTIMALITY operator T^* is a contraction in sup-norm with constant gamma:
  || T^* U - T^* V ||_inf  <=  gamma * || U - V ||_inf

=> V_k -> V^* geometrically at RATE gamma, from any starting V_0.
   Error after k iterations <= constant * gamma^k.

VI update vs policy-evaluation update

Evaluation: V_(k+1)(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]
  -> computes V^pi for a given policy pi (Bellman EXPECTATION).

VI:         V_(k+1)(s) = max_a            [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]
  -> computes V^* directly (Bellman OPTIMALITY).

Worked example (A/B, gamma = 0.9, V_0 = (0, 0))

V_1 = (1,    2)         Greedy: (stay, switch) = pi^*  (already!)
V_2 = (1.9,  2.9)       Greedy: (stay, switch)         (unchanged)
V_3 = (2.71, 3.71)
V_4 = (3.439, 4.439)    creeping toward V^* = (10, 11)

Greedy policy stabilizes at iteration 1; V is still far from V^* after four sweeps.

VI on the GPI spectrum

Full PI         : many evaluation sweeps per improvement.
Truncated PI    : a few sweeps per improvement.
Value iteration : EXACTLY ONE Bellman sweep per "improvement" (improvement baked into max).
All converge to the same V^* / pi^*; trade cost-per-step vs steps-to-converge.

Pre-figures Phase 3 + 4

Q-learning : VI update on Q with the expectation replaced by a SAMPLE
             (from (s, a, r, s'): nudge Q(s, a) toward r + gamma * max_a' Q(s', a')).
DQN        : same update with the table replaced by a neural-network APPROXIMATOR.

Stopping rule

Practical:  stop when the GREEDY POLICY stops changing (often early).
Theoretical: stop when || V_(k+1) - V_k ||_inf < epsilon.

Pitfalls to dodge

Confusing VI’s max-over-actions with policy evaluation’s sum-under-pi.
Waiting for V to fully converge instead of acting on the stabilized greedy policy.
Forgetting bounded rewards (the contraction needs them).
Treating VI as automatically fast (geometric rate gamma means slow when gamma is near 1).
Maintaining the policy explicitly during VI (no need; it is implicit).

Words to use precisely

Bellman optimality operator T^*: maps V to max_a [ R(s, a) + gamma * sum_(s’) P(s’|s, a) V(s’) ].
Contraction: a map that shrinks distances by a fixed factor (here gamma in sup-norm).
Fixed point: V^; the unique V with T^ V = V.
Generalized policy iteration (GPI): VI / truncated PI / full PI are points on this spectrum.
Early stopping by policy stabilization: practical VI stop rule.