Cheatsheet: Value iteration
The one idea
Section titled “The one idea”Value iteration applies the Bellman optimality equation as a direct update. The Bellman optimality operator is a contraction with rate gamma, so V converges geometrically; the greedy policy stabilizes long before V does.
The update
Section titled “The update”V_(k+1)(s) = max_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]Sweep all states; repeat until V (or the greedy policy) stops moving.
Why it converges
Section titled “Why it converges”The Bellman OPTIMALITY operator T^* is a contraction in sup-norm with constant gamma: || T^* U - T^* V ||_inf <= gamma * || U - V ||_inf
=> V_k -> V^* geometrically at RATE gamma, from any starting V_0. Error after k iterations <= constant * gamma^k.VI update vs policy-evaluation update
Section titled “VI update vs policy-evaluation update”Evaluation: V_(k+1)(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ] -> computes V^pi for a given policy pi (Bellman EXPECTATION).
VI: V_(k+1)(s) = max_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ] -> computes V^* directly (Bellman OPTIMALITY).Worked example (A/B, gamma = 0.9, V_0 = (0, 0))
Section titled “Worked example (A/B, gamma = 0.9, V_0 = (0, 0))”V_1 = (1, 2) Greedy: (stay, switch) = pi^* (already!)V_2 = (1.9, 2.9) Greedy: (stay, switch) (unchanged)V_3 = (2.71, 3.71)V_4 = (3.439, 4.439) creeping toward V^* = (10, 11)Greedy policy stabilizes at iteration 1; V is still far from V^* after four sweeps.
VI on the GPI spectrum
Section titled “VI on the GPI spectrum”Full PI : many evaluation sweeps per improvement.Truncated PI : a few sweeps per improvement.Value iteration : EXACTLY ONE Bellman sweep per "improvement" (improvement baked into max).All converge to the same V^* / pi^*; trade cost-per-step vs steps-to-converge.Pre-figures Phase 3 + 4
Section titled “Pre-figures Phase 3 + 4”Q-learning : VI update on Q with the expectation replaced by a SAMPLE (from (s, a, r, s'): nudge Q(s, a) toward r + gamma * max_a' Q(s', a')).DQN : same update with the table replaced by a neural-network APPROXIMATOR.Stopping rule
Section titled “Stopping rule”Practical: stop when the GREEDY POLICY stops changing (often early).Theoretical: stop when || V_(k+1) - V_k ||_inf < epsilon.Pitfalls to dodge
Section titled “Pitfalls to dodge”- Confusing VI’s max-over-actions with policy evaluation’s sum-under-pi.
- Waiting for V to fully converge instead of acting on the stabilized greedy policy.
- Forgetting bounded rewards (the contraction needs them).
- Treating VI as automatically fast (geometric rate gamma means slow when gamma is near 1).
- Maintaining the policy explicitly during VI (no need; it is implicit).
Words to use precisely
Section titled “Words to use precisely”- Bellman optimality operator T^*: maps V to max_a [ R(s, a) + gamma * sum_(s’) P(s’|s, a) V(s’) ].
- Contraction: a map that shrinks distances by a fixed factor (here gamma in sup-norm).
- Fixed point: V^; the unique V with T^ V = V.
- Generalized policy iteration (GPI): VI / truncated PI / full PI are points on this spectrum.
- Early stopping by policy stabilization: practical VI stop rule.