Cheatsheet: Policy iteration
The one idea
Section titled “The one idea”Alternate (1) policy evaluation (solve V^pi via Bellman expectation) with (2) policy improvement (pi’ = greedy w.r.t. V^pi). In a finite MDP this provably converges to the optimal pi^*.
The two phases
Section titled “The two phases”EVALUATION (Bellman expectation, solved or iterated): V_{k+1}(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ] Converges to V^pi at rate gamma^k.
IMPROVEMENT (greedy w.r.t. V^pi): pi'(s) = argmax_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V^pi(s') ] = argmax_a Q^pi(s, a)The algorithm
Section titled “The algorithm”1. Initialize pi_0 (any policy; e.g. always action 1).2. EVALUATE: compute V^(pi_k).3. IMPROVE: pi_(k+1) = greedy(V^(pi_k)).4. If pi_(k+1) == pi_k, STOP -- pi_k is the optimal pi^*. Otherwise k <- k+1, go to 2.Why it converges
Section titled “Why it converges”- Policy-improvement theorem: pi' greedy w.r.t. V^pi => V^(pi')(s) >= V^pi(s) for all s (strict in at least one state unless already optimal).- Finitely many deterministic policies (|A|^|S|) + monotone improvement = termination.- Stable policy is greedy w.r.t. its own value = Bellman optimality = optimal.Worked example (lesson)
Section titled “Worked example (lesson)”H/S MDP, gamma = 0.9, single-state-self-loop dynamics under "stay": pi_0 = (A: stay, B: stay) Eval: V(A) = 1 + 0.9 V(A) -> V(A) = 10; V(B) = -1 + 0.9 V(B) -> V(B) = -10 Improve at B: Q(B, stay) = -10, Q(B, switch) = 2 + 0.9*10 = 11 -> FLIP to switch pi_1 = (A: stay, B: switch) Eval: V(A) = 10; V(B) = 2 + 0.9*10 = 11 Improve: nothing changes (pi_2 = pi_1) -> STABLE.pi^* = (A: stay, B: switch), V^*(A) = 10, V^*(B) = 11.Generalized policy iteration (GPI) lens
Section titled “Generalized policy iteration (GPI) lens”Almost every RL method = "estimate value -> improve policy greedily -> repeat", withchoices for HOW MUCH evaluation and WHERE the values come from.
Full PI : evaluate to convergence each round. Truncated PI : a few sweeps per improvement. Value iteration : one sweep per improvement (next lesson). TD / Q-learning : interleave one sample at a time (Phase 3). Actor-critic : explicit policy + value critic, interleaved.Pitfalls to dodge
Section titled “Pitfalls to dodge”- Treating evaluation as one substitution (it is iterative in the cyclic case).
- Assuming improvement makes things STRICTLY better in every state (only no worse, with at least one strict gain unless optimal).
- Mixing the Bellman expectation (sum under pi) with the optimality equation (max over actions) in either step.
- Inconsistent argmax tie-breaking (deterministic rule prevents chatter).
- Believing tabular PI scales — big state spaces need function approximation (lesson 9).
Words to use precisely
Section titled “Words to use precisely”- Policy evaluation: computing V^pi for a fixed pi (Bellman expectation equation).
- Policy improvement: pi’(s) = argmax_a Q^pi(s, a); greedy w.r.t. V^pi.
- Policy-improvement theorem: greedy improvement gives V^(pi’)(s) >= V^pi(s) everywhere.
- Generalized policy iteration (GPI): any method that alternates value estimation and greedy improvement.
- Planning vs learning: planning knows P and R; learning has only samples (Phase 3).