Summary: Policy iteration

Policy iteration alternates two simple steps to find the optimal policy: evaluate the current policy (solve its Bellman expectation equation), then improve it greedily. In any finite MDP this loop terminates at pi^*. Beyond the algorithm itself, the evaluate-then-improve pattern is the template almost every later RL method imitates. This summary is the scan-in-five-minutes version of the full lesson.

Core ideas

Two phases. Evaluation: for the current pi, compute V^pi by solving (or iterating) the Bellman expectation equation. Improvement: set pi’(s) = argmax_a [R(s, a) + gamma * sum_(s’) P(s’|s, a) V^pi(s’)]. Repeat.
Termination. Stop when pi’_k = pi_k. That means the policy is greedy with respect to its own value, which exactly satisfies the Bellman optimality equation, so the policy is optimal.
The improvement theorem. A greedy improvement gives a policy no worse than the previous in any state, and strictly better in at least one state unless already optimal. Monotone improvement on a finite set of deterministic policies guarantees termination at pi^*.
Worked: two-state, two-action MDP, gamma = 0.9. Starting from pi_0 = “always stay” gives V(A) = 10, V(B) = -10. Improvement flips B’s action to “switch” (Q jumps from -10 to +11), giving pi_1 = (A: stay, B: switch). Re-evaluating gives V(A) = 10, V(B) = 11; improvement does not change pi, so pi_1 is optimal. Two rounds: one flip, one confirm.
Generalized policy iteration (GPI). Almost every RL method is some flavor of “estimate value, improve policy greedily, repeat.” Value iteration (next lesson) is GPI with only one Bellman sweep per improvement; TD methods (Phase 3) interleave the two steps sample-by-sample; actor-critic learns an explicit policy guided by a critic’s value. All are GPI.
Planning, not learning. Policy iteration here uses P and R directly. Phase 3 re-does the same evaluate-then-improve loop from samples when P and R are unknown, which is what makes RL work in real systems.

What changes for you

You have the first complete RL algorithm, with a convergence guarantee in finite MDPs and a clean pedagogical anchor: improve, evaluate, repeat until nothing changes. The mental model is the more valuable takeaway than the procedure. When you read about almost any later RL method, you can ask: where is the value-estimation step? where is the policy-improvement step? how are they being interleaved? Value iteration (next lesson) and the TD methods of Phase 3 will both register as different ways to instantiate the same evaluate-then-improve loop, with different choices for how thoroughly to evaluate and how the values are computed when the model is unknown. The architectural payoff from the previous lesson (greedy w.r.t. Q^* is optimal) is what makes the improvement step work; this lesson is where that payoff turns into a method.