Policy iteration: brief

What you’ll learn

This is lesson 4 of Track 17 (Reinforcement Learning Foundations) and the opener of Phase 2 (Planning with a known model). The previous lesson wrote the Bellman equations but left the cyclic case as a fixed-point system. Policy iteration is the first algorithm that actually solves that system, and it does it by alternating two steps that both follow directly from the previous lesson: evaluate the current policy (solve its Bellman expectation equation for V^pi), then improve the policy by acting greedily with respect to V^pi. The loop provably terminates at the optimal policy in any finite MDP. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.

The lesson lays out both phases precisely, gives the algorithm as a four-step loop, states the policy-improvement theorem and why it makes the algorithm converge, runs the procedure end-to-end on a two-state, two-action MDP through two iterations (one round flips a single action; the second round confirms the policy is stable), and introduces the generalized policy iteration lens, the family resemblance that ties value iteration, TD methods, Q-learning, and actor-critic together as variations on “evaluate-then-improve.”

Where this fits

This is lesson 4 of 10 and the first lesson of Phase 2. It uses everything from Phase 1: the MDP (lesson 2), the value functions and Bellman equations (lesson 3), and the optimal-policy-is-greedy-w.r.t.-Q^* architectural payoff from there. The next lesson, Value iteration, is the other major planning algorithm and a tighter way to iterate the Bellman optimality equation directly. Phase 3 will re-do the evaluate-then-improve loop from samples when P and R are unknown, which is what makes RL work in the real world.

Before you start

Prerequisites: the previous lesson (Value functions and the Bellman equations) for both the Bellman expectation and optimality equations. Comfort with sums over actions and over next states, and with reading argmax over actions. Small linear algebra (solving a 2x2 system) shows up but is not assumed; we use the simplest cases throughout.

About the math

The arithmetic is hand-sized: at each round you solve one or two small linear equations for V (often by recognizing a self-loop), compute Q for each action by one substitution, and pick an argmax. Every step in the two-state worked example is shown explicitly; the practice runs the same procedure on a fresh MDP. No new formulas beyond the Bellman expectation update and the greedy improvement step.

By the end, you’ll be able to

Explain the two phases of policy iteration (policy evaluation and policy improvement)
Compute V^pi for a small MDP by solving the Bellman expectation equation
Apply greedy policy improvement using the action-value Q^pi
Run policy iteration through several iterations on a small MDP and recognize the termination condition
State the policy-improvement theorem and why policy iteration converges in a finite MDP

Time and difficulty

Read time: about 13 minutes
Practice time: about 16 minutes (a self-check, a run-policy-iteration-end-to-end exercise on a fresh 2-state-2-action MDP, a what-the-improvement-theorem-promises true/false drill, and flashcards)
Difficulty: standard (concrete arithmetic, every step shown; this is the first complete RL algorithm in the track)