Skip to content

Cheatsheet: Policy iteration

Alternate (1) policy evaluation (solve V^pi via Bellman expectation) with (2) policy improvement (pi’ = greedy w.r.t. V^pi). In a finite MDP this provably converges to the optimal pi^*.

EVALUATION (Bellman expectation, solved or iterated):
V_{k+1}(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]
Converges to V^pi at rate gamma^k.
IMPROVEMENT (greedy w.r.t. V^pi):
pi'(s) = argmax_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V^pi(s') ]
= argmax_a Q^pi(s, a)
1. Initialize pi_0 (any policy; e.g. always action 1).
2. EVALUATE: compute V^(pi_k).
3. IMPROVE: pi_(k+1) = greedy(V^(pi_k)).
4. If pi_(k+1) == pi_k, STOP -- pi_k is the optimal pi^*.
Otherwise k <- k+1, go to 2.
- Policy-improvement theorem: pi' greedy w.r.t. V^pi => V^(pi')(s) >= V^pi(s) for all s
(strict in at least one state unless already optimal).
- Finitely many deterministic policies (|A|^|S|) + monotone improvement = termination.
- Stable policy is greedy w.r.t. its own value = Bellman optimality = optimal.
H/S MDP, gamma = 0.9, single-state-self-loop dynamics under "stay":
pi_0 = (A: stay, B: stay)
Eval: V(A) = 1 + 0.9 V(A) -> V(A) = 10; V(B) = -1 + 0.9 V(B) -> V(B) = -10
Improve at B: Q(B, stay) = -10, Q(B, switch) = 2 + 0.9*10 = 11 -> FLIP to switch
pi_1 = (A: stay, B: switch)
Eval: V(A) = 10; V(B) = 2 + 0.9*10 = 11
Improve: nothing changes (pi_2 = pi_1) -> STABLE.
pi^* = (A: stay, B: switch), V^*(A) = 10, V^*(B) = 11.
Almost every RL method = "estimate value -> improve policy greedily -> repeat", with
choices for HOW MUCH evaluation and WHERE the values come from.
Full PI : evaluate to convergence each round.
Truncated PI : a few sweeps per improvement.
Value iteration : one sweep per improvement (next lesson).
TD / Q-learning : interleave one sample at a time (Phase 3).
Actor-critic : explicit policy + value critic, interleaved.
  • Treating evaluation as one substitution (it is iterative in the cyclic case).
  • Assuming improvement makes things STRICTLY better in every state (only no worse, with at least one strict gain unless optimal).
  • Mixing the Bellman expectation (sum under pi) with the optimality equation (max over actions) in either step.
  • Inconsistent argmax tie-breaking (deterministic rule prevents chatter).
  • Believing tabular PI scales — big state spaces need function approximation (lesson 9).
  • Policy evaluation: computing V^pi for a fixed pi (Bellman expectation equation).
  • Policy improvement: pi’(s) = argmax_a Q^pi(s, a); greedy w.r.t. V^pi.
  • Policy-improvement theorem: greedy improvement gives V^(pi’)(s) >= V^pi(s) everywhere.
  • Generalized policy iteration (GPI): any method that alternates value estimation and greedy improvement.
  • Planning vs learning: planning knows P and R; learning has only samples (Phase 3).