Policy iteration: cheatsheet

The one idea

Alternate (1) policy evaluation (solve V^pi via Bellman expectation) with (2) policy improvement (pi’ = greedy w.r.t. V^pi). In a finite MDP this provably converges to the optimal pi^*.

The two phases

EVALUATION (Bellman expectation, solved or iterated):
  V_{k+1}(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_(s') P(s'|s, a) V_k(s') ]
  Converges to V^pi at rate gamma^k.

IMPROVEMENT (greedy w.r.t. V^pi):
  pi'(s) = argmax_a [ R(s, a) + gamma * sum_(s') P(s'|s, a) V^pi(s') ]
         = argmax_a Q^pi(s, a)

The algorithm

1. Initialize pi_0 (any policy; e.g. always action 1).
2. EVALUATE: compute V^(pi_k).
3. IMPROVE:  pi_(k+1) = greedy(V^(pi_k)).
4. If pi_(k+1) == pi_k, STOP -- pi_k is the optimal pi^*.
   Otherwise k <- k+1, go to 2.

Why it converges

- Policy-improvement theorem: pi' greedy w.r.t. V^pi  =>  V^(pi')(s) >= V^pi(s) for all s
  (strict in at least one state unless already optimal).
- Finitely many deterministic policies (|A|^|S|) + monotone improvement = termination.
- Stable policy is greedy w.r.t. its own value = Bellman optimality = optimal.

Worked example (lesson)

H/S MDP, gamma = 0.9, single-state-self-loop dynamics under "stay":
  pi_0 = (A: stay, B: stay)
    Eval: V(A) = 1 + 0.9 V(A) -> V(A) = 10;   V(B) = -1 + 0.9 V(B) -> V(B) = -10
    Improve at B: Q(B, stay) = -10, Q(B, switch) = 2 + 0.9*10 = 11 -> FLIP to switch
  pi_1 = (A: stay, B: switch)
    Eval: V(A) = 10; V(B) = 2 + 0.9*10 = 11
    Improve: nothing changes (pi_2 = pi_1) -> STABLE.
pi^* = (A: stay, B: switch),  V^*(A) = 10,  V^*(B) = 11.

Generalized policy iteration (GPI) lens

Almost every RL method = "estimate value -> improve policy greedily -> repeat", with
choices for HOW MUCH evaluation and WHERE the values come from.

  Full PI         : evaluate to convergence each round.
  Truncated PI    : a few sweeps per improvement.
  Value iteration : one sweep per improvement (next lesson).
  TD / Q-learning : interleave one sample at a time (Phase 3).
  Actor-critic    : explicit policy + value critic, interleaved.

Pitfalls to dodge

Treating evaluation as one substitution (it is iterative in the cyclic case).
Assuming improvement makes things STRICTLY better in every state (only no worse, with at least one strict gain unless optimal).
Mixing the Bellman expectation (sum under pi) with the optimality equation (max over actions) in either step.
Inconsistent argmax tie-breaking (deterministic rule prevents chatter).
Believing tabular PI scales — big state spaces need function approximation (lesson 9).

Words to use precisely

Policy evaluation: computing V^pi for a fixed pi (Bellman expectation equation).
Policy improvement: pi’(s) = argmax_a Q^pi(s, a); greedy w.r.t. V^pi.
Policy-improvement theorem: greedy improvement gives V^(pi’)(s) >= V^pi(s) everywhere.
Generalized policy iteration (GPI): any method that alternates value estimation and greedy improvement.
Planning vs learning: planning knows P and R; learning has only samples (Phase 3).