References: Policy iteration

Source material

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 3:
  Planning by Dynamic Programming (policy iteration section)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause is consistent with Clawdemy's own CC BY-NC-SA 4.0 license; both forbid commercial use without permission. Commercial use is licensed separately at [/legal/licensing](/legal/licensing/).
All rights to the original materials remain with the author and UCL.

Source-scope note: this lesson mirrors the policy-iteration section of
Silver's Lecture 3 and restates it in Clawdemy's voice with an original
two-state two-action worked example that runs end-to-end (two iterations,
one policy flip, then stable). The explicit policy-improvement theorem
statement, the finitely-many-deterministic-policies termination argument,
the GPI lens, and the planning-vs-learning framing that explicitly defers
the sample-based version to Phase 3 are Clawdemy framing. Value iteration
(the other algorithm in Silver's Lecture 3) is held back to the next lesson.
Exact per-lecture URLs are verified at promotion.
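
For orientation, here is a minimal sketch of the evaluate-then-improve loop on a two-state, two-action MDP. The transition table, rewards, discount factor, and all function names are assumptions invented for this illustration; they are not the lesson's worked example and not Silver's code. The numbers were chosen only so that the run has the same shape as the one described above: two iterations, one policy flip, then stable.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical model. P[s][a] is a list of (probability, next_state) pairs;
# R[s][a] is the expected immediate reward. Action 0 = "stay", 1 = "move".
P = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
R = {
    0: {0: 0.0, 1: 0.0},
    1: {0: 1.0, 1: 0.0},
}


def q_value(s, a, V):
    """One-step lookahead: reward plus discounted expected next value."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate(policy, theta=1e-10):
    """Iterative policy evaluation: sweep until values change by < theta."""
    V = [0.0 for _ in P]
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(s, policy[s], V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update
        if delta < theta:
            return V


def improve(policy, V):
    """Greedy improvement; keep the old action unless another is strictly better."""
    new_policy = list(policy)
    for s in P:
        best = max(P[s], key=lambda a: q_value(s, a, V))
        if q_value(s, best, V) > q_value(s, policy[s], V) + 1e-12:
            new_policy[s] = best
    return new_policy


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    rounds = 0
    while True:
        V = evaluate(policy)
        new_policy = improve(policy, V)
        rounds += 1
        if new_policy == policy:
            return policy, V, rounds
        policy = new_policy


policy, V, rounds = policy_iteration([0, 0])
print(policy, [round(v, 3) for v in V], rounds)
```

Starting from "stay everywhere", the first improvement step flips state 0 to "move"; the second round finds nothing to change, so the loop stops. Keeping the old action on ties is a design choice in this sketch that avoids oscillating between equally good policies.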

Read this next

David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming. The lecture this lesson mirrors, with both policy iteration and value iteration developed alongside a worked gridworld example. CC BY-NC 4.0, freely available. The next Track 17 lesson is on value iteration, which Silver presents in the same lecture.

Going deeper

A short, durable list. Both are free.

Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 4 (Dynamic Programming). The standard textbook’s treatment of policy iteration, value iteration, asynchronous DP, and the GPI lens, with the small gridworld example worked carefully.
David Silver, UCL RL course, Lecture 4: Model-Free Prediction (within the course above). The bridge into Phase 3: estimating V^pi from samples (Monte Carlo and TD), which is policy evaluation without a model. The evaluate-then-improve template from this lesson reappears in sample form.

Adjacent topics

Where this leads inside this track.

Value functions and the Bellman equations. The previous lesson. Policy iteration is the first algorithm that actually solves the Bellman expectation equation that lesson wrote down.
Value iteration. The next lesson. The other major dynamic-programming algorithm: iterate the Bellman optimality equation (max over actions) directly, instead of alternating full policy evaluation with policy improvement.
Model-free learning (Phase 3). Lessons 6-8. The same evaluate-then-improve idea reappears, but the value estimation step is done from samples (Monte Carlo, TD) instead of using P and R.
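
The contrast between this lesson and the next can be written as two backups. The notation here is an assumption for illustration (a deterministic policy \pi, reward R(s,a), transition probabilities P(s' | s,a), discount \gamma); the lessons themselves may use different symbols. The first line is the Bellman expectation backup used inside policy evaluation; the second is the Bellman optimality backup that value iteration applies directly.

```latex
\begin{aligned}
V_{k+1}(s) &= R\bigl(s, \pi(s)\bigr)
  + \gamma \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\, V_k(s')
  && \text{(policy evaluation)} \\
V_{k+1}(s) &= \max_{a} \Bigl[ R(s, a)
  + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \Bigr]
  && \text{(value iteration)}
\end{aligned}
```

Both backups need P and R, which is why they count as planning. The model-free lessons keep the same structure but estimate the value from sampled transitions instead.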