References: Policy iteration
Source material
Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 3: Planning by Dynamic Programming (policy iteration section)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this course. We do not embed, reproduce, or transcribe Silver's slides or video lectures; we link out to the relevant lecture as recommended further study. The non-commercial clause aligns with Clawdemy's free, zero-revenue posture. All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the policy-iteration section of Silver's Lecture 3 and restates it in Clawdemy's voice with an original two-state two-action worked example that runs end-to-end (two iterations, one policy flip, then stable). The explicit policy-improvement theorem statement, the finitely-many-deterministic-policies termination argument, the GPI lens, and the planning-vs-learning framing that explicitly defers the sample-based version to Phase 3 are Clawdemy framing. Value iteration (the other algorithm in Silver's Lecture 3) is held back to the next lesson. Exact per-lecture URLs are verified at promotion.
Read this next
- David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming. The lecture this lesson mirrors, with both policy iteration and value iteration developed alongside a worked gridworld example. CC BY-NC 4.0, freely available. The next Track 17 lesson is on value iteration, which Silver presents in the same lecture.
Going deeper
A short, durable list. Both are free.
- Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 4 (Dynamic Programming). The standard textbook’s treatment of policy iteration, value iteration, asynchronous DP, and the GPI lens, with the small gridworld example worked carefully.
- David Silver, UCL RL course, Lecture 4: Model-Free Prediction (within the course above). The bridge into Phase 3: estimating V^pi from samples (Monte Carlo and TD), which is policy evaluation without a model. The evaluate-then-improve template from this lesson reappears in sample form.
Adjacent topics
Where this leads inside this track.
- Value functions and the Bellman equations. The previous lesson. Policy iteration is the first algorithm that actually solves the Bellman expectation equation that lesson wrote down.
- Value iteration. The next lesson. The other major dynamic-programming algorithm: iterate the Bellman OPTIMALITY equation (max over actions) directly, instead of full policy evaluation + improvement.
- Model-free learning (Phase 3). Lessons 6-8. The same evaluate-then-improve idea reappears, but the value estimation step is done from samples (Monte Carlo, TD) instead of using P and R.
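The list above contrasts policy iteration (full evaluation, then greedy improvement) with value iteration (iterating the Bellman optimality backup directly). A minimal Python sketch of that contrast follows. Everything in it is an assumption made for illustration: the two-state, two-action MDP, its rewards, the discount of 0.9, and the function names are hypothetical and are not the lesson's worked example or Silver's gridworld. It assumes NumPy is available.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# Actions: 0 = stay, 1 = move.
# P[s, a, s2] is the transition probability, R[s, a] the expected reward.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0  # "stay" keeps the current state
P[0, 1, 1] = P[1, 1, 0] = 1.0  # "move" switches to the other state
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
GAMMA = 0.9  # assumed discount factor


def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    n = len(policy)
    idx = np.arange(n)
    P_pi = P[idx, policy]  # (n, n) transition matrix under the policy
    R_pi = R[idx, policy]  # (n,) reward vector under the policy
    return np.linalg.solve(np.eye(n) - GAMMA * P_pi, R_pi)


def greedy(V):
    """Greedy policy from a one-step lookahead on V."""
    Q = R + GAMMA * P @ V  # (n_states, n_actions) action values
    return Q.argmax(axis=1)


def policy_iteration(policy):
    """Alternate full evaluation and greedy improvement until stable."""
    policy = np.asarray(policy)
    sweeps = 0
    while True:
        V = evaluate(policy)
        new_policy = greedy(V)
        sweeps += 1
        if np.array_equal(new_policy, policy):
            return policy, V, sweeps
        policy = new_policy


def value_iteration(tol=1e-10):
    """Iterate the Bellman optimality backup (max over actions)."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = (R + GAMMA * P @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return greedy(V_new), V_new
        V = V_new


pi_pi, V_pi, sweeps = policy_iteration([0, 0])  # start from "always stay"
pi_vi, V_vi = value_iteration()
print("policy iteration:", pi_pi, V_pi, "sweeps:", sweeps)
print("value iteration: ", pi_vi, V_vi)
```

Under these assumed numbers, both routines should agree on the policy "move in state 0, stay in state 1" with values of roughly 19 and 20. Policy iteration gets there in a couple of exact evaluations, while value iteration takes many cheap backups to reach the same fixed point within the tolerance. Both use P and R directly, which is the model-based assumption that the Phase 3 lessons drop.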