References: Policy iteration

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 3:
Planning by Dynamic Programming (policy iteration section)
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the policy-iteration section of
Silver's Lecture 3 and restates it in Clawdemy's voice with an original
two-state two-action worked example that runs end-to-end (two iterations,
one policy flip, then stable). The explicit policy-improvement theorem
statement, the finitely-many-deterministic-policies termination argument,
the GPI lens, and the planning-vs-learning framing that explicitly defers
the sample-based version to Phase 3 are Clawdemy framing. Value iteration
(the other algorithm in Silver's Lecture 3) is held back to the next lesson.
Exact per-lecture URLs are verified at promotion.

A short, durable list. Both are free.

  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 4 (Dynamic Programming). The standard textbook’s treatment of policy iteration, value iteration, asynchronous DP, and the GPI lens, with the small gridworld example worked carefully.
  • David Silver, UCL RL course, Lecture 4: Model-Free Prediction (within the course above). The bridge into Phase 3: estimating V^pi from samples (Monte Carlo and TD), which is policy evaluation without a model. The evaluate-then-improve template from this lesson reappears in sample form.

Where this sits inside this track.

  • Value functions and the Bellman equations. The previous lesson. Policy iteration is the first algorithm that actually solves the Bellman expectation equation that lesson wrote down.
  • Value iteration. The next lesson. The other major dynamic-programming algorithm: iterate the Bellman optimality equation (max over actions) directly, instead of alternating full policy evaluation with policy improvement.
  • Model-free learning (Phase 3). Lessons 6-8. The same evaluate-then-improve idea reappears, but the value estimation step is done from samples (Monte Carlo, TD) instead of using P and R.
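
For readers who want the evaluate-then-improve template in concrete form before moving on, here is a minimal sketch in Python. The two-state, two-action MDP is hypothetical: its transitions, rewards, and discount factor are illustrative numbers chosen for this sketch, not the lesson's worked example. The names P, R, GAMMA, q_value, evaluate, improve, and policy_iteration are likewise assumptions. Evaluation is done by repeated in-place sweeps of the Bellman expectation backup rather than by solving the linear system exactly.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s][a] is a list of (probability, next_state) pairs.
# R[s][a] is the expected immediate reward for taking action a in state s.
P = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
GAMMA = 0.9  # assumed discount factor


def q_value(s, a, V):
    """One-step lookahead: reward plus discounted expected next-state value."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation with in-place sweeps until values settle."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(s, policy[s], V)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


def improve(V):
    """Greedy improvement: in each state pick the action with the highest Q.

    Ties go to the first action listed; this example has no ties.
    """
    return {s: max(P[s], key=lambda a: q_value(s, a, V)) for s in P}


def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Start from the policy that always takes action 0.
policy, V = policy_iteration({0: 0, 1: 0})
print(policy, V)
```

On this toy MDP the loop stops at the policy that takes action 1 in state 0 and action 0 in state 1, with values close to 19 and 20. The model-free lessons keep the same outer loop but replace the use of P and R inside the evaluation step with estimates from samples.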