References: Value functions and the Bellman equations
Source material
Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lectures 2 and 3: Markov Decision Processes (value functions, Bellman equations) and Planning by Dynamic Programming (the optimality equation)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0

Clawdemy's lessons are original prose that follows the pedagogical arc of this course. We do not embed, reproduce, or transcribe Silver's slides or video lectures; we link out to the relevant lecture as recommended further study. The non-commercial clause aligns with Clawdemy's free, zero-revenue posture. All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson combines the value-function content from the end of Silver's Lecture 2 with the Bellman optimality equation that opens Lecture 3, restated in Clawdemy's voice with original examples (the four-state chain and the explicit cyclic-vs-acyclic distinction). The one-line derivation of the Bellman equation from G_t = r_(t+1) + gamma * G_(t+1), the "V = sum of pi-weighted Q" relation, the "greedy w.r.t. Q^* is optimal" architectural point, and the explicit Phase-2-vs-Phase-3 framing are Clawdemy framing. Lecture 3's algorithms (policy iteration, value iteration) are deliberately held back to the next two lessons rather than crowded into this one. Exact per-lecture URLs are verified at promotion.
Read this next
- David Silver, UCL RL course, Lectures 2 and 3. The lectures this lesson draws from, with the value functions and Bellman equations developed alongside the wider MDP treatment and a smooth transition into the planning algorithms of the next two Track 17 lessons. CC BY-NC 4.0, freely available.
Going deeper
A short, durable list. Both are free.
- Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 3 (Finite MDPs) and Chapter 4 (Dynamic Programming). The standard textbook’s treatment of the same material, with worked gridworld examples and careful coverage of the Bellman operators as contraction mappings.
- David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming (within the course above). Goes from the Bellman equations introduced here to policy iteration and value iteration, which is exactly the Phase 2 material of Track 17 (lessons 4 and 5).
Adjacent topics
Where this leads inside this track.
- Markov Decision Processes. The previous lesson. The MDP gave you the formal object; this lesson adds the value functions and the Bellman equations on top of it.
- Policy iteration. The next lesson and the start of Phase 2. It alternates policy evaluation (iterating the Bellman expectation equation) with greedy policy improvement until it converges to the optimal policy.
- Value iteration. Lesson 5. It iterates the Bellman optimality equation directly as a fixed-point update to converge to V^*, then reads off the policy as argmax_a Q^*(s, a).
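
As a preview of the value iteration idea described in the last item, here is a minimal sketch in Python. The chain below is a hypothetical stand-in, not necessarily the four-state chain used in the lesson: states 0 to 3 with state 3 terminal, two deterministic actions (left and right), a reward of 1.0 for entering state 3, and gamma = 0.9 are all assumptions made for illustration. The helper names (`step`, `q_from_v`, `value_iteration`, `greedy_policy`) are likewise invented here.

```python
# Assumed toy MDP (hypothetical, for illustration only):
# states 0..3, state 3 is terminal; actions 0 = left, 1 = right;
# moves are deterministic; reward 1.0 on entering state 3, else 0.0.
GAMMA = 0.9
N_STATES = 4
ACTIONS = (0, 1)
TERMINAL = 3


def step(s, a):
    """Return (next_state, reward) for the assumed deterministic chain."""
    if s == TERMINAL:
        return s, 0.0  # terminal state loops on itself with zero reward
    s_next = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward


def q_from_v(v, s, a):
    """One-step lookahead: Q(s, a) = r + gamma * V(s')."""
    s_next, r = step(s, a)
    return r + GAMMA * v[s_next]


def value_iteration(tol=1e-10, max_sweeps=1000):
    """Apply the Bellman optimality backup until the values stop changing."""
    v = [0.0] * N_STATES
    for _ in range(max_sweeps):
        new_v = [max(q_from_v(v, s, a) for a in ACTIONS) for s in range(N_STATES)]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


def greedy_policy(v):
    """Read off argmax_a Q(s, a) for every non-terminal state."""
    return [max(ACTIONS, key=lambda a: q_from_v(v, s, a)) for s in range(N_STATES - 1)]


v_star = value_iteration()
policy = greedy_policy(v_star)
print(v_star, policy)
```

Under these assumptions the values settle at roughly 0.81, 0.9, 1.0 and 0.0 for states 0 to 3, and the greedy policy moves right in every non-terminal state. Because this chain is deterministic, the expectation over next states collapses to a single term; a stochastic MDP would need a sum over transition probabilities inside `q_from_v`.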