
References: Value functions and the Bellman equations

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lectures 2 and 3:
Markov Decision Processes (value functions, Bellman equations) and
Planning by Dynamic Programming (the optimality equation)
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson combines the value-function content from the
end of Silver's Lecture 2 with the Bellman optimality equation that opens
Lecture 3, restated in Clawdemy's voice with original examples (the four-state
chain and the explicit cyclic-vs-acyclic distinction). The one-line derivation
of the Bellman equation from G_t = r_(t+1) + gamma * G_(t+1), the "V = sum of
pi-weighted Q" relation, the "greedy w.r.t. Q^* is optimal" architectural
point, and the explicit Phase-2-vs-Phase-3 framing are Clawdemy framing.
Lecture 3's algorithms (policy iteration, value iteration) are deliberately
held back to the next two lessons rather than crowded into this one. Exact
per-lecture URLs are verified at promotion.
  • David Silver, UCL RL course, Lectures 2 and 3. The lectures this lesson draws from. They develop the value functions and Bellman equations alongside the wider MDP treatment and lead into the planning algorithms covered in the next two Track 17 lessons. CC BY-NC 4.0, freely available.

A short, durable list. Both are free.

  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 3 (Finite MDPs) and Chapter 4 (Dynamic Programming). The standard textbook’s treatment of the same material, with worked gridworld examples and careful coverage of the Bellman operators as contraction mappings.
  • David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming (within the course above). Goes from the Bellman equations introduced here to policy iteration and value iteration — exactly the Phase 2 material of Track 17 (lessons 4 and 5).

Where this leads inside this track.

  • Markov Decision Processes. The previous lesson. The MDP gave you the formal object; this lesson adds the value functions and the Bellman equations on top of it.
  • Policy iteration. The next lesson and the start of Phase 2. It iterates the Bellman expectation equation (policy evaluation) with greedy improvement to converge to the optimal policy.
  • Value iteration. Lesson 5. It iterates the Bellman optimality equation directly as a fixed-point update to converge to V^*, then reads off the policy as argmax_a Q^*.
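
As a preview of the value iteration item above, here is a minimal sketch of the fixed-point update on a small deterministic chain. The chain itself is an assumption made for illustration (four states, moves left or right, reward 1 on entering the terminal state, gamma = 0.9) and is not necessarily the four-state chain used in the lesson. The names step, q_from_v, value_iteration, and greedy_policy are hypothetical helpers, not part of any library.

```python
# Illustrative assumptions: a 4-state chain 0-1-2-3 where state 3 is terminal,
# actions move one step left (-1) or right (+1), and entering state 3 pays 1.
GAMMA = 0.9
N_STATES = 4
TERMINAL = 3
ACTIONS = (-1, +1)


def step(s, a):
    """Deterministic model: return (next_state, reward) for action a in state s."""
    s_next = min(max(s + a, 0), N_STATES - 1)  # walls at both ends of the chain
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward


def q_from_v(V, s, a):
    """One-step lookahead: Q(s, a) = r + gamma * V(s')."""
    s_next, r = step(s, a)
    return r + GAMMA * V[s_next]


def value_iteration(tol=1e-10, max_sweeps=1000):
    """Apply the Bellman optimality backup in place until values stop changing."""
    V = [0.0] * N_STATES  # the terminal state keeps value 0
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue
            best = max(q_from_v(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return V


def greedy_policy(V):
    """Read off argmax_a Q(s, a) for every non-terminal state."""
    return {
        s: max(ACTIONS, key=lambda a: q_from_v(V, s, a))
        for s in range(N_STATES)
        if s != TERMINAL
    }


V_star = value_iteration()
policy = greedy_policy(V_star)
```

Under these assumptions the values settle at roughly 0.81, 0.9, and 1.0 for states 0, 1, and 2, and the greedy policy moves right everywhere. Policy iteration would reach the same policy by alternating evaluation of a fixed policy with greedy improvement, rather than taking the max inside every backup.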