References: Value functions and the Bellman equations

Source material

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lectures 2 and 3:
  Markov Decision Processes (value functions, Bellman equations) and
  Planning by Dynamic Programming (the optimality equation)
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause is now consistent with Clawdemy's own CC BY-NC-SA 4.0 license; both forbid commercial use without permission. Commercial use is licensed separately at [/legal/licensing](/legal/licensing/).
All rights to the original materials remain with the author and UCL.

Source-scope note: this lesson combines the value-function content from the
end of Silver's Lecture 2 with the Bellman optimality equation that opens
Lecture 3, restated in Clawdemy's voice with original examples (the four-state
chain and the explicit cyclic-vs-acyclic distinction). The one-line derivation
of the Bellman equation from G_t = r_(t+1) + gamma * G_(t+1), the "V = sum of
pi-weighted Q" relation, the "greedy w.r.t. Q^* is optimal" architectural
point, and the explicit Phase-2-vs-Phase-3 framing are Clawdemy framing.
Lecture 3's algorithms (policy iteration, value iteration) are deliberately
held back to the next two lessons rather than crowded into this one. Exact
per-lecture URLs are verified at promotion.

Read this next

David Silver, UCL RL course, Lectures 2 and 3. These are the lectures this lesson draws from. They develop the value functions and Bellman equations alongside the wider MDP treatment and lead directly into the planning algorithms covered in the next two Track 17 lessons. CC BY-NC 4.0, freely available.

Going deeper

A short, durable list. Both are free.

Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 3 (Finite MDPs) and Chapter 4 (Dynamic Programming). The standard textbook’s treatment of the same material, with worked gridworld examples and careful coverage of the Bellman operators as contraction mappings.
David Silver, UCL RL course, Lecture 3: Planning by Dynamic Programming (within the course above). Goes from the Bellman equations introduced here to policy iteration and value iteration, which is exactly the Phase 2 material of Track 17 (lessons 4 and 5).

Adjacent topics

Where this leads inside this track.

Markov Decision Processes. The previous lesson. The MDP gave you the formal object; this lesson adds the value functions and the Bellman equations on top of it.
Policy iteration. The next lesson and the start of Phase 2. It iterates the Bellman expectation equation (policy evaluation) with greedy improvement to converge to the optimal policy.
Value iteration. Lesson 5. It iterates the Bellman optimality equation directly as a fixed-point update to converge to V^*, then reads off the policy as argmax_a Q^*(s, a).
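
As a preview of the value-iteration idea described above, here is a minimal sketch in Python. The MDP is an assumption made up for illustration (a deterministic four-state chain with a terminal state on the right, reward 1 for entering it, gamma = 0.9); it is not necessarily the chain used in the lesson, and the helper names step, q_from_v, and value_iteration are hypothetical.

```python
# Hypothetical 4-state chain (illustrative only): states 0..3, state 3 is terminal.
# Actions: 0 = left, 1 = right. Transitions are deterministic.
GAMMA = 0.9
N_STATES = 4
TERMINAL = 3
ACTIONS = (0, 1)


def step(s, a):
    """Return (next_state, reward) for taking action a in state s."""
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == TERMINAL else 0.0
    return s2, reward


def q_from_v(v, s, a):
    """One-step lookahead: Q(s, a) = r + gamma * V(s')."""
    s2, r = step(s, a)
    return r + GAMMA * v[s2]


def value_iteration(tol=1e-10, max_sweeps=1000):
    """Apply the Bellman optimality update until the values stop changing."""
    v = [0.0] * N_STATES  # the terminal state keeps value 0
    for _ in range(max_sweeps):
        new_v = list(v)
        for s in range(N_STATES):
            if s != TERMINAL:
                new_v[s] = max(q_from_v(v, s, a) for a in ACTIONS)
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    # Read off the greedy policy: argmax over actions of Q(s, a).
    policy = {
        s: max(ACTIONS, key=lambda a: q_from_v(v, s, a))
        for s in range(N_STATES)
        if s != TERMINAL
    }
    return v, policy


if __name__ == "__main__":
    values, greedy = value_iteration()
    print(values)
    print(greedy)
```

In this toy chain the values settle after a few sweeps and the greedy policy moves right in every non-terminal state. Lesson 5 covers the general algorithm and why the update converges.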