Value functions and Bellman: brief

What you’ll learn

This is lesson 3 of Track 17 (Reinforcement Learning Foundations) and the close of Phase 1 (The RL setup). The previous lesson gave you the MDP, the formal object. This lesson adds the value functions V and Q (how good is a state? how good is an action from a state?) and the Bellman equations that they satisfy, a one-step recursion that is the single most important mathematical object in reinforcement learning. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.

The lesson defines V and Q as expected returns, derives the Bellman expectation equation in one line from G_t = r_(t+1) + gamma * G_(t+1), shows the recursion at work on a four-state chain (V at the end is given, the rest fall out by one applied recursion each), explains why a cyclic MDP makes the equation a fixed-point system rather than a one-shot calculation, writes the Bellman optimality equation (max over actions, not sum under pi), and closes on the architectural payoff: the optimal policy is greedy with respect to Q^*, which is why so much of RL is, really, methods for estimating Q.

Where this fits

This is lesson 3 of 10 and the last lesson of Phase 1. The previous two lessons set up the RL problem (lesson 1, informal; lesson 2, formal); this lesson adds the value functions and the Bellman equations on top of the MDP. Phase 2 (lessons 4 and 5) iterates these equations directly when the model is known. Phase 3 (lessons 6-8) approximates them from samples when the model is not. The recursion in this lesson is what every later method is solving or estimating, in different forms.

Before you start

Prerequisites: the previous lesson (Markov Decision Processes) for the MDP tuple and the discounted return. Comfort with expectations and conditional expectations (Track 9’s random-variable material is sufficient), and reading sums over actions and over next states (sum_a, sum_s’). No calculus required for this lesson; matrix-style thinking helps but is not assumed.

About the math

This is the most notation-heavy lesson of Phase 1, but every formula is built up step by step from G_t = r_(t+1) + gamma * G_(t+1). The chain example uses small integer arithmetic; the cyclic 2x2 system is written down rather than solved (the algorithms that solve it are the next two lessons). The Bellman optimality equation differs from the expectation equation in exactly one place (max instead of sum), and the lesson highlights that single difference.

By the end, you’ll be able to

Define the state-value V(s) and the action-value Q(s, a) as expected returns under a policy
Write the Bellman expectation equation for V and Q and see it as a one-step recursion
Derive the recursion from G_t = r_(t+1) + gamma * G_(t+1)
Write the Bellman optimality equation and distinguish max-over-actions from the policy-weighted sum
State why an optimal policy is greedy with respect to Q^* and why that makes value-based methods possible

Time and difficulty

Read time: about 13 minutes
Practice time: about 16 minutes (a self-check, a chain-recursion computation, a write-the-Bellman-equation-and-find-greedy exercise, and flashcards)
Difficulty: standard (notation-heavy by Phase-1 standards, but every symbol is grounded in a small worked example; this is the last lesson before the planning algorithms of Phase 2)