Summary: Value functions and the Bellman equations

Value functions say how good a state or action is; the Bellman equations say that value is recursive: the value of where you are equals one step of reward plus the discounted value of where you land. Every later method in the track either solves these equations directly (Phase 2 planning) or approximates them from samples (Phase 3 learning). This summary is the scan-in-five-minutes version of the full lesson.

  • Two value functions. V^pi(s) is the expected discounted return from state s under policy pi: E_pi[G_t | s_t = s]. Q^pi(s, a) is the same with the first action pinned. They are linked by V^pi(s) = sum_a pi(a | s) * Q^pi(s, a).
  • Why both. V tells you how good a state is; Q tells you which action is best (compare Q-values). Without a model you cannot act on V (no lookahead); Q has the action choice baked in, which is why model-free methods learn Q.
  • The Bellman expectation equation. Derived in one line from G_t = r_(t+1) + gamma * G_(t+1) by taking expectations. The result for V: V^pi(s) = sum_a pi(a | s) * [R(s, a) + gamma * sum_s’ P(s’ | s, a) * V^pi(s’)]. A one-step recursion: value here equals immediate reward plus discounted expected value at the next state, averaged over the policy's action choices and the environment's transitions.
  • The recursion in action. On a 3-state chain A, B, C that ends in D, with the end value known (V(D) = 5, gamma = 0.8, rewards along the way), applying the recursion one step at a time, working backward from D, gives V(C) = 8, V(B) = 7.4, V(A) = 7.92. With cycles, V appears on both sides; the equation becomes a fixed-point problem that needs iteration (Phase 2).
  • The Bellman optimality equation. Replace the policy-weighted sum with a max over actions: V^*(s) = max_a [R(s, a) + gamma * sum_s’ P(s’ | s, a) * V^*(s’)] (and similarly for Q^*). It defines the best you can do over any policy.
  • Greedy is optimal. Once you have Q^*, the optimal policy is just pi^*(s) = argmax_a Q^*(s, a). The policy falls out of the value; you do not need a separate policy network. This is why value-based methods are powerful.
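
The chain example in the list above can be checked in a few lines. This is a minimal sketch, not the lesson's own code: the summary states only V(D) = 5, gamma = 0.8 and the resulting values, so the per-step rewards below (2, 1, 4) are an assumption, inferred so that the backups reproduce the stated numbers. The names `reward` and `next_state` are illustrative.

```python
GAMMA = 0.8

# Deterministic chain A -> B -> C -> D with a single action per state.
# ASSUMPTION: the summary does not list the step rewards; these values are
# inferred so that the backups match V(C) = 8, V(B) = 7.4, V(A) = 7.92.
reward = {"A": 2.0, "B": 1.0, "C": 4.0}
next_state = {"A": "B", "B": "C", "C": "D"}

# The value at the end of the chain is given.
V = {"D": 5.0}

# Apply V(s) = r(s) + gamma * V(s') backward from the known end.
# With one action and one successor, the sums over a and s' collapse.
for s in ["C", "B", "A"]:
    V[s] = reward[s] + GAMMA * V[next_state[s]]

print({s: round(v, 2) for s, v in V.items()})
# {'D': 5.0, 'C': 8.0, 'B': 7.4, 'A': 7.92}
```

Because the chain has no cycles, each value depends only on a value already computed, so a single backward pass is enough.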

You have the one mathematical object the rest of the track is about. Phase 2 (lessons 4-5) will iterate these equations directly when the model is known; Phase 3 (lessons 6-8) will estimate the same Bellman target from samples when it is not. Function approximation (lesson 9) replaces the table with a neural network minimizing the Bellman residual; policy gradient (lesson 10) is a related-but-different recursion. Whenever you read “the loss is a Bellman residual” or “we used a TD(0) target,” it is this lesson’s recursion at work. The most actionable takeaway is the architectural one: solve for Q^*, and the policy is argmax — which is why so much of RL is, really, methods for estimating Q.
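
To make "solve for Q^*, and the policy is argmax" concrete, here is a small sketch that repeatedly applies the Bellman optimality backup for Q and then reads off the greedy policy. The two-state, two-action MDP, the discount of 0.5, and the names `R`, `P`, and `bellman_optimality_backup` are all hypothetical, chosen only for illustration; they do not come from the lesson.

```python
GAMMA = 0.5  # assumed discount for this toy example

# HYPOTHETICAL MDP (not from the lesson): 2 states, 2 actions, deterministic.
# R[s][a] is the immediate reward; P[s][a] lists (probability, next_state).
R = [[0.0, 1.0],   # state 0: action 0 stays (reward 0), action 1 moves to 1 (reward 1)
     [2.0, 0.0]]   # state 1: action 0 stays (reward 2), action 1 moves to 0 (reward 0)
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 0)]],
]


def bellman_optimality_backup(Q):
    """One sweep of Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    return [
        [
            R[s][a] + GAMMA * sum(p * max(Q[s2]) for p, s2 in P[s][a])
            for a in range(len(R[s]))
        ]
        for s in range(len(R))
    ]


# Start from zeros and iterate; the cycles in this MDP are why one pass is not enough.
Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(100):
    Q = bellman_optimality_backup(Q)

# Bellman residual: how far Q is from satisfying the optimality equation.
backed_up = bellman_optimality_backup(Q)
residual = max(
    abs(backed_up[s][a] - Q[s][a]) for s in range(len(Q)) for a in range(len(Q[s]))
)

# Greedy policy: pi*(s) = argmax_a Q*(s, a).
greedy_policy = [max(range(len(q)), key=q.__getitem__) for q in Q]

print([[round(x, 6) for x in row] for row in Q])  # [[1.5, 3.0], [4.0, 1.5]]
print(greedy_policy)                              # [1, 0]
```

The residual computed here is the same quantity that a learned Q-function would be trained to shrink; the tabular version simply drives it to zero by iteration because the model (`R` and `P`) is known.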