Practice: Value functions and the Bellman equations

Two skills to build: writing the Bellman equation for a specific MDP (which forces you to track sums over actions and over next states) and using the recursion either forward (chain example) or as a fixed-point (cyclic case). Keep a scratchpad.

Self-check

Six short questions. Answer each in your head before opening the collapsible.

1. Define V^pi(s) and Q^pi(s, a), and write the relation between them.

Show answer

V^pi(s) is the expected discounted return from state s under policy pi: E_pi[G_t | s_t = s]. Q^pi(s, a) is the same but with the first action pinned to a, following pi thereafter: E_pi[G_t | s_t = s, a_t = a]. They are linked by V^pi(s) = sum_a pi(a | s) * Q^pi(s, a) — V is the policy-weighted average of Q over actions.

2. Why does control usually need Q, not just V?

Show answer

Because V tells you how good a state is but not which action to take from it without looking ahead through the model. Q has the action choice baked in: to pick an action, compare Q(s, a) across actions. In model-free settings (no P available), Q is what you can act on directly; V is not.

3. Derive the Bellman expectation equation for V in one line.

Show answer

Start with G_t = r_(t+1) + gamma * G_(t+1) (the return splits into one immediate reward plus the discounted rest). Take E_pi of both sides conditional on s_t = s. The left becomes V^pi(s); the right becomes the expected r_(t+1) plus gamma times the expected V^pi at the next state, averaged over actions under pi and next states under P:

V^pi(s) = sum_a pi(a | s) * [R(s, a) + gamma * sum_s’ P(s’ | s, a) * V^pi(s’)].

4. What is the difference between the Bellman EXPECTATION and OPTIMALITY equations?

Show answer

The expectation equation uses a sum over actions weighted by a given policy pi (averaging the actions pi might take). The optimality equation uses a max over actions (taking the best one). The expectation equation evaluates V (or Q) under a particular policy; the optimality equation defines V^* (or Q^*) — the best you could do over any policy.

5. Why is the optimal policy greedy with respect to Q^*?

Show answer

Because Q^*(s, a) is the best total reward you can achieve starting with action a in s. So at every state s, picking argmax_a Q^*(s, a) is by definition the best choice. Acting greedily with respect to Q^* requires no separate policy network; the policy falls out of the value, which is why value-based methods are powerful.

6. In a cyclic MDP, why can’t you compute V from the Bellman expectation equation by direct substitution?

Show answer

Because V appears on both sides of the equation (the next state may be the same as the current state, or reachable through a cycle), so direct substitution recurses forever. The Bellman equation in the cyclic case is a fixed-point equation; you need a linear-system solve or an iterative method (Phase 2’s policy and value iteration) to find V.

Try it yourself: compute the chain recursion

A four-state chain A -> B -> C -> D under a forced policy (one action each), deterministic transitions, gamma = 0.8. The value V(D) = 5 is given. Rewards R(C) = 4, R(B) = 1, R(A) = 2.

Compute V(C), V(B), V(A) using V(s) = R(s) + gamma * V(next state).

Show answer

V(C) = R(C) + gamma * V(D) = 4 + 0.8 * 5  = 4 + 4   = 8
V(B) = R(B) + gamma * V(C) = 1 + 0.8 * 8  = 1 + 6.4 = 7.4
V(A) = R(A) + gamma * V(B) = 2 + 0.8 * 7.4 = 2 + 5.92 = 7.92

The chain is the easy case: value at the end is known, and each upstream value computes from one applied recursion. No cycles, no fixed point, no iteration needed. The cyclic case (next exercise) is where iteration matters.

Try it yourself: write the Bellman equation + greedy from Q^*

Part A. A new 2-state MDP with one action “work” from each state:

States: {Calm, Busy}
From Calm,  action "work": -> Calm with prob 0.7, -> Busy with prob 0.3.  R = +2
From Busy,  action "work": -> Calm with prob 0.4, -> Busy with prob 0.6.  R = -3
gamma = 0.9

Write the Bellman expectation equation for V(Calm) and V(Busy) under this (single, forced) policy. You do not need to solve the system; just write it down.

Part B. Suppose for some state s you are told Q^(s, a1) = 5, Q^(s, a2) = 7, Q^*(s, a3) = 6. What is the optimal action at s?

Show answer

Part A.

V(Calm) = 2  + 0.9 * [ 0.7 * V(Calm) + 0.3 * V(Busy) ]
V(Busy) = -3 + 0.9 * [ 0.4 * V(Calm) + 0.6 * V(Busy) ]

Two equations in two unknowns: V appears on both sides (cyclic), so this is a fixed-point system, not a one-shot calculation. Phase 2 develops the iterative methods that converge to (V(Calm), V(Busy)) at scale.

Part B. Greedy with respect to Q^: pick the action with the largest Q^. The values are 5, 7, 6, so the optimum is a2 (which has Q^* = 7). That is the entire policy-extraction step in a value-based method: argmax over actions.

Flashcards

Eight cards. Click any card to reveal the answer. Use the Print flashcards button to lay out the full set as one card per page for offline review.

Q. Define V^pi(s) and Q^pi(s, a).

V^pi(s) = expected discounted return from s under pi. Q^pi(s, a) = expected discounted return from s after first taking a (then following pi). Linked by V^pi(s) = sum_a pi(a|s) * Q^pi(s, a).

Q. Why does model-free control need Q rather than V?

V tells you how good a state is but not which action; to choose an action from V you would need to look ahead through P. Q has the action baked in (compare Q(s, a) across actions), so it lets you act without a model.

Q. Write the Bellman expectation equation for V^pi(s).

V^pi(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_s’ P(s’|s, a) * V^pi(s’) ]. A one-step recursion: value here = immediate reward + discounted expected value at the next state, averaged over actions under pi.

Q. How is the Bellman equation derived in one line?

From G_t = r_(t+1) + gamma * G_(t+1). Take E_pi conditional on s_t = s; the left becomes V^pi(s); the right becomes the expected r_(t+1) plus gamma * expected V^pi(s_(t+1)), averaged over actions and next states.

Q. What is the difference between the Bellman expectation and optimality equations?

Expectation: sum over actions weighted by a given policy pi. Optimality: max over actions. Expectation evaluates V under a specific policy; optimality defines V^* (the best over any policy).

Q. Write the Bellman optimality equation for Q^*(s, a).

Q^(s, a) = R(s, a) + gamma * sum_s’ P(s’|s, a) * max_a’ Q^(s’, a’). The reward plus the discounted optimal value at the next state, where ‘optimal at next state’ means picking the best next action.

Q. What is the optimal policy given Q^*?

Greedy: pi^(s) = argmax_a Q^(s, a). The action with the largest Q^* at each state. The policy falls out of the value; no separate policy is needed.

Q. Why is the cyclic Bellman equation a fixed-point equation, not a one-shot calculation?

Because V appears on both sides (the next state may be reachable back to the current one via a cycle), so direct substitution recurses forever. The solution is a fixed point you reach by iteration (Phase 2) or a linear-system solve.