Cheatsheet: Value functions and the Bellman equations
The one idea
Section titled “The one idea”Value functions say how good things are; the Bellman equations say value is recursive. Optimal policy = argmax over actions of Q^*.
V and Q
Section titled “V and Q”V^pi(s) = E_pi [ G_t | s_t = s ] "how good is this state?"Q^pi(s, a) = E_pi [ G_t | s_t = s, a_t = a ] "how good is this (s, a)?"V^pi(s) = sum over a of pi(a | s) * Q^pi(s, a) (V = policy-weighted average of Q)
Need Q (not V) for model-free control: V tells you which states are good butnot which action; Q has the action baked in.Bellman EXPECTATION (under a given policy pi)
Section titled “Bellman EXPECTATION (under a given policy pi)”V^pi(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_s' P(s'|s, a) * V^pi(s') ]Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * sum_a' pi(a'|s') * Q^pi(s', a')
Derivation (one line): G_t = r_(t+1) + gamma * G_(t+1); take E_pi of both sides.Bellman OPTIMALITY (best over any policy)
Section titled “Bellman OPTIMALITY (best over any policy)”V^*(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * V^*(s') ]Q^*(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q^*(s', a')
Expectation -> sum over actions weighted by pi.Optimality -> MAX over actions.The recursion at work
Section titled “The recursion at work”Chain (deterministic, no cycles), V(D) = 5 given, gamma = 0.8: V(C) = R(C) + 0.8 * V(D) = 4 + 0.8*5 = 8 V(B) = R(B) + 0.8 * V(C) = 1 + 0.8*8 = 7.4 V(A) = R(A) + 0.8 * V(B) = 2 + 0.8*7.4 = 7.92
Cyclic (H/S MDP, single action, gamma = 0.9): V(H) = 1 + 0.9 * [ 0.8 V(H) + 0.2 V(S) ] V(S) = -1 + 0.9 * [ 0.5 V(H) + 0.5 V(S) ] V appears on both sides => fixed-point system; iterate (Phase 2) or linear-solve.Greedy from Q^*
Section titled “Greedy from Q^*”pi^*(s) = argmax over a of Q^*(s, a)=> once Q^* is in hand, the optimal policy is trivial. (Why value-based methods work.)Pitfalls to dodge
Section titled “Pitfalls to dodge”- Confusing expectation (sum under pi) with optimality (max over actions).
- Acting on V without a model (need Q for action choice without lookahead).
- Treating the cyclic Bellman equation as a one-shot substitution (it’s a fixed point).
- Off-by-one on reward subscripts: a_t in s_t produces r_(t+1), s_(t+1).
- Believing greedy w.r.t. ESTIMATED Q during learning is safe (need exploration; lesson 8).
Words to use precisely
Section titled “Words to use precisely”- V(s) / Q(s, a): state-value / action-value — expected returns.
- Bellman expectation equation: V or Q under a specific policy, as a recursion.
- Bellman optimality equation: V^* or Q^* as a recursion using max over actions.
- Fixed-point equation: V on both sides; solved by iteration or linear algebra.
- Greedy policy: pi(s) = argmax_a Q(s, a); optimal when Q = Q^*.