Skip to content

Cheatsheet: Value functions and the Bellman equations

Value functions say how good things are; the Bellman equations say value is recursive. Optimal policy = argmax over actions of Q^*.

V^pi(s) = E_pi [ G_t | s_t = s ] "how good is this state?"
Q^pi(s, a) = E_pi [ G_t | s_t = s, a_t = a ] "how good is this (s, a)?"
V^pi(s) = sum over a of pi(a | s) * Q^pi(s, a) (V = policy-weighted average of Q)
Need Q (not V) for model-free control: V tells you which states are good but
not which action; Q has the action baked in.

Bellman EXPECTATION (under a given policy pi)

Section titled “Bellman EXPECTATION (under a given policy pi)”
V^pi(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_s' P(s'|s, a) * V^pi(s') ]
Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * sum_a' pi(a'|s') * Q^pi(s', a')
Derivation (one line): G_t = r_(t+1) + gamma * G_(t+1); take E_pi of both sides.
V^*(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * V^*(s') ]
Q^*(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q^*(s', a')
Expectation -> sum over actions weighted by pi.
Optimality -> MAX over actions.
Chain (deterministic, no cycles), V(D) = 5 given, gamma = 0.8:
V(C) = R(C) + 0.8 * V(D) = 4 + 0.8*5 = 8
V(B) = R(B) + 0.8 * V(C) = 1 + 0.8*8 = 7.4
V(A) = R(A) + 0.8 * V(B) = 2 + 0.8*7.4 = 7.92
Cyclic (H/S MDP, single action, gamma = 0.9):
V(H) = 1 + 0.9 * [ 0.8 V(H) + 0.2 V(S) ]
V(S) = -1 + 0.9 * [ 0.5 V(H) + 0.5 V(S) ]
V appears on both sides => fixed-point system; iterate (Phase 2) or linear-solve.
pi^*(s) = argmax over a of Q^*(s, a)
=> once Q^* is in hand, the optimal policy is trivial. (Why value-based methods work.)
  • Confusing expectation (sum under pi) with optimality (max over actions).
  • Acting on V without a model (need Q for action choice without lookahead).
  • Treating the cyclic Bellman equation as a one-shot substitution (it’s a fixed point).
  • Off-by-one on reward subscripts: a_t in s_t produces r_(t+1), s_(t+1).
  • Believing greedy w.r.t. ESTIMATED Q during learning is safe (need exploration; lesson 8).
  • V(s) / Q(s, a): state-value / action-value — expected returns.
  • Bellman expectation equation: V or Q under a specific policy, as a recursion.
  • Bellman optimality equation: V^* or Q^* as a recursion using max over actions.
  • Fixed-point equation: V on both sides; solved by iteration or linear algebra.
  • Greedy policy: pi(s) = argmax_a Q(s, a); optimal when Q = Q^*.