Value functions and Bellman: cheatsheet

The one idea

Value functions say how good things are; the Bellman equations say value is recursive. Optimal policy = argmax over actions of Q^*.

V and Q

V^pi(s)    = E_pi [ G_t  |  s_t = s ]                   "how good is this state?"
Q^pi(s, a) = E_pi [ G_t  |  s_t = s, a_t = a ]          "how good is this (s, a)?"
V^pi(s)    = sum over a of  pi(a | s) * Q^pi(s, a)      (V = policy-weighted average of Q)

Need Q (not V) for model-free control: V tells you which states are good but
not which action; Q has the action baked in.

Bellman EXPECTATION (under a given policy pi)

V^pi(s)    = sum_a pi(a|s) * [ R(s, a)  +  gamma * sum_s'  P(s'|s, a) * V^pi(s') ]
Q^pi(s, a) = R(s, a)  +  gamma * sum_s'  P(s'|s, a) * sum_a' pi(a'|s') * Q^pi(s', a')

Derivation (one line): G_t = r_(t+1) + gamma * G_(t+1); take E_pi of both sides.

Bellman OPTIMALITY (best over any policy)

V^*(s)    = max_a   [ R(s, a)  +  gamma * sum_s'  P(s'|s, a) * V^*(s') ]
Q^*(s, a) = R(s, a) +  gamma * sum_s'  P(s'|s, a) * max_a' Q^*(s', a')

Expectation -> sum over actions weighted by pi.
Optimality  -> MAX over actions.

The recursion at work

Chain (deterministic, no cycles), V(D) = 5 given, gamma = 0.8:
  V(C) = R(C) + 0.8 * V(D) = 4 + 0.8*5  = 8
  V(B) = R(B) + 0.8 * V(C) = 1 + 0.8*8  = 7.4
  V(A) = R(A) + 0.8 * V(B) = 2 + 0.8*7.4 = 7.92

Cyclic (H/S MDP, single action, gamma = 0.9):
  V(H) = 1  + 0.9 * [ 0.8 V(H) + 0.2 V(S) ]
  V(S) = -1 + 0.9 * [ 0.5 V(H) + 0.5 V(S) ]
  V appears on both sides => fixed-point system; iterate (Phase 2) or linear-solve.

Greedy from Q^*

pi^*(s) = argmax over a of  Q^*(s, a)
=> once Q^* is in hand, the optimal policy is trivial. (Why value-based methods work.)

Pitfalls to dodge

Confusing expectation (sum under pi) with optimality (max over actions).
Acting on V without a model (need Q for action choice without lookahead).
Treating the cyclic Bellman equation as a one-shot substitution (it’s a fixed point).
Off-by-one on reward subscripts: a_t in s_t produces r_(t+1), s_(t+1).
Believing greedy w.r.t. ESTIMATED Q during learning is safe (need exploration; lesson 8).

Words to use precisely

V(s) / Q(s, a): state-value / action-value — expected returns.
Bellman expectation equation: V or Q under a specific policy, as a recursion.
Bellman optimality equation: V^* or Q^* as a recursion using max over actions.
Fixed-point equation: V on both sides; solved by iteration or linear algebra.
Greedy policy: pi(s) = argmax_a Q(s, a); optimal when Q = Q^*.