Skip to content

Cheatsheet: RL fundamentals (MDPs, value functions, Bellman)

(S, A, P, R, γ):

SymbolWhat it is
Sset of states
Aset of actions
`P(s’s, a)`
R(s, a)reward for action a in state s
γ ∈ [0, 1)discount factor
P(s_(t+1), r_t | s_t, a_t, s_(t-1), a_(t-1), ...) = P(s_(t+1), r_t | s_t, a_t)

Future depends only on current (s, a). It is a property of your state representation, not of the world (one Atari frame: not Markov; four stacked frames: Markov).

  • Deterministic: a = π(s). Stochastic: π(a | s).
  • In deep RL: π_θ parameterized by a neural net.
  • Objective: maximize expected discounted return E[G_0] = E[Σ γ^t r_t].
ObjectDefinitionReads as
V^π(s)`E[G_ts_t = s, follow π]`
Q^π(s, a)`E[G_ts_t = s, a_t = a, follow π]`
A^π(s, a)Q^π(s, a) - V^π(s)how much better is a than π’s average from s

Optimal: V*(s) = max_π V^π(s), Q*(s, a) = max_π Q^π(s, a). Optimal policy is greedy: π*(s) = argmax_a Q*(s, a).

Q^π(s, a) = R(s, a) + γ · Σ over s' of P(s' | s, a) · V^π(s')
V^π(s) = Σ over a of π(a | s) · Q^π(s, a)
= Σ over a of π(a | s) · [ R(s, a) + γ · Σ over s' P(s' | s, a) · V^π(s') ]

For a finite MDP and fixed π, this is a linear system in the V^π(s). Solve it.

V*(s) = max over a [ R(s, a) + γ · Σ over s' P(s' | s, a) · V*(s') ]
Q*(s, a) = R(s, a) + γ · Σ over s' P(s' | s, a) · max over a' of Q*(s', a')

The max makes this non-linear (no closed form). Solved by value iteration / policy iteration in classical RL; by Q-learning + neural nets in deep RL (lessons 6-7).

s0 --go--> s1 reward 1
s1 --go--> s0 reward 0 (one action available)

Bellman:

V(s0) = 1 + γ · V(s1)
V(s1) = 0 + γ · V(s0)
=> V(s0) = 1 + γ² · V(s0) => V(s0) = 1 / (1 - γ²) = 1 / 0.19 ≈ 5.263
=> V(s1) = γ · V(s0) ≈ 4.737

Geometric check: from s0, rewards are 1, 0, 1, 0, ...; discounted sum 1 + γ² + γ⁴ + ... = 1 / (1 - γ²) ✓.

Every method in the rest of the track is estimating one of these objects:

FamilyEstimatesMethod
Policy gradient (L4, L5, L8)π_θgradient ascent on expected return
Value-based (L6, L7)Q_θminimize Bellman error, act greedily
Actor-critic (L5)π_θ and V_φ / Q_φcritic provides lower-variance training signal for actor
Advanced PG (L8 TRPO/PPO)π_θ using A^π = Q^π - V^πadvantage as variance-reduced signal
Model-based (L9, L10)`P(s’s, a)`

When you read a deep-RL paper: which of π, V, Q, A, P is it estimating, and how is it improving the estimate?

  • V vs Q. V^π(s) is keyed by state alone; Q^π(s, a) is keyed by state-action. They answer different questions.
  • Markov assumption is on the state. If your state omits relevant context (one Atari frame, no history), it is not Markov. Add the context.
  • V*, Q* are targets, not estimates. They are defined by the MDP; the agent’s V_θ, Q_θ are what change as it learns.
  • Bellman is a constraint, not a definition. Value is defined as the expected return; Bellman is the recursion value automatically satisfies.

An MDP is (S, A, P, R, γ) with the Markov property; the value functions V^π, Q^π, A^π satisfy the Bellman equation; every algorithm in this track estimates one of π, V, Q, A, or P from data.