Cheatsheet: RL fundamentals (MDPs, value functions, Bellman)
The MDP tuple
Section titled “The MDP tuple”(S, A, P, R, γ):
| Symbol | What it is |
|---|---|
S | set of states |
A | set of actions |
| `P(s’ | s, a)` |
R(s, a) | reward for action a in state s |
γ ∈ [0, 1) | discount factor |
The Markov property
Section titled “The Markov property”P(s_(t+1), r_t | s_t, a_t, s_(t-1), a_(t-1), ...) = P(s_(t+1), r_t | s_t, a_t)Future depends only on current (s, a). It is a property of your state representation, not of the world (one Atari frame: not Markov; four stacked frames: Markov).
The policy
Section titled “The policy”- Deterministic:
a = π(s). Stochastic:π(a | s). - In deep RL:
π_θparameterized by a neural net. - Objective: maximize expected discounted return
E[G_0] = E[Σ γ^t r_t].
Value functions
Section titled “Value functions”| Object | Definition | Reads as |
|---|---|---|
V^π(s) | `E[G_t | s_t = s, follow π]` |
Q^π(s, a) | `E[G_t | s_t = s, a_t = a, follow π]` |
A^π(s, a) | Q^π(s, a) - V^π(s) | how much better is a than π’s average from s |
Optimal: V*(s) = max_π V^π(s), Q*(s, a) = max_π Q^π(s, a). Optimal policy is greedy: π*(s) = argmax_a Q*(s, a).
The Bellman equation (for a fixed policy)
Section titled “The Bellman equation (for a fixed policy)”Q^π(s, a) = R(s, a) + γ · Σ over s' of P(s' | s, a) · V^π(s')V^π(s) = Σ over a of π(a | s) · Q^π(s, a) = Σ over a of π(a | s) · [ R(s, a) + γ · Σ over s' P(s' | s, a) · V^π(s') ]For a finite MDP and fixed π, this is a linear system in the V^π(s). Solve it.
Bellman optimality (for V*, Q*)
Section titled “Bellman optimality (for V*, Q*)”V*(s) = max over a [ R(s, a) + γ · Σ over s' P(s' | s, a) · V*(s') ]Q*(s, a) = R(s, a) + γ · Σ over s' P(s' | s, a) · max over a' of Q*(s', a')The max makes this non-linear (no closed form). Solved by value iteration / policy iteration in classical RL; by Q-learning + neural nets in deep RL (lessons 6-7).
Worked: a 2-state loop, γ = 0.9
Section titled “Worked: a 2-state loop, γ = 0.9”s0 --go--> s1 reward 1s1 --go--> s0 reward 0 (one action available)Bellman:
V(s0) = 1 + γ · V(s1)V(s1) = 0 + γ · V(s0)=> V(s0) = 1 + γ² · V(s0) => V(s0) = 1 / (1 - γ²) = 1 / 0.19 ≈ 5.263=> V(s1) = γ · V(s0) ≈ 4.737Geometric check: from s0, rewards are 1, 0, 1, 0, ...; discounted sum 1 + γ² + γ⁴ + ... = 1 / (1 - γ²) ✓.
The deep-RL dispatch table
Section titled “The deep-RL dispatch table”Every method in the rest of the track is estimating one of these objects:
| Family | Estimates | Method |
|---|---|---|
| Policy gradient (L4, L5, L8) | π_θ | gradient ascent on expected return |
| Value-based (L6, L7) | Q_θ | minimize Bellman error, act greedily |
| Actor-critic (L5) | π_θ and V_φ / Q_φ | critic provides lower-variance training signal for actor |
| Advanced PG (L8 TRPO/PPO) | π_θ using A^π = Q^π - V^π | advantage as variance-reduced signal |
| Model-based (L9, L10) | `P(s’ | s, a)` |
When you read a deep-RL paper: which of π, V, Q, A, P is it estimating, and how is it improving the estimate?
Pitfalls to dodge
Section titled “Pitfalls to dodge”VvsQ.V^π(s)is keyed by state alone;Q^π(s, a)is keyed by state-action. They answer different questions.- Markov assumption is on the state. If your state omits relevant context (one Atari frame, no history), it is not Markov. Add the context.
V*,Q*are targets, not estimates. They are defined by the MDP; the agent’sV_θ,Q_θare what change as it learns.- Bellman is a constraint, not a definition. Value is defined as the expected return; Bellman is the recursion value automatically satisfies.
The one-line version
Section titled “The one-line version”An MDP is (S, A, P, R, γ) with the Markov property; the value functions V^π, Q^π, A^π satisfy the Bellman equation; every algorithm in this track estimates one of π, V, Q, A, or P from data.