RL fundamentals: cheatsheet

The MDP tuple

(S, A, P, R, γ):

Symbol	What it is
`S`	set of states
`A`	set of actions
`P(s’	s, a)`
`R(s, a)`	reward for action `a` in state `s`
`γ ∈ [0, 1)`	discount factor

The Markov property

P(s_(t+1), r_t | s_t, a_t, s_(t-1), a_(t-1), ...) = P(s_(t+1), r_t | s_t, a_t)

Future depends only on current (s, a). It is a property of your state representation, not of the world (one Atari frame: not Markov; four stacked frames: Markov).

The policy

Deterministic: a = π(s). Stochastic: π(a | s).
In deep RL: π_θ parameterized by a neural net.
Objective: maximize expected discounted return E[G_0] = E[Σ γ^t r_t].

Value functions

Object	Definition	Reads as
`V^π(s)`	`E[G_t	s_t = s, follow π]`
`Q^π(s, a)`	`E[G_t	s_t = s, a_t = a, follow π]`
`A^π(s, a)`	`Q^π(s, a) - V^π(s)`	how much better is `a` than `π`’s average from `s`

Optimal: V*(s) = max_π V^π(s), Q*(s, a) = max_π Q^π(s, a). Optimal policy is greedy: π*(s) = argmax_a Q*(s, a).

The Bellman equation (for a fixed policy)

Q^π(s, a) = R(s, a) + γ · Σ over s' of P(s' | s, a) · V^π(s')
V^π(s)   = Σ over a of π(a | s) · Q^π(s, a)
           = Σ over a of π(a | s) · [ R(s, a) + γ · Σ over s' P(s' | s, a) · V^π(s') ]

For a finite MDP and fixed π, this is a linear system in the V^π(s). Solve it.

Bellman optimality (for V, Q)

V*(s)    = max over a [ R(s, a) + γ · Σ over s' P(s' | s, a) · V*(s') ]
Q*(s, a) = R(s, a) + γ · Σ over s' P(s' | s, a) · max over a' of Q*(s', a')

The max makes this non-linear (no closed form). Solved by value iteration / policy iteration in classical RL; by Q-learning + neural nets in deep RL (lessons 6-7).

Worked: a 2-state loop, γ = 0.9

s0 --go--> s1   reward 1
s1 --go--> s0   reward 0    (one action available)

Bellman:

V(s0) = 1 + γ · V(s1)
V(s1) = 0 + γ · V(s0)
=> V(s0) = 1 + γ² · V(s0)  =>  V(s0) = 1 / (1 - γ²) = 1 / 0.19 ≈ 5.263
=> V(s1) = γ · V(s0) ≈ 4.737

Geometric check: from s0, rewards are 1, 0, 1, 0, ...; discounted sum 1 + γ² + γ⁴ + ... = 1 / (1 - γ²) ✓.

The deep-RL dispatch table

Every method in the rest of the track is estimating one of these objects:

Family	Estimates	Method
Policy gradient (L4, L5, L8)	`π_θ`	gradient ascent on expected return
Value-based (L6, L7)	`Q_θ`	minimize Bellman error, act greedily
Actor-critic (L5)	`π_θ` and `V_φ` / `Q_φ`	critic provides lower-variance training signal for actor
Advanced PG (L8 TRPO/PPO)	`π_θ` using `A^π = Q^π - V^π`	advantage as variance-reduced signal
Model-based (L9, L10)	`P(s’	s, a)`

When you read a deep-RL paper: which of π, V, Q, A, P is it estimating, and how is it improving the estimate?

Pitfalls to dodge

V vs Q. V^π(s) is keyed by state alone; Q^π(s, a) is keyed by state-action. They answer different questions.
Markov assumption is on the state. If your state omits relevant context (one Atari frame, no history), it is not Markov. Add the context.
V*, Q* are targets, not estimates. They are defined by the MDP; the agent’s V_θ, Q_θ are what change as it learns.
Bellman is a constraint, not a definition. Value is defined as the expected return; Bellman is the recursion value automatically satisfies.

The one-line version

An MDP is (S, A, P, R, γ) with the Markov property; the value functions V^π, Q^π, A^π satisfy the Bellman equation; every algorithm in this track estimates one of π, V, Q, A, or P from data.