Skip to content

RL fundamentals (MDPs, returns, value, and policy)

Lesson 1 framed the agent-environment loop; lesson 2 walked through behavioral cloning and the way it fails. This lesson makes the loop precise. The single capability it builds: state an RL problem as an MDP, write down its value and Q-functions, recognize the Bellman recursion, and solve a small MDP by hand, so the rest of the track can derive its algorithms rather than assert them.

You will meet the MDP tuple (S, A, P, R, γ) (states, actions, transition probabilities, reward function, discount), the Markov property (and why it is a property of the state representation, not the world: one Atari frame is not Markov, four stacked frames is), the policy π (deterministic or stochastic; in deep RL a neural network), and the three value functions: V^π(s) (expected return from state s under π), Q^π(s, a) (expected return from (s, a) then π), and the advantage A^π(s, a) = Q^π(s, a) - V^π(s) (how much better is action a than π’s average from s). You will derive the Bellman equation V^π(s) = Σ_a π(a|s) [R(s,a) + γ Σ_(s') P(s'|s,a) V^π(s')], see that for a fixed policy it is a system of linear equations, solve a 2-state loop MDP by hand (V(s0) = 1/(1-γ²) = 5.263 at γ = 0.9), verify against the discounted geometric sum, and meet the Bellman optimality equation V*(s) = max_a [...] plus the consequence that the optimal policy is greedy with respect to Q*. The lesson closes with the deep-RL dispatch table: every later algorithm in the track estimates one of π, V, Q, A, or P.

This is lesson 3 of Phase 1 (RL foundations), the formal-language lesson the rest of the track is defined against. Lesson 4 takes the first algorithmic move (policy gradients) using the value-function vocabulary built here; lesson 5 adds the critic to get actor-critic; lessons 6-7 move to value-based RL (Q-learning, DQN), which is direct Bellman-error minimization on Q; lesson 8’s TRPO and PPO use the advantage A^π introduced here; lessons 9-10 add a learned model P_φ. Across the track, every algorithm fits the dispatch table this lesson ends on.

Prerequisite (within this track): lesson 2, Imitation learning and behavioral cloning, for the sequential context (lesson 2 motivated the need for the formal RL language by showing how BC fails on long horizons). The lesson assumes comfort with summation notation, conditional expectations, and small systems of linear equations; if any of those is rusty, T4 (Linear Algebra) covers the algebra and Sutton & Barto Chapter 3 covers the conditional expectations. Cross-track: T17 (RL Foundations, in parallel) covers the classical-RL material at greater depth (value iteration, policy iteration, temporal-difference learning); this lesson assumes the MDP material in T17 Chapters 3-4 (or Sutton-and-Barto Chapter 3) at orientation level.

  • State the MDP tuple (S, A, P, R, γ) and the Markov property, and explain that the property is a feature of the state representation rather than the world
  • Define the policy (deterministic and stochastic), the return, and the three value functions V^π, Q^π, A^π = Q^π - V^π
  • Write the Bellman equation for a fixed policy and explain why it is a linear system for finite MDPs
  • Solve a small Bellman system by hand and verify it against the direct discounted geometric sum
  • Recognize the optimal value functions V*, Q* and the Bellman optimality recursion, and read the deep-RL “dispatch table” of what each algorithm family estimates
  • Read time: about 13 minutes
  • Practice time: about 13 minutes (a fresh 2-state Bellman solve with geometric verification, an MDP-vs-non-Markov scenario drill, and flashcards)
  • Difficulty: standard