Skip to content

Summary: RL fundamentals (MDPs, value functions, Bellman)

The first two lessons set the stage; this one makes the stage formal. Reinforcement learning is defined against a single mathematical object, the Markov decision process, and the value functions and Bellman equation it gives you are the language every algorithm in the rest of the track speaks. Once you have the formalism, the rest of T18 stops being a parade of acronyms and becomes a small set of recurring patterns. This is the scan-it-in-five-minutes version.

  • The MDP tuple (S, A, P, R, γ). States S, actions A, transition probabilities P(s' | s, a), reward function R(s, a), discount γ ∈ [0, 1). The full RL problem is defined by this tuple; everything else in the track is a method for acting in or learning about an MDP.
  • The Markov property. P(s_(t+1), r_t | s_t, a_t, history) = P(s_(t+1), r_t | s_t, a_t): future depends only on current state and action. It is a property of the state representation, not the world (one Atari frame: not Markov; four stacked frames: Markov).
  • The policy π. Deterministic a = π(s) or stochastic π(a|s). In deep RL it is a neural network π_θ. The agent maximizes the expected discounted return E[G_0] = E[Σ γ^t r_t].
  • Three value functions. V^π(s) (expected return from state s under π); Q^π(s, a) (expected return from (s, a), then π); A^π(s, a) = Q^π(s, a) - V^π(s) (the advantage, how much better is a than π’s average from s). Optimal versions V*, Q* maximize over all policies; the optimal policy is greedy with respect to Q*: π*(s) = argmax_a Q*(s, a).
  • The Bellman equation. Q^π(s, a) = R(s, a) + γ · Σ_(s') P(s'|s,a) V^π(s'), with V^π(s) = Σ_a π(a|s) Q^π(s, a). Immediate reward plus discounted expected next-state value. For a fixed π and finite MDP, this is a system of linear equations you can solve. The optimality version V*(s) = max_a [R(s, a) + γ Σ P(s'|s, a) V*(s')] has a max outside and is non-linear (no closed form; solved iteratively).
  • Worked: 2-state loop, γ = 0.9. s0 → s1 with reward 1, s1 → s0 with reward 0. Bellman gives V(s0) = 1 + γ²·V(s0), so V(s0) = 1/(1-γ²) = 5.263 and V(s1) = 4.737. Geometric check: from s0 the reward sequence 1, 0, 1, 0, ... has discounted sum 1 + γ² + γ⁴ + ... = 1/(1-γ²) ✓.

The rest of T18 stops being a parade of acronyms and becomes a small set of recurring patterns. Every algorithm in the track is estimating one of π, V, Q, A, or P from data, with a neural network, and improving the estimate. Policy-gradient methods (lessons 4, 5, 8) update π_θ by gradient ascent on expected return. Value-based methods (lessons 6, 7) minimize Bellman error on Q_θ and act greedily. Actor-critic methods (lesson 5) train both. Advanced PG methods (lesson 8 TRPO/PPO) use the advantage A^π as a variance-reduced signal. Model-based methods (lessons 9, 10) learn P(s'|s,a) itself. When you read a deep-RL paper, the dispatch question is the same every time: what is this method estimating, and how is it improving the estimate? The same MDP lens also sharpens how you read RL-trained AI systems: “high reward” always means high reward on the chosen (S, A, P, R, γ), with all the state-representation and reward-design choices that implies. Reward hacking is the consequence of optimizing the reward you wrote, not the one you meant. The next lesson takes the most direct route to improving a policy: policy gradients, parameterize π_θ as a neural network and follow the gradient of the expected return with respect to θ.