Summary: RL fundamentals (MDPs, value functions, Bellman)
The first two lessons set the stage; this one makes the stage formal. Reinforcement learning is defined against a single mathematical object, the Markov decision process, and the value functions and Bellman equation it gives you are the language every algorithm in the rest of the track speaks. Once you have the formalism, the rest of T18 stops being a parade of acronyms and becomes a small set of recurring patterns. This is the scan-it-in-five-minutes version.
Core ideas
Section titled “Core ideas”- The MDP tuple
(S, A, P, R, γ). StatesS, actionsA, transition probabilitiesP(s' | s, a), reward functionR(s, a), discountγ ∈ [0, 1). The full RL problem is defined by this tuple; everything else in the track is a method for acting in or learning about an MDP. - The Markov property.
P(s_(t+1), r_t | s_t, a_t, history) = P(s_(t+1), r_t | s_t, a_t): future depends only on current state and action. It is a property of the state representation, not the world (one Atari frame: not Markov; four stacked frames: Markov). - The policy
π. Deterministica = π(s)or stochasticπ(a|s). In deep RL it is a neural networkπ_θ. The agent maximizes the expected discounted returnE[G_0] = E[Σ γ^t r_t]. - Three value functions.
V^π(s)(expected return from statesunderπ);Q^π(s, a)(expected return from(s, a), thenπ);A^π(s, a) = Q^π(s, a) - V^π(s)(the advantage, how much better isathanπ’s average froms). Optimal versionsV*,Q*maximize over all policies; the optimal policy is greedy with respect toQ*:π*(s) = argmax_a Q*(s, a). - The Bellman equation.
Q^π(s, a) = R(s, a) + γ · Σ_(s') P(s'|s,a) V^π(s'), withV^π(s) = Σ_a π(a|s) Q^π(s, a). Immediate reward plus discounted expected next-state value. For a fixedπand finite MDP, this is a system of linear equations you can solve. The optimality versionV*(s) = max_a [R(s, a) + γ Σ P(s'|s, a) V*(s')]has amaxoutside and is non-linear (no closed form; solved iteratively). - Worked: 2-state loop,
γ = 0.9.s0 → s1with reward1,s1 → s0with reward0. Bellman givesV(s0) = 1 + γ²·V(s0), soV(s0) = 1/(1-γ²) = 5.263andV(s1) = 4.737. Geometric check: froms0the reward sequence1, 0, 1, 0, ...has discounted sum1 + γ² + γ⁴ + ... = 1/(1-γ²)✓.
What changes for you
Section titled “What changes for you”The rest of T18 stops being a parade of acronyms and becomes a small set of recurring patterns. Every algorithm in the track is estimating one of π, V, Q, A, or P from data, with a neural network, and improving the estimate. Policy-gradient methods (lessons 4, 5, 8) update π_θ by gradient ascent on expected return. Value-based methods (lessons 6, 7) minimize Bellman error on Q_θ and act greedily. Actor-critic methods (lesson 5) train both. Advanced PG methods (lesson 8 TRPO/PPO) use the advantage A^π as a variance-reduced signal. Model-based methods (lessons 9, 10) learn P(s'|s,a) itself. When you read a deep-RL paper, the dispatch question is the same every time: what is this method estimating, and how is it improving the estimate? The same MDP lens also sharpens how you read RL-trained AI systems: “high reward” always means high reward on the chosen (S, A, P, R, γ), with all the state-representation and reward-design choices that implies. Reward hacking is the consequence of optimizing the reward you wrote, not the one you meant. The next lesson takes the most direct route to improving a policy: policy gradients, parameterize π_θ as a neural network and follow the gradient of the expected return with respect to θ.