Summary: Function approximation and deep RL
Tables don’t scale; replace them with a parameterized function (linear features or a neural network) and minimize a squared TD error. The Bellman recursion is unchanged; only the representation is. This single move turns tabular Q-learning into deep Q-learning, and the deadly triad (TD + off-policy + function approximation) becomes the engineering problem DQN’s experience replay and target network solve. This summary is the scan-in-five-minutes version of the full lesson.
Core ideas
- Why tables fail. A few thousand discrete states can be tabular; pixels, continuous sensors, joint configurations cannot. The state space is effectively infinite and demands a function approximator from the start.
- The function-approximation move. Replace Q(s, a) with Q_theta(s, a), parameterized by theta. Linear: Q_theta(s, a) = theta . phi(s, a). Neural network: theta is the network’s weights. A small theta represents Q across an enormous (or infinite) state space.
- Objective and update. Minimize Loss(theta) = E[ (target - Q_theta(s, a))^2 ] with target = r + gamma * max_{a'} Q_theta(s', a') (Q-learning). The semi-gradient update is theta <- theta + eta * delta * grad_theta Q_theta(s, a), with delta = target - Q_theta(s, a) the TD error and the target treated as fixed when computing the gradient.
- One step on a linear Q. Q_theta(x) = theta_0 + theta_1 * x; theta = (0, 0); observe (x = 2, r = 1, x' = 3); gamma = 0.9; eta = 0.1. target = 1 + 0.9 * Q(3) = 1, so delta = 1; grad = (1, 2); theta <- (0.1, 0.2). After the update Q(2) = 0.5 and Q(3) = 0.7: one transition moves Q at all x via shared parameters. That is generalization.
- The deadly triad in action. TD bootstrap (target is an estimate) + off-policy (target uses max, not the action taken) + function approximation (updates ripple across all states via theta). Any two are usually fine; all three together can diverge.
- DQN’s two fixes. (1) Experience replay: train on minibatches of transitions sampled at random from a buffer (decorrelation + data reuse). (2) Target network: a copy of the weights, theta-minus, that is held fixed between infrequent updates and used to compute the target, so the live network is not chasing its own tail. Together they make deep Q-learning stable enough to train in practice.
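The one-step linear example above can be checked mechanically. The following is a minimal Python sketch, assuming a single action (so the max over a' is trivial) and features phi(x) = (1, x). The names `q_value` and `semi_gradient_step` are illustrative and do not come from the lesson.

```python
def q_value(theta, x):
    """Linear Q with features phi(x) = (1, x): Q_theta(x) = theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x


def semi_gradient_step(theta, x, r, x_next, gamma=0.9, eta=0.1):
    """One semi-gradient TD update; the target is treated as a constant."""
    # Bootstrapped target. Assumption: one action only, so no max is needed.
    target = r + gamma * q_value(theta, x_next)
    delta = target - q_value(theta, x)      # TD error
    grad = (1.0, x)                         # grad_theta Q_theta(x) for the linear model
    new_theta = tuple(t + eta * delta * g for t, g in zip(theta, grad))
    return new_theta, delta


# The transition from the worked example: theta = (0, 0), x = 2, r = 1, x' = 3.
theta, delta = semi_gradient_step((0.0, 0.0), x=2.0, r=1.0, x_next=3.0)
print("delta:", delta)
print("theta:", theta)
print("Q(2):", q_value(theta, 2.0), "Q(3):", q_value(theta, 3.0))
```

The printed values should agree with the worked numbers up to floating-point rounding. Note that Q(3) changes even though x = 3 was never the state being updated, which is the generalization effect the bullet describes.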
What changes for you
You have the move that turns the algorithms in lessons 2-8 into the deep RL systems you have heard about. Whenever you read a value-based deep RL paper, the recipe is the lesson’s Bellman target + a neural network + experience replay + target network, with extensions on top (double DQN, dueling DQN, prioritized replay, distributional Q-learning, Rainbow). The most actionable takeaway is the deadly triad as a real engineering concern: a “training instability” in a value-based deep RL setting almost always traces to one of the three pieces, and the fixes are the toolkit. The other actionable takeaway is that generalization is the point of the move, and the lesson’s small linear-Q example shows it concretely: one update moves Q at every state. The final lesson takes the same scaling step for the policy side, with policy gradients and the bridge to RLHF.
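To make the recipe concrete, here is a rough sketch of how replay and a target copy fit around the Bellman target. It is not DQN itself: a small linear model (`LinearQ`, a hypothetical stand-in) replaces the neural network so the example runs with the standard library only. The transitions are made up, and `SYNC_EVERY`, `train_step`, and the hyperparameters are illustrative assumptions.

```python
import random
from collections import deque


class LinearQ:
    """Stand-in for the neural network: one weight pair per action over features (1, x)."""

    def __init__(self, n_actions):
        self.theta = [[0.0, 0.0] for _ in range(n_actions)]

    def q(self, x, a):
        return self.theta[a][0] + self.theta[a][1] * x

    def max_q(self, x):
        return max(self.q(x, a) for a in range(len(self.theta)))

    def copy_from(self, other):
        self.theta = [row[:] for row in other.theta]


def train_step(live, target_net, buffer, batch_size=4, gamma=0.9, eta=0.05, rng=random):
    """One minibatch semi-gradient update of `live`, bootstrapping from `target_net`."""
    batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
    if not batch:
        return
    grads = [[0.0, 0.0] for _ in live.theta]
    for x, a, r, x_next, done in batch:
        # The target uses the frozen copy, so it does not move while `live` is updated.
        target = r if done else r + gamma * target_net.max_q(x_next)
        delta = target - live.q(x, a)
        grads[a][0] += delta        # gradient of Q w.r.t. theta_0 is 1
        grads[a][1] += delta * x    # gradient of Q w.r.t. theta_1 is x
    for a, (g0, g1) in enumerate(grads):
        live.theta[a][0] += eta * g0 / len(batch)
        live.theta[a][1] += eta * g1 / len(batch)


# Toy usage with made-up transitions (x, a, r, x_next, done); not a real environment.
rng = random.Random(0)
buffer = deque(maxlen=1000)          # replay buffer: the oldest transitions fall off
for _ in range(200):
    a = rng.randrange(2)
    buffer.append((rng.random(), a, float(a), rng.random(), False))

live, target_net = LinearQ(n_actions=2), LinearQ(n_actions=2)
SYNC_EVERY = 50                      # hypothetical sync period
for step in range(500):
    train_step(live, target_net, buffer, rng=rng)
    if (step + 1) % SYNC_EVERY == 0:
        target_net.copy_from(live)   # periodic hard sync of the target copy
```

The loop is only meant to show where each piece sits: sampling breaks the correlation between consecutive transitions, and the target copy changes only at the sync points. Swapping `LinearQ` for a neural network and the toy buffer for real environment transitions would give the usual structure, under the same assumptions.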