Function approximation and deep RL
What you’ll learn
This is lesson 9 of Track 17 (Reinforcement Learning Foundations) and the opener of Phase 4 (Scaling up). The algorithms in Phases 2 and 3 stored V or Q in tables. Real state spaces (Atari pixels, Go boards, robot joint configurations) are too big to enumerate. The fix is function approximation: replace the table with a parameterized function (linear features, or a neural network), keep the same Bellman recursion, and let the model generalize across states. That single move turns tabular Q-learning into deep Q-learning, and it is what made the Atari, Go, and modern robotics breakthroughs possible. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.
The lesson explains why tables fail, writes the squared-TD-error loss and the semi-gradient update, and walks one update on a small linear-Q example to show how a single transition moves Q at every state via shared parameters (the generalization payoff). It then names the deadly triad as the reason naive deep Q-learning can diverge and shows how DQN’s two engineering fixes (experience replay and a target network) tame it. Practice repeats the gradient step on fresh numbers and includes a step-size-too-large diagnostic that makes the “function approximation is delicate” intuition tangible.
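For reference, the loss and update named above can be written as follows. This is the standard textbook form, not a quotation from the lesson body: the parameter symbol θ, the step size α, and the factor 1/2 are notational assumptions, and the target matches the Q-learning target r + gamma * max_{a'} Q(s', a') from the previous lesson.

```latex
% TD error for one transition (s, a, r, s')
\delta = r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)

% Squared-TD-error loss (the 1/2 is a convenience that cancels when differentiating)
L(\theta) = \tfrac{1}{2}\, \delta^{2}

% Semi-gradient update: the target is treated as a constant,
% so only Q(s, a; \theta) is differentiated
\theta \leftarrow \theta + \alpha\, \delta\, \nabla_{\theta} Q(s, a; \theta)
```

The "semi" refers to the fact that the bootstrapped target also depends on θ, but that dependence is ignored when taking the gradient.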
Where this fits
This is lesson 9 of 10 and the first lesson of Phase 4 (the final phase, Scaling up). It takes everything from Phases 2 and 3 (Bellman recursion, TD bootstrap, off-policy Q-learning) and applies it to a parameterized representation; the recursion is unchanged. The next and final lesson, Policy gradient and the path to modern RL, does the analogous move for the policy side and bridges to RLHF, completing the track.
Before you start
Prerequisites: the previous lesson (Q-learning) for the target r + gamma * max_{a'} Q(s', a') and the off-policy property that is one leg of the deadly triad; lesson 7 (TD learning) for the bootstrap leg that was named there. Comfort with gradients (partial derivatives of a scalar with respect to a vector of parameters) and basic gradient descent is the only computational background; the linear-Q example uses one-feature partial derivatives that can be read off by inspection.
About the math
The arithmetic is hand-sized: at each step, compute Q at the current and next states, the TD error, and one gradient (which for linear Q is just (1, x)). The semi-gradient update is one multiplication per parameter. No proofs; the deadly-triad story is told in words with the diagnostic intuition (the gradient propagates through parameters that touch every state), and the DQN fixes are described mechanically.
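The steps listed above can be sketched in a few lines. This is a minimal illustration, assuming a one-feature linear Q of the form Q = w[0] + w[1] * x (which is what gives the (1, x) gradient); the function names, the feature values, the reward, alpha, and gamma are all hypothetical numbers chosen for the sketch, not the lesson's worked example.

```python
import numpy as np

def q_value(w, x):
    """Linear Q with a bias and one feature: Q = w[0] + w[1] * x."""
    return w[0] + w[1] * x

def semi_gradient_step(w, x, r, next_xs, alpha, gamma):
    """One semi-gradient Q-learning update; returns (new_w, td_error).

    next_xs holds the feature of each action available in the next state;
    an empty list marks a terminal transition (hypothetical convention).
    """
    q_sa = q_value(w, x)
    q_next = max((q_value(w, xn) for xn in next_xs), default=0.0)
    target = r + gamma * q_next          # treated as a constant
    td_error = target - q_sa
    grad = np.array([1.0, x])            # dQ/dw[0] = 1, dQ/dw[1] = x
    return w + alpha * td_error * grad, td_error

w = np.array([0.0, 1.0])
new_w, delta = semi_gradient_step(
    w, x=2.0, r=1.0, next_xs=[1.0, 3.0], alpha=0.1, gamma=0.9
)
print(delta)                                   # TD error: 3.7 - 2.0, about 1.7
print(new_w)                                   # about [0.17, 1.34]
print(q_value(w, 5.0), q_value(new_w, 5.0))    # Q at an unvisited x: 5.0 -> about 6.87
```

The last line is the generalization payoff in miniature: the transition never involved x = 5, yet Q there moved because both parameters are shared across all states.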
By the end, you’ll be able to
- Explain why tabular value methods do not scale to large state spaces and what function approximation buys
- Write the squared-TD-error loss and the semi-gradient update for value-based learning with a parameterized Q
- Compute one semi-gradient update on a small linear-Q example
- Name the deadly triad (TD bootstrap + off-policy + function approximation) and explain why combining all three can make learning diverge
- Explain how DQN’s experience replay and target network address the deadly triad and enable deep value-based RL
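The two DQN fixes in the last objective can be sketched on a one-feature linear Q (bias plus one weight, matching the (1, x) gradient). This is not the published DQN, which uses a neural network and minibatch gradient steps; it only illustrates the two mechanisms. The function names, the three-transition dataset, and all hyperparameters are hypothetical.

```python
import random
from collections import deque

import numpy as np

def q_value(w, x):
    """Linear Q with a bias and one feature: Q = w[0] + w[1] * x."""
    return w[0] + w[1] * x

def train_with_replay_and_target(transitions, steps=200, batch_size=4,
                                 alpha=0.05, gamma=0.9, sync_every=20, seed=0):
    rng = random.Random(seed)
    replay = deque(transitions, maxlen=1000)  # experience replay buffer
    w = np.zeros(2)                           # online parameters
    w_target = w.copy()                       # frozen copy used only for targets
    for step in range(1, steps + 1):
        # Fix 1: sample stored transitions at random instead of
        # learning from consecutive, correlated experience.
        batch = rng.sample(list(replay), min(batch_size, len(replay)))
        for x, r, next_xs in batch:
            # Fix 2: bootstrap from the frozen parameters, so the
            # target does not shift with every update to w.
            q_next = max((q_value(w_target, xn) for xn in next_xs), default=0.0)
            td_error = r + gamma * q_next - q_value(w, x)
            w = w + alpha * td_error * np.array([1.0, x])
        if step % sync_every == 0:
            w_target = w.copy()               # periodic target sync
    return w

# Hypothetical chain: x = 0.0 -> 0.5 -> 1.0 -> terminal with reward 1.
# Each tuple is (feature, reward, features of next-state actions).
transitions = [(0.0, 0.0, [0.5]), (0.5, 0.0, [1.0]), (1.0, 1.0, [])]
w = train_with_replay_and_target(transitions)
print(w, q_value(w, 1.0))
```

Between syncs the targets are fixed, so each stretch of updates behaves like ordinary regression toward stationary values, which is the stabilizing effect the target network is meant to provide.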
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 16 minutes (a self-check, a fresh linear-Q gradient step with a step-size diagnostic, a deadly-triad-and-DQN-fixes matching drill, and flashcards)
- Difficulty: standard (small arithmetic; conceptual challenge is the deadly triad and why the engineering fixes are essential, not optional)
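One way to see what a step-size diagnostic can look like is to apply the same transition repeatedly and watch the size of the TD error. The sketch below is an assumption about the general idea, not the lesson's actual exercise: it uses a terminal transition (so the target is just r) with hypothetical values x = 2.0 and r = 1.0 on a linear Q with gradient (1, x).

```python
import numpy as np

def repeated_updates(alpha, x=2.0, r=1.0, steps=5):
    """Apply one terminal transition repeatedly.

    Returns the absolute TD error measured before each update.
    """
    w = np.zeros(2)
    errors = []
    for _ in range(steps):
        td_error = r - (w[0] + w[1] * x)   # terminal: target is just r
        errors.append(abs(td_error))
        w = w + alpha * td_error * np.array([1.0, x])
    return errors

small = repeated_updates(alpha=0.1)
large = repeated_updates(alpha=0.5)
print(small)   # [1.0, 0.5, 0.25, 0.125, 0.0625]
print(large)   # [1.0, 1.5, 2.25, 3.375, 5.0625]
```

Each update multiplies the error by 1 - alpha * (1 + x^2). With x = 2 that factor is 0.5 for alpha = 0.1, so the error shrinks, and -1.5 for alpha = 0.5, so every step overshoots and the error grows.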