
Function approximation and deep RL

This is lesson 9 of Track 17 (Reinforcement Learning Foundations) and the opener of Phase 4 (Scaling up). The algorithms in Phases 2 and 3 stored V or Q in tables. Real state spaces (Atari pixels, Go boards, robot joint configurations) are too big to enumerate. The fix is function approximation: replace the table with a parameterized function (linear features, or a neural network), keep the same Bellman recursion, and let the model generalize across states. That single move turns tabular Q-learning into deep Q-learning, and it is what made the Atari, Go, and modern robotics breakthroughs possible. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.

The lesson explains why tables fail, writes the squared-TD-error loss and the semi-gradient update, and walks one update on a small linear-Q example to show how a single transition moves Q at every state via shared parameters (the generalization payoff). It then names the deadly triad as the reason naive deep Q-learning can diverge and shows how DQN’s two engineering fixes (experience replay and a target network) tame it. Practice repeats the gradient step on fresh numbers and includes a step-size-too-large diagnostic that makes the “function approximation is delicate” intuition tangible.
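For orientation, the loss and update named above can be written roughly as follows. This is a sketch in assumed notation, not necessarily the lesson's own: w for the parameter vector, alpha for the step size, and the factor of 1/2 (a common convention that cancels the 2 from differentiation) are all choices made here for illustration.

```latex
\begin{align*}
y &= r + \gamma \max_{a'} Q(s', a'; w) && \text{(bootstrapped target, held constant)} \\
L(w) &= \tfrac{1}{2}\bigl(y - Q(s, a; w)\bigr)^2 && \text{(squared TD error)} \\
w &\leftarrow w + \alpha \bigl(y - Q(s, a; w)\bigr)\, \nabla_w Q(s, a; w) && \text{(semi-gradient update)}
\end{align*}
```

The update is "semi" because the target y also depends on w, but its gradient is deliberately ignored; only the prediction Q(s, a; w) is differentiated.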

As lesson 9 of 10 and the first lesson of Phase 4 (the final phase, Scaling up), it takes everything from Phases 2 and 3 (Bellman recursion, TD bootstrap, off-policy Q-learning) and applies it to a parameterized representation; the recursion is unchanged. The next and final lesson, Policy gradient and the path to modern RL, makes the analogous move on the policy side and bridges to RLHF, completing the track.

Prerequisites: the previous lesson (Q-learning) for the target r + gamma * max_{a'} Q(s', a') and the off-policy property that is one leg of the deadly triad; lesson 7 (TD learning) for the bootstrap leg that was named there. Comfort with gradients (partial derivatives of a scalar with respect to a vector of parameters) and basic gradient descent is the only computational background; the linear-Q example uses one-feature partial derivatives that can be read off by inspection.

The arithmetic is hand-sized: at each step, compute Q at the current and next states, the TD error, and one gradient (which for a one-feature linear Q is just (1, x)). The semi-gradient update is one multiplication per parameter. There are no proofs: the deadly-triad story is told in words with the diagnostic intuition (each update changes parameters that are shared by every state), and the DQN fixes are described mechanically.
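The per-step arithmetic described above might look like the following in Python. This is a minimal sketch under stated assumptions: Q is taken to be w[0] + w[1] * x for a single scalar feature x of a state-action pair, and the function names, weights, and transition numbers are all made up for illustration rather than taken from the lesson's worked example.

```python
def q_value(w, x):
    """One-feature linear Q (assumed form): Q = w[0] + w[1] * x."""
    return w[0] + w[1] * x


def semi_gradient_step(w, x, reward, next_xs, alpha, gamma):
    """Apply one semi-gradient Q-learning update and return (new_w, td_error).

    next_xs holds the feature value of each action available in the next state.
    """
    # Bootstrapped target, treated as a constant: no gradient flows through it.
    target = reward + gamma * max(q_value(w, xn) for xn in next_xs)
    td_error = target - q_value(w, x)
    # Gradient of Q with respect to (w[0], w[1]) is (1, x).
    grad = (1.0, x)
    new_w = (w[0] + alpha * td_error * grad[0],
             w[1] + alpha * td_error * grad[1])
    return new_w, td_error


# Hypothetical numbers, chosen only to keep the arithmetic small.
w = (0.0, 1.0)
new_w, delta = semi_gradient_step(
    w, x=1.0, reward=1.0, next_xs=[2.0, 3.0], alpha=0.1, gamma=0.9
)

# Q at a feature value that was never visited also changes,
# because both weights are shared across all inputs.
q_before = q_value(w, 5.0)
q_after = q_value(new_w, 5.0)
```

With these numbers the target is 1 + 0.9 * 3 = 3.7, the TD error is 2.7, and the weights move from (0, 1) to about (0.27, 1.27). Q at the unvisited feature value x = 5 moves from 5.0 to about 6.62, which is the generalization effect in miniature.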

  • Explain why tabular value methods do not scale to large state spaces and what function approximation buys
  • Write the squared-TD-error loss and the semi-gradient update for value-based learning with a parameterized Q
  • Compute one semi-gradient update on a small linear-Q example
  • Name the deadly triad (TD bootstrap + off-policy + function approximation) and explain why it can make learning diverge
  • Explain how DQN’s experience replay and target network address the deadly triad and enable deep value-based RL
  • Read time: about 13 minutes
  • Practice time: about 16 minutes (a self-check, a fresh linear-Q gradient step with a step-size diagnostic, a deadly-triad-and-DQN-fixes matching drill, and flashcards)
  • Difficulty: standard (small arithmetic; the conceptual challenge is the deadly triad and why the engineering fixes are essential, not optional)
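The two DQN fixes named in the objectives above can be sketched mechanically as follows. This is an illustrative toy, not DQN itself: it reuses an assumed one-feature linear Q in place of a neural network, applies per-sample updates instead of an averaged minibatch gradient, and the class name, function names, transition format, and the three-state chain are all hypothetical.

```python
import random
from collections import deque


def q_value(w, x):
    """One-feature linear Q (assumed form): Q = w[0] + w[1] * x."""
    return w[0] + w[1] * x


class ReplayBuffer:
    """Fixed-capacity store of past transitions; the oldest are dropped first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions.
        return rng.sample(list(self.data), min(batch_size, len(self.data)))


# Hypothetical transitions: (x, reward, next_xs, done).
CHAIN = [
    (0.0, 0.0, [0.5], False),
    (0.5, 0.0, [1.0], False),
    (1.0, 1.0, [], True),
]


def train(transitions, steps=200, alpha=0.05, gamma=0.9,
          sync_every=20, batch_size=4, seed=0):
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=100)
    for t in transitions:
        buffer.add(t)

    w = [0.0, 0.0]        # online parameters, updated every step
    w_target = list(w)    # frozen copy, used only to compute targets

    for step in range(1, steps + 1):
        for x, r, next_xs, done in buffer.sample(batch_size, rng):
            # Targets come from the frozen copy, so they stay fixed
            # between syncs instead of moving with every update.
            bootstrap = 0.0 if done else gamma * max(
                q_value(w_target, xn) for xn in next_xs)
            td_error = r + bootstrap - q_value(w, x)
            w[0] += alpha * td_error          # dQ/dw[0] = 1
            w[1] += alpha * td_error * x      # dQ/dw[1] = x
        if step % sync_every == 0:
            w_target = list(w)                # periodic hard sync
    return w, w_target


w, w_target = train(CHAIN)
```

The replay buffer addresses correlated, rapidly shifting training data, and the frozen copy addresses the moving-target problem created by bootstrapping from the same parameters that are being updated. The capacity, batch size, and sync interval here are arbitrary small values chosen so the sketch runs instantly.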