References: Function approximation and deep RL

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 6:
Value Function Approximation
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the value-function-approximation
material in Silver's Lecture 6 (linear and neural-network parameterization,
the semi-gradient TD update, the deadly triad) and adds the DQN bridge with
the two engineering fixes (experience replay, target network) at the close.
Two pieces are Clawdemy's own framing, designed to make the "function
approximation is delicate" intuition concrete: the one-feature linear-Q
worked example, which shows generalization across states from a single
transition, and the practical diagnostic for overshoot caused by a step
size that is too large. Exact per-lecture URLs are verified at promotion.
The DQN paper reference (Mnih et al. 2015) is given in "Going deeper" below.

Going deeper

A short, durable list. Both are free.

  • Mnih et al., “Human-level control through deep reinforcement learning” (Nature, 2015): the DQN paper. It presents the full algorithm (Q-learning + a deep convolutional network + experience replay + target network) and the Atari results that catalyzed modern deep RL. Widely available online.
  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 11 (Off-policy Methods with Approximation). The textbook treatment of the deadly triad and its practical remedies, including the theory of why off-policy bootstrapping with function approximation can diverge, which is the problem the fixes DQN uses are meant to tame.
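
As a rough illustration of the two engineering fixes named in the DQN entry above, the sketch below pairs a replay buffer with a periodically synced target. It is a simplification under stated assumptions, not the algorithm from the paper: a linear Q with one feature stands in for the deep convolutional network, updates are applied one sample at a time rather than as an averaged minibatch gradient, and every name and number (ReplayBuffer, replay_update, sync_every, the feature values) is hypothetical.

```python
import random
from collections import deque


def q_value(w, phi):
    """Linear Q estimate standing in for the deep network (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, phi))


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def replay_update(w, w_target, batch, alpha=0.1, gamma=0.9):
    """Apply one semi-gradient Q-learning update per sampled transition."""
    for phi_sa, reward, next_phis, done in batch:
        # The bootstrap term uses the frozen target weights, not the live ones.
        bootstrap = 0.0 if done else max(q_value(w_target, p) for p in next_phis)
        td_error = reward + gamma * bootstrap - q_value(w, phi_sa)
        w = [wi + alpha * td_error * xi for wi, xi in zip(w, phi_sa)]
    return w


# Toy run: transitions are (features of (s, a), reward, next features, done).
buffer = ReplayBuffer(capacity=2)
for reward in (0.0, 1.0, 1.0):
    buffer.add(([1.0], reward, [[2.0]], False))  # the first one is evicted

w, w_target = [0.0], [0.0]
sync_every = 2  # hypothetical sync period
for step in range(1, 5):
    w = replay_update(w, w_target, buffer.sample(1, random.Random(step)))
    if step % sync_every == 0:
        w_target = list(w)  # copy the live weights into the target
```

Between syncs the bootstrap target stays fixed while the live weights move, which is the stabilizing effect the target network is meant to provide; the buffer decouples the order in which transitions are learned from the order in which they were experienced.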

Where this leads inside this track.

  • Q-learning: model-free control. The previous lesson. This lesson takes Q-learning’s exact update and replaces the table with a function approximator; the recursion does not change.
  • Temporal-difference learning. Lesson 7. The deadly triad was named there; this lesson is where the third leg (function approximation) is added and the triad becomes a real engineering concern.
  • Policy gradient and the path to modern RL. The next lesson and the close of the track. The function-approximation move made here on V and Q is repeated there for the policy itself: that lesson learns a parameterized policy directly. Together, the two halves (value-based and policy-based) cover the modern landscape and bridge to RLHF.
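
The Q-learning entry above notes that the recursion does not change when the table is replaced by a function approximator. A minimal sketch of that idea follows, assuming a linear Q with a single hand-picked feature. The function names, feature values, step size, and discount are illustrative assumptions, not taken from the lesson.

```python
def q_value(w, phi):
    """Linear Q estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def semi_gradient_q_update(w, phi_sa, reward, next_phis,
                           alpha=0.1, gamma=0.9, done=False):
    """One Q-learning step with a linear approximator instead of a table."""
    # Same target as tabular Q-learning: r + gamma * max over next actions.
    bootstrap = 0.0 if done else max(q_value(w, p) for p in next_phis)
    td_error = reward + gamma * bootstrap - q_value(w, phi_sa)
    # Semi-gradient: the target is treated as a constant, so only Q(s, a)
    # is differentiated. For a linear Q that gradient is just phi_sa.
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi_sa)]


# One transition, one feature (all values are made up for illustration).
w = [0.0]
w_new = semi_gradient_q_update(
    w, phi_sa=[1.0], reward=1.0, next_phis=[[0.5], [2.0]]
)
print(w_new)                   # [0.1]
print(q_value(w_new, [2.0]))   # 0.2, an estimate for a pair never updated
```

The last line is the point of the exercise: a single transition moved the estimate for a state-action pair that was never visited, because both pairs share the same weight. That sharing is what makes generalization possible and also what makes the update delicate.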