Cheatsheet: Function approximation and deep RL
The one idea
Replace the table with a parameterized function Q_theta. Minimize the squared TD error against the Bellman target with a semi-gradient step. Deal with the deadly triad (TD + off-policy + function approximation) via DQN’s experience replay and target network.
The objective and update
    Loss(theta) = E[ ( target - Q_theta(s, a) )^2 ]
    target = r + gamma * max_{a'} Q_theta(s', a')    (Q-learning)
Semi-gradient update (target treated as fixed when computing the gradient; the factor 2 from the square is absorbed into eta):
    delta = target - Q_theta(s, a)
    theta <- theta + eta * delta * grad_theta Q_theta(s, a)
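As a minimal sketch of the update above, assuming a linear Q of the form Q_theta(s, a) = theta · phi(s, a): the function names, the feature-vector arguments, and the convention that a terminal next state is passed as an empty list are all illustrative choices, not part of the cheatsheet. The point it shows is that the target is computed as a plain number, so no gradient flows through it.

```python
from typing import List, Sequence


def q_value(theta: Sequence[float], phi: Sequence[float]) -> float:
    """Linear Q: dot product of parameters and features."""
    return sum(t * f for t, f in zip(theta, phi))


def semi_gradient_q_step(
    theta: Sequence[float],
    phi_sa: Sequence[float],
    reward: float,
    next_phis: Sequence[Sequence[float]],
    gamma: float,
    eta: float,
) -> List[float]:
    """One semi-gradient Q-learning step for a linear Q.

    phi_sa    -- features of the (s, a) pair that was taken
    next_phis -- features of every action available in s'
                 (assumption: an empty list means s' is terminal)
    """
    # The target is evaluated to a number and then held fixed.
    best_next = max((q_value(theta, p) for p in next_phis), default=0.0)
    target = reward + gamma * best_next
    delta = target - q_value(theta, phi_sa)
    # For a linear Q, grad_theta Q_theta(s, a) is just the feature vector.
    return [t + eta * delta * f for t, f in zip(theta, phi_sa)]
```

With a neural network in place of the linear form, the same effect is usually obtained by blocking gradients through the target rather than by hand-writing the gradient.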
Worked one-step (linear Q)
Q_theta(x) = theta_0 + theta_1 * x, starting from theta = (0, 0).
Observe (x = 2, r = 1, x' = 3), with gamma = 0.9 and eta = 0.1.
    Q(2) = 0, Q(3) = 0.
    target = 1 + 0.9 * 0 = 1.
    delta = 1 - 0 = 1.
    grad = (1, x) = (1, 2).
    theta <- (0 + 0.1*1*1, 0 + 0.1*1*2) = (0.1, 0.2).
After the update: Q(2) = 0.5, Q(3) = 0.7, Q(0) = 0.1.
=> ONE transition moves Q at EVERY x via the two shared parameters.
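The worked one-step can be checked mechanically. The sketch below replays the same transition in Python; the helper names `q` and `td_step` are made up for illustration.

```python
def q(theta, x):
    """Linear Q from the worked example: theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x


def td_step(theta, x, r, x_next, gamma, eta):
    """One semi-gradient TD step; the target is held fixed."""
    target = r + gamma * q(theta, x_next)
    delta = target - q(theta, x)
    grad = (1.0, x)  # dQ/dtheta_0 = 1, dQ/dtheta_1 = x
    return (theta[0] + eta * delta * grad[0],
            theta[1] + eta * delta * grad[1])


theta = td_step((0.0, 0.0), x=2, r=1.0, x_next=3, gamma=0.9, eta=0.1)
print(theta)                                        # (0.1, 0.2)
print([round(q(theta, x), 6) for x in (2, 3, 0)])   # [0.5, 0.7, 0.1]
```

Only x = 2 was observed, yet Q(3) and Q(0) changed as well, which is the generalization effect the example is meant to show.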
The deadly triad
ANY TWO are usually fine. ALL THREE together can diverge:
    TD bootstrap             -- target uses an estimate Q_theta(s')
    + off-policy             -- target uses the max over next actions, not the action actually taken next
    + function approximation -- updates change Q everywhere via shared parameters
=> Updates can chase moving targets that they themselves move.
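A standard textbook illustration of the triad, not taken from this cheatsheet, is the two-state "w, 2w" setup: one shared parameter w, a first state valued w (feature 1), a second state valued 2w (feature 2), and a reward-0 transition from the first to the second. If only that transition is ever updated, as an off-policy data distribution can arrange, each step gives w <- w * (1 + eta * (2*gamma - 1)), which grows without bound whenever gamma > 0.5. The function name and default values below are illustrative assumptions.

```python
def run_w2w(w0=1.0, gamma=0.9, eta=0.1, steps=50):
    """Repeat the semi-gradient TD update on the single first->second transition."""
    w = w0
    history = [w]
    for _ in range(steps):
        target = 0.0 + gamma * (2.0 * w)  # bootstrap: value of the second state is 2w
        delta = target - 1.0 * w          # value of the first state is w
        w = w + eta * delta * 1.0         # gradient of (w * 1) w.r.t. w is the feature, 1
        history.append(w)
    return history


diverging = run_w2w(gamma=0.9)   # per-step factor 1 + 0.1 * 0.8 = 1.08
shrinking = run_w2w(gamma=0.4)   # per-step factor 1 + 0.1 * (-0.2) = 0.98
print(diverging[-1] > diverging[0], shrinking[-1] < shrinking[0])
```

All three ingredients are present: the target bootstraps from 2w, the second state's own outgoing transitions are never sampled, and both states share w.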
DQN’s two fixes
| Fix | What it does | Why it helps |
|---|---|---|
| Experience replay | Store transitions in a buffer; sample random minibatches | Decorrelation (closer to i.i.d. for SGD) + data reuse (each transition contributes to many updates) |
| Target network | Frozen copy theta-minus of the Q-network, used to compute the target and synced only periodically | Live Q is not chasing its own tail; the regression goal stays fixed between syncs |
DQN recipe = Q-learning + (C)NN + experience replay + target network. Atari-at-human-level (Mnih et al. 2015) is this combination.
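The two fixes can be sketched in a few lines of plain Python. This is not the Mnih et al. implementation. Everything concrete here is an assumption made for illustration: the 5-state chain environment, one-hot linear features standing in for the (C)NN, a uniformly random behavior policy, a hard sync every `sync_every` update rounds, and all hyperparameter values.

```python
import random
from collections import deque

N_STATES, N_ACTIONS = 5, 2  # hypothetical chain; action 0 = left, 1 = right


def step(state, action):
    """Toy environment: reward 1 on reaching the right end, which is terminal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def features(state, action):
    """One-hot features over (state, action); stands in for the network."""
    phi = [0.0] * (N_STATES * N_ACTIONS)
    phi[state * N_ACTIONS + action] = 1.0
    return phi


def q_value(theta, state, action):
    return sum(t * f for t, f in zip(theta, features(state, action)))


def train(episodes=200, gamma=0.9, eta=0.1, batch_size=8,
          buffer_size=500, sync_every=20, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * N_ACTIONS)  # live parameters
    theta_minus = list(theta)               # target-network parameters
    buffer = deque(maxlen=buffer_size)      # experience replay buffer
    rounds = 0
    for _ in range(episodes):
        state, done, t = 0, False, 0
        while not done and t < 50:
            action = rng.randrange(N_ACTIONS)  # random behavior => off-policy data
            nxt, reward, done = step(state, action)
            buffer.append((state, action, reward, nxt, done))
            state, t = nxt, t + 1
            if len(buffer) < batch_size:
                continue
            # Experience replay: a random minibatch, not just the latest transition.
            for s, a, r, s2, d in rng.sample(list(buffer), batch_size):
                # Target network: bootstrap from the frozen copy theta_minus.
                best_next = 0.0 if d else max(
                    q_value(theta_minus, s2, b) for b in range(N_ACTIONS))
                delta = r + gamma * best_next - q_value(theta, s, a)
                theta = [w + eta * delta * f
                         for w, f in zip(theta, features(s, a))]
            rounds += 1
            if rounds % sync_every == 0:
                theta_minus = list(theta)   # periodic sync
    return theta


theta = train()
greedy = [max(range(N_ACTIONS), key=lambda a, s=s: q_value(theta, s, a))
          for s in range(N_STATES - 1)]
print(greedy)
```

Between syncs, the targets depend only on `theta_minus`, so the minibatch updates amount to ordinary regression toward fixed numbers.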
Step-size caveat (function approximation makes this delicate)
grad_theta Q includes the feature value (linear Q: dQ/d theta_1 = x).
-> Large features amplify updates.
-> A too-large eta can overshoot the target dramatically.
=> Feature scaling and careful eta selection matter.
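To make the caveat concrete, the sketch below repeats the one-step linear update from theta = (0, 0) with the same eta = 0.1 for a small and a large feature value. For Q_theta(x) = theta_0 + theta_1 * x, one step changes Q(x) by eta * delta * (1 + x^2), so the effective step size grows with the square of the feature. The helper name and the choice x' = x + 1 are illustrative assumptions.

```python
def one_step(x, x_next, r=1.0, gamma=0.9, eta=0.1):
    """Single semi-gradient step from theta = (0, 0).

    Returns (target, Q(x) after the update).
    """
    theta0, theta1 = 0.0, 0.0
    target = r + gamma * (theta0 + theta1 * x_next)
    delta = target - (theta0 + theta1 * x)
    theta0 += eta * delta * 1.0  # dQ/dtheta_0 = 1
    theta1 += eta * delta * x    # dQ/dtheta_1 = x
    return target, theta0 + theta1 * x


for x in (2.0, 20.0):
    target, q_after = one_step(x, x + 1.0)
    print(f"x={x}: target={target}, Q(x) after update={q_after}")
```

With x = 2 the estimate moves halfway to the target of 1 (Q = 0.5); with x = 20 the same eta sends it to about 40.1, far past the target. Rescaling the feature (for example dividing x by 20) or shrinking eta removes the overshoot.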
Pitfalls to dodge
- Believing the Bellman recursion changes with function approximation (it doesn’t).
- Using full residual-gradient instead of semi-gradient (theoretically nicer, empirically worse).
- Running deep Q-learning without target net or experience replay (deadly triad bites).
- Treating function approximation as just a “scaling” issue (it introduces real algorithmic challenges).
- Confusing the target network with a separate behavior or target policy (it’s a periodically synced copy of the same Q network, with its weights frozen between syncs).
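The semi-gradient versus residual-gradient pitfall is easier to see side by side. The sketch below reuses the linear Q and the transition (x = 2, r = 1, x' = 3) from the worked one-step and assumes a deterministic transition. The semi-gradient step uses only grad Q(x), while the residual-gradient step also differentiates through the target, giving the direction grad Q(x) - gamma * grad Q(x'). The function names are illustrative.

```python
def q(theta, x):
    return theta[0] + theta[1] * x


def both_updates(theta, x, r, x_next, gamma, eta):
    """Return (semi_gradient_theta, residual_gradient_theta) after one step."""
    delta = r + gamma * q(theta, x_next) - q(theta, x)
    phi, phi_next = (1.0, x), (1.0, x_next)  # gradients of Q at x and at x'
    semi = tuple(t + eta * delta * f for t, f in zip(theta, phi))
    residual = tuple(t + eta * delta * (f - gamma * fn)
                     for t, f, fn in zip(theta, phi, phi_next))
    return semi, residual


semi, residual = both_updates((0.0, 0.0), x=2, r=1.0, x_next=3, gamma=0.9, eta=0.1)
print(semi)      # approximately (0.1, 0.2)
print(residual)  # approximately (0.01, -0.07)
```

The residual step shrinks the TD error partly by pushing Q(x') down rather than pulling Q(x) up, which is one intuition for why it tends to learn more slowly in practice.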
Words to use precisely
- Q_theta(s, a): parameterized action-value function (linear features or NN).
- Semi-gradient: gradient of the loss treating the target as fixed.
- Deadly triad: TD bootstrap + off-policy + function approximation; potential divergence.
- Experience replay: buffer + random minibatches; decorrelation + data reuse.
- Target network theta-minus: frozen Q copy for the target; stabilizes training.
- DQN: Q-learning + NN + experience replay + target network.