Cheatsheet: Function approximation and deep RL

Replace the table with a parameterized function Q_theta. Minimize the squared TD error against the Bellman target with a semi-gradient step. Deal with the deadly triad (TD + off-policy + function approximation) via DQN’s experience replay and target network.

Loss(theta) = E[ ( target - Q_theta(s, a) )^2 ],
target = r + gamma * max_{a'} Q_theta(s', a') (Q-learning)
Semi-gradient update (target treated as fixed when computing gradient):
delta = target - Q_theta(s, a)
theta <- theta + eta * delta * grad_theta Q_theta(s, a)
Worked example: linear Q_theta(x) = theta_0 + theta_1 * x, starting from theta = (0, 0).
Observe (x = 2, r = 1, x' = 3), gamma = 0.9, eta = 0.1.
Q(2) = 0, Q(3) = 0.
target = 1 + 0.9 * 0 = 1.
delta = 1 - 0 = 1.
grad = (1, x) = (1, 2).
theta <- (0 + 0.1*1*1, 0 + 0.1*1*2) = (0.1, 0.2).
After update: Q(2) = 0.5, Q(3) = 0.7, Q(0) = 0.1.
=> ONE transition moves Q at EVERY x via the two shared parameters.
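The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the same single semi-gradient step, not a reference implementation; the helper names (`q_value`, `grad_q`, `semi_gradient_step`) are illustrative and not taken from any library. There is no action argument because the worked example does not use one.

```python
def q_value(theta, x):
    """Linear Q: theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x


def grad_q(x):
    """Gradient of the linear Q with respect to (theta_0, theta_1)."""
    return (1.0, x)


def semi_gradient_step(theta, x, r, x_next, gamma, eta):
    """One semi-gradient TD step; the target is treated as a constant."""
    target = r + gamma * q_value(theta, x_next)
    delta = target - q_value(theta, x)
    return tuple(t + eta * delta * g for t, g in zip(theta, grad_q(x)))


# Same numbers as the worked example: theta = (0, 0), (x=2, r=1, x'=3).
theta = semi_gradient_step((0.0, 0.0), x=2.0, r=1.0, x_next=3.0,
                           gamma=0.9, eta=0.1)
print("theta:", theta)  # approximately (0.1, 0.2)
for x in (2.0, 3.0, 0.0):
    # Q moves at every x, including points that were never observed.
    print("Q(%g) = %.3f" % (x, q_value(theta, x)))
```

Up to floating-point rounding, the printed values should match the hand calculation: Q(2) = 0.5, Q(3) = 0.7, Q(0) = 0.1.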
Deadly triad: ANY TWO of the following are usually fine. ALL THREE together can diverge:
TD bootstrap -- target uses an estimate Q_theta(s')
+ off-policy -- target uses max over actions, not what was taken
+ function approximation -- updates change Q everywhere via shared parameters
=> Updates can chase moving targets that they themselves move.
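A small numerical sketch can make the divergence concrete. The setup below is an assumption chosen for illustration, in the style of the well-known two-state example: one shared weight `w`, a first state with value estimate `w`, a second state with value estimate `2 * w`, reward 0, and updates applied only to the transition from the first state to the second (an off-policy update distribution). It uses state values rather than a max over actions, so it is a simplified stand-in for the triad, not the DQN setting itself.

```python
def run(gamma, eta=0.1, steps=100, w=1.0):
    """Repeat the semi-gradient TD update on a single transition.

    Features: first state -> 1, second state -> 2, so V(s1) = w, V(s2) = 2w.
    The true values are all 0 (reward is always 0), i.e. w = 0 is correct.
    """
    for _ in range(steps):
        target = 0.0 + gamma * (2.0 * w)  # bootstrapped from the same w
        delta = target - 1.0 * w
        w += eta * delta * 1.0            # gradient of V(s1) w.r.t. w is 1
    return w


# Each step multiplies w by (1 + eta * (2 * gamma - 1)).
w_large_gamma = run(gamma=0.9)  # factor 1.08 per step: grows without bound
w_small_gamma = run(gamma=0.4)  # factor 0.98 per step: shrinks toward 0
print("gamma=0.9 ->", w_large_gamma)
print("gamma=0.4 ->", w_small_gamma)
```

In this sketch the weight grows whenever gamma exceeds 0.5, because every update raises the bootstrapped target by more than it raises the estimate being corrected.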
| Fix | What it does | Why it helps |
| --- | --- | --- |
| Experience replay | Store transitions in a buffer; sample random minibatches | Decorrelation (closer to i.i.d. for SGD) + data reuse (each transition contributes to many updates) |
| Target network | Slowly-updated frozen copy theta-minus used in the target | Live Q is not chasing its own tail; stable regression goal for a while, then sync |
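The two fixes can be sketched together in plain Python. Everything here is a hypothetical toy setup rather than the DQN implementation: a linear Q with one `(bias, weight)` pair per action stands in for the neural network, the transitions are random made-up data, and `ReplayBuffer`, `dqn_style_update`, and `SYNC_EVERY` are illustrative names. The minibatch is applied as sequential per-sample steps rather than one averaged gradient, which is a simplification.

```python
import math
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def q_value(theta, x, a):
    bias, weight = theta[a]
    return bias + weight * x


def dqn_style_update(theta, theta_minus, batch, gamma, eta):
    """Semi-gradient steps whose targets use the frozen copy theta_minus."""
    for x, a, r, x_next, done in batch:
        best_next = max(q_value(theta_minus, x_next, b)
                        for b in range(len(theta_minus)))
        target = r if done else r + gamma * best_next
        delta = target - q_value(theta, x, a)
        bias, weight = theta[a]
        theta[a] = (bias + eta * delta, weight + eta * delta * x)


random.seed(0)
buffer = ReplayBuffer(capacity=500)
theta = [(0.0, 0.0), (0.0, 0.0)]  # live parameters, two actions
theta_minus = list(theta)         # frozen snapshot used in the target
SYNC_EVERY = 50

for step in range(1, 1001):
    # Made-up transition: (x, a, r, x_next, done), reward equals the action.
    a = random.randrange(2)
    buffer.add((random.random(), a, float(a), random.random(),
                random.random() < 0.1))
    if len(buffer) >= 32:
        dqn_style_update(theta, theta_minus, buffer.sample(32),
                         gamma=0.9, eta=0.05)
    if step % SYNC_EVERY == 0:
        theta_minus = list(theta)  # periodic sync of the target copy

print(len(buffer), all(math.isfinite(v) for pair in theta for v in pair))
```

Between syncs the regression targets depend only on `theta_minus`, so the live parameters are fit against a fixed goal; the sync interval and buffer size here are arbitrary choices for the sketch.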

DQN recipe = Q-learning + (C)NN + experience replay + target network. The human-level Atari results of Mnih et al. (2015) used this combination.

Step-size caveat (function approximation makes this delicate)

grad_theta Q includes the feature value (linear Q: dQ/d theta_1 = x).
-> Large features amplify updates.
-> Too-large eta can overshoot the target dramatically.
=> Feature scaling and careful eta selection matter.
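The effect of feature magnitude can be seen by repeating one update at different values of x. The sketch below assumes the same linear Q as the worked example, starts from theta = (0, 0), and uses a fixed target of 1; the function name `q_after_one_step` is illustrative. For this model a single step changes Q(x) by eta * delta * (1 + x^2), so the same eta that is safe at small x overshoots badly at large x.

```python
def q_after_one_step(x, target=1.0, eta=0.1):
    """Value of Q(x) after one semi-gradient step from theta = (0, 0)."""
    theta = (0.0, 0.0)
    delta = target - (theta[0] + theta[1] * x)
    # The gradient is (1, x), so the feature value scales the update.
    theta = (theta[0] + eta * delta * 1.0, theta[1] + eta * delta * x)
    return theta[0] + theta[1] * x


for x in (0.5, 2.0, 10.0):
    print("x = %4.1f -> Q(x) after one step = %.3f (target 1.0)"
          % (x, q_after_one_step(x)))

# Rescaling the large feature into a unit range tames the same step size.
print("x = 10 scaled to 1.0 -> %.3f" % q_after_one_step(10.0 / 10.0))
```

With x = 10 the estimate jumps from 0 to roughly 10.1, about ten times past the target, while the rescaled feature moves it to roughly 0.2.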
Common pitfalls:
  • Believing the Bellman recursion changes with function approximation (it doesn’t).
  • Using the full residual gradient instead of the semi-gradient (a true gradient of the loss, so theoretically nicer, but typically slower and worse in practice).
  • Running deep Q-learning without target net or experience replay (deadly triad bites).
  • Treating function approximation as just a “scaling” issue (it introduces real algorithmic challenges).
  • Confusing the target network with a separate behavior or target policy (it’s a copy of the same Q network with frozen weights, synced periodically).
Key terms:
  • Q_theta(s, a): parameterized action-value function (linear features or NN).
  • Semi-gradient: gradient of the loss treating the target as fixed.
  • Deadly triad: TD bootstrap + off-policy + function approximation; potential divergence.
  • Experience replay: buffer + random minibatches; decorrelation + data reuse.
  • Target network theta-minus: frozen Q copy for the target; stabilizes training.
  • DQN: Q-learning + NN + experience replay + target network.
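To make the semi-gradient versus residual-gradient distinction concrete, the sketch below applies both updates to the worked example's transition (x = 2, r = 1, x' = 3, gamma = 0.9, eta = 0.1, theta = (0, 0)). The function names are illustrative, and the factor of 2 from differentiating the squared error is assumed to be absorbed into eta in both cases. The only difference is whether the target's dependence on theta is differentiated.

```python
def q_value(theta, x):
    return theta[0] + theta[1] * x


def grad_q(x):
    return (1.0, x)


def td_error(theta, x, r, x_next, gamma):
    return r + gamma * q_value(theta, x_next) - q_value(theta, x)


def semi_gradient(theta, x, r, x_next, gamma, eta):
    """Target treated as fixed: direction is grad Q(x)."""
    delta = td_error(theta, x, r, x_next, gamma)
    return tuple(t + eta * delta * g for t, g in zip(theta, grad_q(x)))


def residual_gradient(theta, x, r, x_next, gamma, eta):
    """Target differentiated too: direction is grad Q(x) - gamma * grad Q(x')."""
    delta = td_error(theta, x, r, x_next, gamma)
    direction = tuple(g - gamma * g_next
                      for g, g_next in zip(grad_q(x), grad_q(x_next)))
    return tuple(t + eta * delta * d for t, d in zip(theta, direction))


args = dict(x=2.0, r=1.0, x_next=3.0, gamma=0.9, eta=0.1)
theta_semi = semi_gradient((0.0, 0.0), **args)
theta_res = residual_gradient((0.0, 0.0), **args)
print("semi-gradient:    ", theta_semi)  # approximately (0.1, 0.2)
print("residual gradient:", theta_res)   # approximately (0.01, -0.07)
```

On this transition the residual-gradient step shrinks the TD error partly by lowering Q(3), the bootstrapped value, rather than only raising Q(2) toward the target.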