Skip to content

Cheatsheet: Q-learning: model-free control

Q-learning is the model-free control algorithm: TD’s sample bootstrap applied to Q with a max-over-actions in the target. Off-policy. Foundation of DQN.

Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [ r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') - Q(s_t, a_t) ]
TD error on Q: r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') - Q(s_t, a_t)
target: r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') (Bellman OPTIMALITY, sampled)
Same shape as TD(0) for V, with V replaced by Q and a MAX over next-state actions in target.
Same shape as value iteration's update, with the expectation over P replaced by one sample.

Q-learning vs SARSA (off-policy vs on-policy)

Section titled “Q-learning vs SARSA (off-policy vs on-policy)”
Q-learning: target uses max_{a'} Q(s_(t+1), a') (OFF-POLICY: best next action)
SARSA: target uses Q(s_(t+1), a_(t+1)) (ON-POLICY: actual next action)
Q-learning learns the value of always-greedy.
SARSA learns the value of the policy actually being followed (including its exploration).

Worked example (A/B MDP from L4, alpha=0.5, gamma=0.9, Q_0=0)

Section titled “Worked example (A/B MDP from L4, alpha=0.5, gamma=0.9, Q_0=0)”
Behavior trajectory and updates (max over next-state Q in each target):
Step 1: (A, stay) -> r=1, s'=A target = 1 + 0.9*max(0,0) = 1 Q(A,stay) = 0.5
Step 2: (A, switch) -> r=0, s'=B target = 0 + 0.9*max(0,0) = 0 Q(A,switch) = 0
Step 3: (B, switch) -> r=2, s'=A target = 2 + 0.9*max(0.5,0) = 2.45 Q(B,switch) = 1.225
Step 4: (A, stay) -> r=1, s'=A target = 1 + 0.9*max(0.5,0) = 1.45 Q(A,stay) = 0.975
Step 5: (B, stay) -> r=-1, s'=B target = -1 + 0.9*max(0,1.225) = 0.1025 Q(B,stay) = 0.05125
After 5 steps: Q(A,stay)=0.975, Q(A,switch)=0, Q(B,stay)=0.05125, Q(B,switch)=1.225
Greedy: pi(A)=stay (0.975>0), pi(B)=switch (1.225>0.05125) -> IS pi^* (already!)
True Q^*: (10, 9.9, 8.9, 11); Q values still far from Q^* but policy RANKING is correct.
behavior_policy(s):
with probability epsilon, take a random action
otherwise, take argmax_a Q(s, a)
Why required:
Q-learning convergence to Q^* requires every (s, a) visited infinitely often.
Pure-greedy might never visit some (s, a); their Q stays at initial values forever.
Practice: start with high epsilon (explore), anneal to small (exploit).
DQN = Q-learning with a NEURAL NETWORK approximating Q, trained to minimize
( target - Q_theta(s, a) )^2 via gradient descent.
Two engineering tricks tame the DEADLY TRIAD (TD + off-policy + function approx):
EXPERIENCE REPLAY: store transitions; sample minibatches at random
(breaks temporal correlations; reuses data)
TARGET NETWORK : slowly-updated copy of Q used in the max of the target
(stabilizes training)
  • Confusing Q-learning target with SARSA target (max vs actual a_(t+1)).
  • Running Q-learning with epsilon = 0 (fully greedy; some (s, a) never visited; never converges).
  • Treating the greedy policy as the BEHAVIOR (it’s only the TARGET; behavior must explore).
  • Believing Q-learning + NN is stable without DQN’s tricks (deadly triad).
  • Reading Q values’ absolute magnitudes as the policy (the RANKING is the policy).
  • Q-learning target: r_(t+1) + gamma * max_{a'} Q(s_(t+1), a’).
  • Off-policy: target reasons about a different action than the behavior takes.
  • Behavior policy vs target policy: how the agent acts vs what the algorithm learns the value of.
  • Epsilon-greedy: exploration scheme; mostly greedy, occasionally random.
  • Deadly triad: TD bootstrap + off-policy + function approximation; can diverge naively.