Cheatsheet: Q-learning: model-free control
The one idea
Section titled “The one idea”Q-learning is the model-free control algorithm: TD’s sample bootstrap applied to Q with a max-over-actions in the target. Off-policy. Foundation of DQN.
The Q-learning update
Section titled “The Q-learning update”Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [ r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') - Q(s_t, a_t) ]
TD error on Q: r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') - Q(s_t, a_t)target: r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') (Bellman OPTIMALITY, sampled)
Same shape as TD(0) for V, with V replaced by Q and a MAX over next-state actions in target.Same shape as value iteration's update, with the expectation over P replaced by one sample.Q-learning vs SARSA (off-policy vs on-policy)
Section titled “Q-learning vs SARSA (off-policy vs on-policy)”Q-learning: target uses max_{a'} Q(s_(t+1), a') (OFF-POLICY: best next action)SARSA: target uses Q(s_(t+1), a_(t+1)) (ON-POLICY: actual next action)
Q-learning learns the value of always-greedy.SARSA learns the value of the policy actually being followed (including its exploration).Worked example (A/B MDP from L4, alpha=0.5, gamma=0.9, Q_0=0)
Section titled “Worked example (A/B MDP from L4, alpha=0.5, gamma=0.9, Q_0=0)”Behavior trajectory and updates (max over next-state Q in each target): Step 1: (A, stay) -> r=1, s'=A target = 1 + 0.9*max(0,0) = 1 Q(A,stay) = 0.5 Step 2: (A, switch) -> r=0, s'=B target = 0 + 0.9*max(0,0) = 0 Q(A,switch) = 0 Step 3: (B, switch) -> r=2, s'=A target = 2 + 0.9*max(0.5,0) = 2.45 Q(B,switch) = 1.225 Step 4: (A, stay) -> r=1, s'=A target = 1 + 0.9*max(0.5,0) = 1.45 Q(A,stay) = 0.975 Step 5: (B, stay) -> r=-1, s'=B target = -1 + 0.9*max(0,1.225) = 0.1025 Q(B,stay) = 0.05125
After 5 steps: Q(A,stay)=0.975, Q(A,switch)=0, Q(B,stay)=0.05125, Q(B,switch)=1.225Greedy: pi(A)=stay (0.975>0), pi(B)=switch (1.225>0.05125) -> IS pi^* (already!)True Q^*: (10, 9.9, 8.9, 11); Q values still far from Q^* but policy RANKING is correct.Exploration is required (epsilon-greedy)
Section titled “Exploration is required (epsilon-greedy)”behavior_policy(s): with probability epsilon, take a random action otherwise, take argmax_a Q(s, a)
Why required: Q-learning convergence to Q^* requires every (s, a) visited infinitely often. Pure-greedy might never visit some (s, a); their Q stays at initial values forever.Practice: start with high epsilon (explore), anneal to small (exploit).Q-learning -> DQN bridge
Section titled “Q-learning -> DQN bridge”DQN = Q-learning with a NEURAL NETWORK approximating Q, trained to minimize ( target - Q_theta(s, a) )^2 via gradient descent.
Two engineering tricks tame the DEADLY TRIAD (TD + off-policy + function approx): EXPERIENCE REPLAY: store transitions; sample minibatches at random (breaks temporal correlations; reuses data) TARGET NETWORK : slowly-updated copy of Q used in the max of the target (stabilizes training)Pitfalls to dodge
Section titled “Pitfalls to dodge”- Confusing Q-learning target with SARSA target (max vs actual a_(t+1)).
- Running Q-learning with epsilon = 0 (fully greedy; some (s, a) never visited; never converges).
- Treating the greedy policy as the BEHAVIOR (it’s only the TARGET; behavior must explore).
- Believing Q-learning + NN is stable without DQN’s tricks (deadly triad).
- Reading Q values’ absolute magnitudes as the policy (the RANKING is the policy).
Words to use precisely
Section titled “Words to use precisely”- Q-learning target: r_(t+1) + gamma * max_
{a'}Q(s_(t+1), a’). - Off-policy: target reasons about a different action than the behavior takes.
- Behavior policy vs target policy: how the agent acts vs what the algorithm learns the value of.
- Epsilon-greedy: exploration scheme; mostly greedy, occasionally random.
- Deadly triad: TD bootstrap + off-policy + function approximation; can diverge naively.