Q-learning: cheatsheet

The one idea

Q-learning is the model-free control algorithm: TD’s sample bootstrap applied to Q with a max-over-actions in the target. Off-policy. Foundation of DQN.

The Q-learning update

Q(s_t, a_t)  <-  Q(s_t, a_t)  +  alpha * [ r_(t+1) + gamma * max_{a'} Q(s_(t+1), a') - Q(s_t, a_t) ]

TD error on Q:    r_(t+1) + gamma * max_{a'} Q(s_(t+1), a')  -  Q(s_t, a_t)
target:           r_(t+1) + gamma * max_{a'} Q(s_(t+1), a')    (Bellman OPTIMALITY, sampled)

Same shape as TD(0) for V, with V replaced by Q and a MAX over next-state actions in target.
Same shape as value iteration's update, with the expectation over P replaced by one sample.

Q-learning vs SARSA (off-policy vs on-policy)

Q-learning:  target uses  max_{a'} Q(s_(t+1), a')           (OFF-POLICY: best next action)
SARSA:       target uses  Q(s_(t+1), a_(t+1))               (ON-POLICY: actual next action)

Q-learning learns the value of always-greedy.
SARSA learns the value of the policy actually being followed (including its exploration).

Worked example (A/B MDP from L4, alpha=0.5, gamma=0.9, Q_0=0)

Behavior trajectory and updates (max over next-state Q in each target):
  Step 1: (A, stay) -> r=1, s'=A   target = 1 + 0.9*max(0,0)        = 1     Q(A,stay)   = 0.5
  Step 2: (A, switch) -> r=0, s'=B target = 0 + 0.9*max(0,0)        = 0     Q(A,switch) = 0
  Step 3: (B, switch) -> r=2, s'=A target = 2 + 0.9*max(0.5,0)      = 2.45  Q(B,switch) = 1.225
  Step 4: (A, stay) -> r=1, s'=A   target = 1 + 0.9*max(0.5,0)      = 1.45  Q(A,stay)   = 0.975
  Step 5: (B, stay) -> r=-1, s'=B  target = -1 + 0.9*max(0,1.225)   = 0.1025 Q(B,stay)  = 0.05125

After 5 steps:  Q(A,stay)=0.975, Q(A,switch)=0, Q(B,stay)=0.05125, Q(B,switch)=1.225
Greedy: pi(A)=stay (0.975>0), pi(B)=switch (1.225>0.05125)  ->  IS pi^*  (already!)
True Q^*: (10, 9.9, 8.9, 11);  Q values still far from Q^* but policy RANKING is correct.

Exploration is required (epsilon-greedy)

behavior_policy(s):
    with probability epsilon, take a random action
    otherwise, take argmax_a Q(s, a)

Why required:
    Q-learning convergence to Q^* requires every (s, a) visited infinitely often.
    Pure-greedy might never visit some (s, a); their Q stays at initial values forever.
Practice: start with high epsilon (explore), anneal to small (exploit).

Q-learning -> DQN bridge

DQN = Q-learning with a NEURAL NETWORK approximating Q, trained to minimize
      ( target - Q_theta(s, a) )^2  via gradient descent.

Two engineering tricks tame the DEADLY TRIAD (TD + off-policy + function approx):
  EXPERIENCE REPLAY: store transitions; sample minibatches at random
                     (breaks temporal correlations; reuses data)
  TARGET NETWORK   : slowly-updated copy of Q used in the max of the target
                     (stabilizes training)

Pitfalls to dodge

Confusing Q-learning target with SARSA target (max vs actual a_(t+1)).
Running Q-learning with epsilon = 0 (fully greedy; some (s, a) never visited; never converges).
Treating the greedy policy as the BEHAVIOR (it’s only the TARGET; behavior must explore).
Believing Q-learning + NN is stable without DQN’s tricks (deadly triad).
Reading Q values’ absolute magnitudes as the policy (the RANKING is the policy).

Words to use precisely

Q-learning target: r_(t+1) + gamma * max_{a'} Q(s_(t+1), a’).
Off-policy: target reasons about a different action than the behavior takes.
Behavior policy vs target policy: how the agent acts vs what the algorithm learns the value of.
Epsilon-greedy: exploration scheme; mostly greedy, occasionally random.
Deadly triad: TD bootstrap + off-policy + function approximation; can diverge naively.