Cheatsheet: Value-based RL (Q-learning, the deadly triad)
Bellman optimality equation
Section titled “Bellman optimality equation”Q*(s, a) = R(s, a) + γ · E_{s' ~ P(·|s, a)} [ max_{a'} Q*(s', a') ]The max makes this non-linear in Q*. No closed-form solution. Must iterate.
Optimal policy: π*(s) = argmax_a Q*(s, a). No separate policy network needed.
Two algorithms
Section titled “Two algorithms”| Algorithm | When | Requires | Sample form |
|---|---|---|---|
| Q-iteration | You know the MDP (P and R) | Full dynamics model | Q_{k+1}(s,a) ← R(s,a) + γ · Σ_{s'} P(s'│s,a) · max_{a'} Q_k(s', a') |
| Q-learning | Model-free (only observe transitions) | Just samples (s, a, r, s') | Q(s,a) ← Q(s,a) + α · (r + γ · max_{a'} Q(s', a') - Q(s,a)) |
Q-iteration converges geometrically (γ-contraction in sup-norm): ||Q_k - Q*||_∞ ≤ γ^k · ||Q_0 - Q*||_∞.
Q-learning converges with probability one in the tabular setting under appropriate step-size schedules (Watkins 1989).
On-policy vs off-policy
Section titled “On-policy vs off-policy”| Property | On-policy (REINFORCE, A2C, PPO) | Off-policy (Q-learning, DQN, SAC) |
|---|---|---|
| Data must come from | Current policy π_θ | Any policy |
| Update target uses | Same π_θ | Different policy (greedy / max) |
| Replay buffer | Hard (needs importance sampling) | Natural fit |
| Exploration | Built into stochastic policy | Add separately (ε-greedy, entropy bonus) |
| Sample efficiency | Lower (fresh data per update) | Higher (reuse old data) |
Q-learning is off-policy because the target uses max_{a'}, not the action distribution of whatever policy generated the data.
The deadly triad (Sutton)
Section titled “The deadly triad (Sutton)”Combine all three → can diverge:
- Function approximation:
Q_θgeneralizes across states; one weight update shifts manyQvalues. - Bootstrapping: target is
r + γ · max_{a'} Q_θ(s', a'), uses the current network. - Off-policy data: training distribution differs from greedy-policy distribution.
Any two of the three: safe. All three: trouble.
DQN engineering tames each leg:
- Replay buffer → decorrelates samples (closer to i.i.d.).
- Target network
Q_{θ⁻}frozen forNsteps → bootstrapped target stops chasing itself. - Double Q-learning → action selection and evaluation use different networks → fixes
maxoverestimation.
All three covered in Lesson 7.
Two key biases / variance issues
Section titled “Two key biases / variance issues”| Issue | Source | Fix |
|---|---|---|
max overestimation | E[max_a X_a] ≥ max_a E[X_a] by Jensen; noisy Q_θ ⇒ targets are systematically too high | Double Q-learning (decouple selection/evaluation) |
| Bootstrapped target instability | Target r + γ · max_{a'} Q_θ(s', a') moves as θ updates | Target network (freeze θ⁻ for many steps) |
Worked example: Q-iteration on the L3 two-state MDP
Section titled “Worked example: Q-iteration on the L3 two-state MDP”Setup: s0 → s1 reward 1, s1 → s0 reward 0, single action, γ = 0.9.
Analytic answer (from Lesson 3): V*(s0) = 1/(1−γ²) = 5.263, V*(s1) ≈ 4.737. Single action ⇒ Q* = V*.
Q-iteration with Q_0 = 0:
| k | Q_k(s0) | Q_k(s1) |
|---|---|---|
| 1 | 1.000 | 0.000 |
| 2 | 1.000 | 0.900 |
| 3 | 1.810 | 0.900 |
| 4 | 1.810 | 1.629 |
| 5 | 2.466 | 1.629 |
| ∞ | 5.263 | 4.737 |
Errors shrink by γ² = 0.81 per pair of iterations (2-step loop). Matches the closed-form geometric sum exactly. Dual-path validation: contraction algorithm ↔ analytic geometric series.
When to pick the Q branch (decision rubric)
Section titled “When to pick the Q branch (decision rubric)”| Use Q-learning when | Use policy-gradient (Phase 1) when |
|---|---|
Discrete action space (the max is cheap) | Continuous action space |
| Want deterministic optimal policy | Need stochastic policy (multi-agent, partial obs) |
| Have a replay buffer or demonstrations | Can afford fresh on-policy data each update |
| Atari games, board games, gridworlds | Robotics, continuous control, LLM RLHF |
Modern algorithms blend both (SAC = actor-critic + Q-network; PPO = policy gradient + V-critic).
Common pitfalls
Section titled “Common pitfalls”- Bellman optimality ≠ Bellman equation.
Q*usesmax;Q^πusesE_{a' ~ π}. - Calling Q-learning on-policy because data uses ε-greedy. ε-greedy is exploration, not the target.
- Assuming Q-iteration’s convergence proof transfers to deep Q-networks. It does not (deadly triad).
- Forgetting that
argmaxover continuous actions is itself an optimization problem. That’s why DDPG introduces a deterministic actor (SAC uses a stochastic max-entropy actor; both avoid theargmaxby training an actor directly).
What you should remember
Section titled “What you should remember”Q*satisfies Bellman optimality:R + γ E [max Q*]. Non-linear, iterative.- Tabular Q-iteration: contracts at rate
γ, converges toQ*. - Q-learning: model-free sample version, off-policy by construction.
- Deep Q-learning: needs replay buffer + target network + double Q to be stable.
- Optimal policy is
argmax_a Q*(s, a), no separate policy parameterization.