References: Value-based RL (Q-learning, the deadly triad)

Primary sources (load-bearing for this lesson)

  • Watkins, C. J. C. H. (1989). Learning from delayed rewards (Ph.D. dissertation, King’s College, Cambridge). The original Q-learning algorithm. The thesis is hard to find online; the published convergence proof is:
  • Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292. https://link.springer.com/article/10.1007/BF00992698

Bellman optimality, value iteration, contraction-mapping convergence

  • Bellman, R. (1957). Dynamic Programming. Princeton University Press. The origin of the equation that bears his name. Available in many libraries; reprint by Dover (2003).
  • Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific. Chapter 2 has the cleanest treatment of contraction-mapping convergence of value iteration.
  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html
    • Chapter 3 (Finite MDPs): Bellman equations.
    • Chapter 4 (Dynamic Programming): policy iteration and value iteration in the tabular setting.
    • Chapter 6 (Temporal-Difference Learning): TD(0), Sarsa, and the original Q-learning treatment.
  • Sutton & Barto (2018), Chapter 11. “Off-policy Methods with Approximation.” Section 11.2 works through Baird’s counter-example explicitly; Section 11.3 names the deadly triad.
  • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674-690. The classical analysis: proves that linear TD converges under on-policy sampling and shows that it can diverge off-policy.
  • Baird, L. (1995). Residual algorithms: reinforcement learning with function approximation. Proceedings of ICML 1995, 30-37. The canonical counter-example MDP where linear off-policy TD diverges with weights growing unboundedly. https://www.sciencedirect.com/science/article/pii/B978155860377650013X (proceedings collection).

Deep Q-networks (forward reference, lesson 7)

  • Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533. https://www.nature.com/articles/nature14236 The DQN paper, with replay buffer and target network as the engineering that made deep Q-learning work on Atari.
  • van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016. https://arxiv.org/abs/1509.06461 The fix for the max overestimation bias.

Berkeley CS285 (course source for this track)

  • Levine, S. (2023). CS285 lecture on Value Function Methods. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Course homepage with slides, video lectures, and assignments. The lecture covers Q-iteration, Q-learning, the deadly triad, and motivates DQN.
  • Riedmiller, M. (2005). Neural Fitted Q Iteration: First Experiences with a Data Efficient Neural Reinforcement Learning Method. ECML 2005. https://link.springer.com/chapter/10.1007/11564096_32 The pre-DQN attempt to combine Q-iteration with neural networks; uses a batch (fitted) approach that side-steps some triad issues.
  • Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 185-202. https://link.springer.com/article/10.1007/BF00993306 Tabular Q-learning convergence with asynchronous updates, formalizing the Watkins result.
  • Maei, H. R., Szepesvári, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. Proceedings of ICML 2010. The gradient-TD family of algorithms, designed to be safe even with all three triad legs active. Theoretically interesting; not standard in deep RL practice because the engineering tricks in DQN (replay buffer, target network) mitigate the instability in a different way.
Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.