References: Value-based RL (Q-learning, the deadly triad)
Primary sources (load-bearing for this lesson)
The Q-learning algorithm
- Watkins, C. J. C. H. (1989). Learning from delayed rewards (Ph.D. dissertation, King’s College, Cambridge). The original Q-learning algorithm. The thesis is hard to find online; the published convergence proof is:
- Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292. https://link.springer.com/article/10.1007/BF00992698
Bellman optimality, value iteration, contraction-mapping convergence
- Bellman, R. (1957). Dynamic Programming. Princeton University Press. The origin of the equation that bears his name. Available in many libraries; reprint by Dover (2003).
- Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Athena Scientific. Chapter 2 has the cleanest treatment of contraction-mapping convergence of value iteration.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html
  - Chapter 3 (Finite MDPs): Bellman equations.
  - Chapter 4 (Dynamic Programming): policy iteration and value iteration in the tabular setting.
  - Chapter 6 (Temporal-Difference Learning): TD(0), Sarsa, and the textbook treatment of Q-learning.
The deadly triad
- Sutton & Barto (2018), Chapter 11, “Off-policy Methods with Approximation.” Section 11.2 works through Baird’s counter-example explicitly; Section 11.3 names the deadly triad.
- Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674-690. The classical analysis: proves that linear TD converges under on-policy sampling, and shows by example that it can diverge when states are sampled off-policy or when the approximator is nonlinear.
- Baird, L. (1995). Residual algorithms: reinforcement learning with function approximation. Proceedings of ICML 1995, 30-37. The canonical counter-example MDP where linear off-policy TD diverges with weights growing unboundedly. https://www.sciencedirect.com/science/article/pii/B978155860377650013X (proceedings collection).
Deep Q-networks (forward reference, lesson 7)
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533. https://www.nature.com/articles/nature14236 The DQN paper, with replay buffer and target network as the engineering that made deep Q-learning work on Atari.
- van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016. https://arxiv.org/abs/1509.06461 The fix for the `max` overestimation bias.
Berkeley CS285 (course source for this track)
- Levine, S. (2023). CS285 lecture on Value Function Methods. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Course homepage with slides, video lectures, and assignments. The lecture covers Q-iteration, Q-learning, the deadly triad, and motivates DQN.
Secondary / extension readings
- Riedmiller, M. (2005). Neural Fitted Q Iteration: First Experiences with a Data Efficient Neural Reinforcement Learning Method. ECML 2005. https://link.springer.com/chapter/10.1007/11564096_32 The pre-DQN attempt to combine Q-iteration with neural networks; uses a batch (fitted) approach that side-steps some triad issues.
- Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 185-202. https://link.springer.com/article/10.1007/BF00993306 Tabular Q-learning convergence with asynchronous updates, formalizing the Watkins result.
- Maei, H. R., Szepesvári, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. Proceedings of ICML 2010. The gradient-TD family of algorithms, designed to remain convergent even with all three legs of the triad active. Theoretically interesting, but not standard in deep RL practice, where the engineering tricks in DQN (replay buffer, target network) mitigated the instability by other means.
Source material
Source curriculum (structural mirror, cited as further study):
- UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)

Clawdemy's lessons are original prose that follows the pedagogical arc of this source. We do not reproduce or transcribe it; we cite it as a recommended companion. All rights to the original material remain with its authors.