References: DQN (the deep Q-learning engineering recipe)
Primary sources (load-bearing for this lesson)
The DQN paper
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533. https://www.nature.com/articles/nature14236 The recipe: convolutional Q-network, replay buffer, target network. The 49-game Atari benchmark.
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. NeurIPS Deep Learning Workshop 2013. https://arxiv.org/abs/1312.5602 The earlier workshop paper that introduced the architecture; the Nature paper is the canonical citation.
Double Q-learning
- van Hasselt, H. (2010). Double Q-learning. NeurIPS 2010. https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html The original double Q-learning algorithm: two independent value estimators (tabular in the paper), with one chosen at random for each update and evaluated using the other.
- van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016. https://arxiv.org/abs/1509.06461 Double DQN: reuses the existing target network instead of a second online network. Empirical Atari improvement attributable specifically to the overestimation fix.
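The difference between the two targets is small enough to sketch directly. The following is a minimal illustration, not code from either paper: the function names and the Q-value lists are hypothetical, and plain Python lists stand in for network outputs at the next state.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: the online network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical next-state Q-values for three actions (illustrative numbers only).
q_online_next = [1.0, 2.5, 2.0]
q_target_next = [1.2, 1.8, 3.0]

y_dqn = dqn_target(1.0, 0.99, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.99, q_online_next, q_target_next, done=False)
print(y_dqn, y_ddqn)
```

With these made-up values the Double DQN target is lower, because the target network's value for the online network's preferred action can never exceed the target network's own maximum.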
The Rainbow combination
- Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI 2018. https://arxiv.org/abs/1710.02298 Six improvements combined on top of DQN: double Q, prioritized replay, dueling, multi-step, distributional, noisy nets. Strongest value-based Atari results for several years.
Component refinements (each a row in the Rainbow paper)
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. ICLR 2016. https://arxiv.org/abs/1511.05952 Sample high-TD-error transitions more often instead of uniformly.
- Wang, Z., Schaul, T., Hessel, M., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML 2016. https://arxiv.org/abs/1511.06581 Architectural split into V(s) and A(s, a) streams.
- Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML 2017. https://arxiv.org/abs/1707.06887 C51: predict the full distribution of returns, not just the mean.
- Dabney, W., Ostrovski, G., Silver, D., & Munos, R. (2018). Implicit Quantile Networks for Distributional Reinforcement Learning. ICML 2018. https://arxiv.org/abs/1806.06923 IQN: learns an implicit quantile function of the return distribution, a later variant of distributional Q-learning (Rainbow itself uses C51).
- Fortunato, M., Azar, M. G., Piot, B., et al. (2018). Noisy Networks for Exploration. ICLR 2018. https://arxiv.org/abs/1706.10295 Parametric noise in network weights for exploration, replacing ε-greedy.
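As a pointer to what the dueling split means in practice, here is a minimal sketch of recombining the two streams into Q-values using the mean-subtracted aggregation described in the dueling paper. The function name and the numbers are hypothetical, and plain floats stand in for network outputs.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (adv - mean_advantage) for adv in advantages]


# Hypothetical stream outputs for one state with three actions.
q_values = dueling_q_values(2.0, [1.0, -1.0, 0.0])
print(q_values)
```

Subtracting the mean advantage makes the decomposition identifiable: the Q-values average back to V(s), so the two streams cannot trade a constant offset between them.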
The Arcade Learning Environment (benchmark infrastructure)
- Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279. https://arxiv.org/abs/1207.4708 The standardized Atari benchmark that DQN used.
Order statistics (worked-example math)
- David, H. A., & Nagaraja, H. N. (2003). Order Statistics (3rd ed.). Wiley. Reference for the max-of-n-iid-Gaussians moments used in the overestimation-bias derivation. Chapter 4 covers exact moments for n ≤ 5 and asymptotic results for large n.
- Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. The classical reference for deriving the closed form E[max(X, Y)] = 1/√π for two iid standard normals X and Y.
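The 1/√π value is easy to sanity-check numerically. The sketch below is a Monte Carlo estimate under the assumption that X and Y are independent standard normals; the function name, sample count, and seed are arbitrary choices for illustration.

```python
import math
import random


def mean_max_of_gaussians(n, samples, seed=0):
    """Monte Carlo estimate of E[max of n iid standard normal draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / samples


exact_n2 = 1.0 / math.sqrt(math.pi)  # closed form for n = 2, about 0.5642
estimate_n2 = mean_max_of_gaussians(n=2, samples=200_000)
print(f"exact: {exact_n2:.4f}  estimate: {estimate_n2:.4f}")
```

Each draw has mean zero, yet the maximum of two has a positive mean. That is the same mechanism behind the overestimation bias: taking a max over noisy, unbiased Q-value estimates yields an upward-biased target.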
Berkeley CS285 (course source for this track)
- Levine, S. (2023). CS285 lecture on Deep RL with Q-Functions. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Lecture slides + video covering DQN, double Q, dueling, and the deadly-triad context from Lecture 7.
Sutton & Barto reference chapters
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html
- Chapter 6 (TD learning) and Chapter 11 (off-policy with function approximation): same material as Lesson 6, foundation for understanding what DQN is patching.
- Chapter 7 (n-step bootstrapping): for the multi-step variant in Rainbow.
- Chapter 16 (Applications and Case Studies): includes a DQN-Atari case study.
Source material
Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this source. We do not reproduce or transcribe it; we cite it as a recommended companion. All rights to the original material remain with its authors.