References: DQN (the deep Q-learning engineering recipe)

Primary sources (load-bearing for this lesson)

The DQN paper

Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533. https://www.nature.com/articles/nature14236 The recipe: convolutional Q-network, replay buffer, target network. The 49-game Atari benchmark.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. NeurIPS Deep Learning Workshop 2013. https://arxiv.org/abs/1312.5602 The earlier workshop paper that introduced the architecture; the Nature paper is the canonical citation.

Double Q-learning

van Hasselt, H. (2010). Double Q-learning. NeurIPS 2010. https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html Original (tabular) double Q-learning: two value estimates, one chosen at random for each update, with the other used to evaluate the greedy action. No neural networks involved.
van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016. https://arxiv.org/abs/1509.06461 Double DQN: the online network selects the greedy next action and the existing target network evaluates it, so no second online network is needed. Reports Atari improvements that the authors attribute to reduced overestimation.
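
As a reading aid for the two entries above, here is a minimal sketch of how the Double DQN target differs from the standard DQN target for a single transition. The function names and the Q-value lists are hypothetical illustrations, not code from either paper; plain Python lists stand in for network outputs.

```python
def dqn_target(reward, done, gamma, q_target_next):
    """Standard DQN target: the target network both picks and scores the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Double DQN target: the online network picks, the target network scores."""
    if done:
        return reward
    # Greedy action according to the online network's estimates.
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Hypothetical next-state Q-values from each network (assumed for illustration).
q_online_next = [1.0, 2.5, 2.0]
q_target_next = [1.2, 1.8, 3.0]

y_dqn = dqn_target(1.0, False, 0.99, q_target_next)                   # about 3.97
y_ddqn = double_dqn_target(1.0, False, 0.99, q_online_next, q_target_next)  # about 2.782
```

Because the Double DQN target evaluates one specific action under the target network, it can never exceed the max taken over that same network, which is the sense in which it dampens overestimation.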

The Rainbow combination

Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI 2018. https://arxiv.org/abs/1710.02298 Six improvements combined on top of DQN: double Q, prioritized replay, dueling, multi-step, distributional, noisy nets. Strongest value-based Atari results for several years.

Component refinements (each a row in the Rainbow paper)

Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. ICLR 2016. https://arxiv.org/abs/1511.05952 Sample high-TD-error transitions more often instead of uniformly.
Wang, Z., Schaul, T., Hessel, M., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML 2016. https://arxiv.org/abs/1511.06581 Architectural split into V(s) and A(s, a) streams.
Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML 2017. https://arxiv.org/abs/1707.06887 C51: predict the full distribution of returns, not just the mean.
Dabney, W., Ostrovski, G., Silver, D., & Munos, R. (2018). Implicit Quantile Networks for Distributional Reinforcement Learning. ICML 2018. https://arxiv.org/abs/1806.06923 IQN: a later distributional variant that learns the quantile function of the return; it postdates Rainbow and is not one of its six components.
Fortunato, M., Azar, M. G., Piot, B., et al. (2018). Noisy Networks for Exploration. ICLR 2018. https://arxiv.org/abs/1706.10295 Parametric noise in network weights for exploration, replacing ε-greedy.
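
Two of the refinements listed above reduce to short formulas, sketched here under stated assumptions. The dueling combination uses the mean-subtracted form Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), and the replay priorities use the proportional form P(i) = p_i^alpha / sum_k p_k^alpha with p_i = |delta_i| + eps. The function names and the values of alpha and eps are illustrative choices, not a reference implementation of either paper.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization over absolute TD errors.

    alpha=0 recovers uniform sampling; larger alpha sharpens the preference
    for high-error transitions. eps keeps zero-error transitions sampleable.
    """
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical numbers for illustration only.
q_values = dueling_q_values(2.0, [1.0, -1.0, 0.0])   # [3.0, 1.0, 2.0]
probs = priority_probabilities([0.1, 2.0, 0.5])      # largest weight on index 1
```

Subtracting the mean advantage makes the V/A split identifiable: without it, a constant could be moved between the two streams without changing Q.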

The Arcade Learning Environment (benchmark infrastructure)

Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279. https://arxiv.org/abs/1207.4708 The standardized Atari benchmark that DQN used.

Order statistics (worked-example math)

David, H. A., & Nagaraja, H. N. (2003). Order Statistics (3rd ed.). Wiley. Reference for the max-of-n-iid-Gaussians moments used in the overestimation-bias derivation. Chapter 4 covers exact moments for n ≤ 5 and asymptotic results for large n.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Classical reference for deriving the closed form E[max(X, Y)] = 1/√π when X and Y are independent standard normal variables.
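
The closed form cited above can be checked numerically. The sketch below is a Monte Carlo estimate, assuming independent standard normal noise on each action's value estimate; the function name, trial counts, and seed are arbitrary choices for illustration. It also shows why the max over noisy Q-estimates is biased upward even when every true value is zero.

```python
import math
import random


def mean_of_max(n, trials=200_000, seed=0):
    """Monte Carlo estimate of E[max of n iid N(0, 1) draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials


exact_n2 = 1.0 / math.sqrt(math.pi)      # closed form for n = 2, roughly 0.564
estimate_n2 = mean_of_max(2)             # should land close to exact_n2
estimate_n4 = mean_of_max(4, trials=50_000)  # bias grows with more actions
```

Each individual estimate has mean zero, yet the expected maximum is strictly positive and increases with the number of actions, which is the overestimation effect the Double Q-learning entries address.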

Berkeley CS285 (course source for this track)

Levine, S. (2023). CS285 lecture on Deep RL with Q-Functions. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Lecture slides and video covering DQN, double Q-learning, and dueling networks, plus the deadly-triad context from Lecture 7.

Sutton & Barto reference chapters

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html
- Chapter 6 (TD learning) and Chapter 11 (off-policy with function approximation): same material as Lesson 6, foundation for understanding what DQN is patching.
- Chapter 7 (n-step bootstrapping): for the multi-step variant in Rainbow.
- Chapter 16 (Applications and Case Studies): includes a DQN-Atari case study.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.
