References: Exploration

Primary source

Levine, S. (2023). Berkeley CS285, Deep Reinforcement Learning, Lectures 19 and 23. http://rail.eecs.berkeley.edu/deeprlcourse/. Lecture videos at https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps.

Random exploration

Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level Control Through Deep Reinforcement Learning. Nature, 518(7540), 529-533. https://www.nature.com/articles/nature14236 The DQN paper; epsilon-greedy is the default exploration strategy. Reference for the standard random-exploration baseline.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018. https://arxiv.org/abs/1801.01290 SAC; the maximum-entropy objective keeps the policy stochastic throughout training, which acts as a mild exploration strategy.

Optimism and posterior-sampling exploration

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47, 235-256. https://link.springer.com/article/10.1023/A:1013689704352 The UCB algorithm. Foundational result on optimism in the face of uncertainty in the bandit setting.
Jaksch, T., Ortner, R., & Auer, P. (2010). Near-optimal Regret Bounds for Reinforcement Learning. Journal of Machine Learning Research, 11, 1563-1600. https://jmlr.org/papers/v11/jaksch10a.html UCRL2; extends UCB-style optimism to the full MDP setting with provable regret bounds.
Osband, I., Blundell, C., Pritzel, A., & Van Roy, B. (2016). Deep Exploration via Bootstrapped DQN. NeurIPS 2016. https://arxiv.org/abs/1602.04621 Bootstrapped DQN; ensemble-based posterior approximation for deep-RL exploration.
Fortunato, M., Azar, M. G., Piot, B., et al. (2018). Noisy Networks for Exploration. ICLR 2018. https://arxiv.org/abs/1706.10295 NoisyNets; parameter-level noise as an exploration mechanism for deep RL.
Osband, I., Van Roy, B., & Wen, Z. (2016). Generalization and Exploration via Randomized Value Functions. ICML 2016. https://arxiv.org/abs/1402.0635 RLSVI; randomized least-squares value iteration as a posterior-sampling exploration method. First posted to arXiv in 2014.

Intrinsic motivation

Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven Exploration by Self-supervised Prediction. ICML 2017. https://arxiv.org/abs/1705.05363 The ICM paper. Forward-model prediction error, computed in a learned feature space, as intrinsic curiosity reward; evaluated on ViZDoom and Super Mario Bros.
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by Random Network Distillation. ICLR 2019. https://arxiv.org/abs/1810.12894 The RND paper. Prediction error against a fixed, randomly initialized target network as intrinsic reward; reported better-than-average-human scores on Montezuma’s Revenge without demonstrations.
Tang, H., Houthooft, R., Foote, D., et al. (2017). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning. NeurIPS 2017. https://arxiv.org/abs/1611.04717 Hash-based pseudo-counts for high-dimensional state spaces.
Bellemare, M. G., Srinivasan, S., Ostrovski, G., et al. (2016). Unifying Count-Based Exploration and Intrinsic Motivation. NeurIPS 2016. https://arxiv.org/abs/1606.01868 Density-model-based pseudo-counts; an earlier path to count-based bonuses in deep RL.
Burda, Y., Edwards, H., Pathak, D., et al. (2019). Large-scale Study of Curiosity-driven Learning. ICLR 2019. https://arxiv.org/abs/1808.04355 Companion study to ICM; characterizes when curiosity-driven exploration helps and when it hurts.

Benchmarks and hard-exploration tasks

Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade Learning Environment: An Evaluation Platform for General Agents. JAIR, 47, 253-279. https://arxiv.org/abs/1207.4708 The Atari benchmark suite. Montezuma’s Revenge, Pitfall, and Private Eye are the canonical hard-exploration tasks.
Salimans, T., & Chen, R. (2018). Learning Montezuma’s Revenge from a Single Demonstration. https://openai.com/blog/learning-montezumas-revenge-from-a-single-demonstration/ Demonstration-bootstrap approach to hard exploration; complementary to RND.

Robotics and demonstrations

Kalashnikov, D., Irpan, A., Pastor, P., et al. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. CoRL 2018. https://arxiv.org/abs/1806.10293 Large-scale off-policy Q-learning for vision-based grasping, with initial data collected by a scripted policy rather than by undirected exploration; a practical answer in robotics where pure exploration cannot bootstrap manipulation priors.
Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., & Abbeel, P. (2018). Overcoming Exploration in Reinforcement Learning with Demonstrations. ICRA 2018. https://arxiv.org/abs/1709.10089 Demonstrations plus standard RL as an exploration shortcut for sparse-reward robotic tasks.

Survey

Amin, S., Gomrokchi, M., Satija, H., van Hoof, H., & Precup, D. (2021). A Survey of Exploration Methods in Reinforcement Learning. arXiv:2109.00157. https://arxiv.org/abs/2109.00157 Recent survey covering the three families and their deep-RL incarnations.

Note on the source mix

The CS285 lectures cover the foundations (random and optimism-based exploration) as well as later exploration material, including model organisms of hard exploration. For individual methods, the lesson cites the canonical primary papers (Osband, Pathak, Burda) rather than re-derivations.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.