References: Exploration

  • Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level Control Through Deep Reinforcement Learning. Nature, 518(7540), 529-533. https://www.nature.com/articles/nature14236 The DQN paper; epsilon-greedy is the default exploration strategy. Reference for the standard random-exploration baseline.
  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning. ICML 2018. https://arxiv.org/abs/1801.01290 SAC; the maximum-entropy objective keeps the policy stochastic throughout training, which acts as a mild built-in exploration strategy.
  • Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47, 235-256. https://link.springer.com/article/10.1023/A:1013689704352 The UCB algorithm. Foundational result on optimism in the face of uncertainty in the bandit setting.
  • Jaksch, T., Ortner, R., & Auer, P. (2010). Near-optimal Regret Bounds for Reinforcement Learning. Journal of Machine Learning Research, 11, 1563-1600. https://jmlr.org/papers/v11/jaksch10a.html UCRL2; extends UCB-style optimism to the full MDP setting with provable regret bounds.
  • Osband, I., Blundell, C., Pritzel, A., & Van Roy, B. (2016). Deep Exploration via Bootstrapped DQN. NeurIPS 2016. https://arxiv.org/abs/1602.04621 Bootstrapped DQN; ensemble-based posterior approximation for deep-RL exploration.
  • Fortunato, M., Azar, M. G., Piot, B., et al. (2018). Noisy Networks for Exploration. ICLR 2018. https://arxiv.org/abs/1706.10295 NoisyNets; parameter-level noise as an exploration mechanism for deep RL.
  • Osband, I., Van Roy, B., & Wen, Z. (2016). Generalization and Exploration via Randomized Value Functions. ICML 2016. https://arxiv.org/abs/1402.0635 RLSVI (arXiv preprint from 2014); randomized least-squares value iteration as a posterior-sampling exploration method.
  • Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven Exploration by Self-supervised Prediction. ICML 2017. https://arxiv.org/abs/1705.05363 The ICM paper. Forward-model prediction error in a learned feature space as intrinsic curiosity reward; evaluated on ViZDoom and Super Mario Bros.
  • Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by Random Network Distillation. ICLR 2019. https://arxiv.org/abs/1810.12894 The RND paper. Random-target distillation as intrinsic reward; reported better-than-average-human performance on Montezuma’s Revenge without demonstrations.
  • Tang, H., Houthooft, R., Foote, D., et al. (2017). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning. NeurIPS 2017. https://arxiv.org/abs/1611.04717 Hash-based pseudo-counts for high-dimensional state spaces.
  • Bellemare, M. G., Srinivasan, S., Ostrovski, G., et al. (2016). Unifying Count-Based Exploration and Intrinsic Motivation. NeurIPS 2016. https://arxiv.org/abs/1606.01868 Density-model-based pseudo-counts; an earlier path to count-based bonuses in deep RL.
  • Burda, Y., Edwards, H., Pathak, D., et al. (2019). Large-scale Study of Curiosity-driven Learning. ICLR 2019. https://arxiv.org/abs/1808.04355 Companion study to ICM; characterizes when curiosity-driven exploration helps and when it hurts.
  • Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade Learning Environment: An Evaluation Platform for General Agents. JAIR, 47, 253-279. https://arxiv.org/abs/1207.4708 The Atari benchmark suite. Montezuma’s Revenge, Pitfall, and PrivateEye are the canonical hard-exploration tasks.
  • Salimans, T., & Chen, R. (2018). Learning Montezuma’s Revenge from a Single Demonstration. https://openai.com/blog/learning-montezumas-revenge-from-a-single-demonstration/ Demonstration-bootstrap approach to hard exploration; complementary to RND.
  • Kalashnikov, D., Irpan, A., Pastor, P., et al. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. CoRL 2018. https://arxiv.org/abs/1806.10293 Large-scale off-policy grasping whose initial data collection is bootstrapped with a scripted policy rather than human demonstrations; the practical answer in robotics where pure exploration cannot bootstrap manipulation priors.
  • Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., & Abbeel, P. (2018). Overcoming Exploration in Reinforcement Learning with Demonstrations. ICRA 2018. https://arxiv.org/abs/1709.10089 Demonstrations plus standard RL as an exploration shortcut for sparse-reward robotic tasks.
  • Amin, S., Gomrokchi, M., Satija, H., van Hoof, H., & Precup, D. (2021). A Survey of Exploration Methods in Reinforcement Learning. arXiv:2109.00157. https://arxiv.org/abs/2109.00157 Recent survey covering the three families and their deep-RL incarnations.

CS285 covers both the foundations (random exploration, optimism) and the later exploration material, including model organisms of hard exploration. The lesson cites the canonical primary papers (Osband, Pathak, Burda) rather than re-derivations.

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.