
Brief: Exploration

You will name the three exploration families (random, optimism-based, intrinsic-motivation) and describe the mechanism of each: epsilon-greedy and entropy regularization; UCB and its deep-RL adaptations such as Bootstrapped DQN and NoisyNets; ICM curiosity and RND novelty. You will apply the easy-vs-hard-exploration distinction as the dominant decision criterion for picking among them. You will recognize why random exploration provably fails on hard-exploration environments (the probability of executing one specific long action sequence shrinks exponentially with its length), understand the RND mechanism (fixed random target network plus trainable predictor) and why it was the breakthrough on Montezuma’s Revenge, and leave with a working rubric for choosing an exploration strategy on any new environment.
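The exponential-shrinkage claim can be checked with a few lines of arithmetic. The sketch below assumes a uniform-random policy and a single required action sequence; the function names and the sequence lengths are illustrative assumptions, and 18 is the size of the full Atari action set that Montezuma’s Revenge uses.

```python
def p_exact_sequence(n_actions: int, seq_len: int) -> float:
    """Probability that a uniform-random policy emits one specific
    action sequence of length seq_len (each step succeeds w.p. 1/n_actions)."""
    return (1.0 / n_actions) ** seq_len


def expected_attempts(n_actions: int, seq_len: int) -> float:
    """Expected number of independent attempts before the sequence occurs once."""
    return 1.0 / p_exact_sequence(n_actions, seq_len)


if __name__ == "__main__":
    # Sequence lengths below are illustrative, not measured from any game.
    for seq_len in (5, 10, 20, 50):
        p = p_exact_sequence(18, seq_len)
        print(f"length {seq_len:>2}: p = {p:.3e}, "
              f"expected attempts = {expected_attempts(18, seq_len):.3e}")
```

Under these assumptions each extra required step multiplies the expected number of attempts by the number of actions, which is why a larger epsilon or a longer training run does not rescue random exploration once the required sequence is long.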

This is lesson 16 of Track 18 (Deep Reinforcement Learning), lesson 4 of Phase 3 (rl-frontiers). It pivots from the offline-RL pair (L14, L15) to the opposite problem: when the agent CAN act but the reward is sparse, how should it explore? It builds on L7 DQN (which uses epsilon-greedy as its default exploration), L8 PPO (which uses entropy regularization), and L13 RLHF (which contextualizes exploration as token-level sampling under a KL-regularized policy).

Berkeley CS285 (Sergey Levine, Fall 2023), lectures on Exploration. Canonical URL: http://rail.eecs.berkeley.edu/deeprlcourse/. Primary papers: Auer (2002) UCB; Jaksch et al. (2010) UCRL2; Osband et al. (2016) Bootstrapped DQN; Fortunato et al. (2018) NoisyNets; Pathak et al. (2017) ICM; Burda et al. (2018) RND.

Phase 3 lesson 4 (phase_order: 4). L16 is followed by L17 (Multi-task and meta-RL) and then L18 (Open problems), which closes Phase 3 and Track 18.

  • Hook: L14/L15 covered offline; L16 returns to online but with a sharper question.
  • Three families: random, optimism, intrinsic motivation.
  • Random exploration: epsilon-greedy, Boltzmann, entropy regularization. Asymptotic guarantees, fails on hard exploration (Montezuma’s Revenge as the worked counter-example).
  • Optimism-based: UCB, UCRL, Bootstrapped DQN, NoisyNets, RLSVI. Principled, theoretically clean, competitive on easy-medium exploration.
  • Intrinsic motivation: count-based, ICM, RND. RND mechanism explained (fixed-random target plus trainable predictor); the Montezuma’s Revenge breakthrough.
  • Hard vs easy exploration distinction, table-form, with examples.
  • Where exploration fits in modern RL: LLM RLHF (token sampling), robotics (demonstrations plus intrinsic), recommender systems (bandits).
  • Why-this-matters: exploration choice constrains data coverage; weak exploration produces systems that fail on unexplored edges.
  • Common pitfalls (5): epsilon-greedy as automatic; conflating regimes; under-tuning beta; forgetting intrinsic decay; entropy-as-exploration.
  • 5 remember-bullets.
  • L17 setup.
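
The RND mechanism named in the outline (fixed-random target plus trainable predictor) can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the setup from Burda et al.: states are one-hot vectors, the target is a single random tanh layer, the predictor is linear, and all names (`W_target`, `W_pred`, `intrinsic_reward`, `total_reward`) and the value of beta are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, FEAT_DIM = 10, 8

# Fixed, randomly initialised target network f: never updated.
W_target = rng.normal(size=(FEAT_DIM, N_STATES))
# Trainable predictor f_hat: starts at zero and is fit to the target
# only on states the agent actually visits.
W_pred = np.zeros((FEAT_DIM, N_STATES))


def one_hot(state):
    x = np.zeros(N_STATES)
    x[state] = 1.0
    return x


def target_features(x):
    return np.tanh(W_target @ x)


def intrinsic_reward(state):
    """Squared prediction error ||f_hat(s) - f(s)||^2, used as a novelty bonus."""
    x = one_hot(state)
    return float(np.sum((W_pred @ x - target_features(x)) ** 2))


def train_predictor(state, lr=0.1):
    """One gradient step on the predictor's squared error for a visited state."""
    x = one_hot(state)
    err = W_pred @ x - target_features(x)
    W_pred[...] -= lr * 2.0 * np.outer(err, x)


def total_reward(r_ext, state, beta=0.5):
    """r_total = r_ext + beta * r_int; beta here is an arbitrary illustrative weight."""
    return r_ext + beta * intrinsic_reward(state)


# The agent only ever visits states 0..4; states 5..9 stay novel.
visited = [0, 1, 2, 3, 4]
for step in range(200):
    train_predictor(visited[step % len(visited)])

familiar, novel = intrinsic_reward(2), intrinsic_reward(7)
print(f"intrinsic reward, visited state 2: {familiar:.2e}")
print(f"intrinsic reward, unvisited state 7: {novel:.2e}")
```

The predictor's error collapses on the states it was trained on and stays high elsewhere, so the bonus steers the agent toward unvisited states. With one-hot inputs the predictor cannot generalize across states at all; a real implementation uses neural networks on raw observations, where some generalization is expected.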

Two exercises plus six flashcards.

  1. Classify the exploration hardness (5 environments): CartPole, Montezuma’s Revenge, MuJoCo Hopper, terminal-reward maze, robotic pick-and-place. Classify each as easy/moderate/hard and justify in one sentence.
  2. Pick the strategy for each Exercise 1 environment. Justify in two sentences with any additional advice.

Six flashcards: three families; why random fails on hard exploration; RND mechanism and breakthrough; easy-vs-hard distinction; intrinsic-reward weight tuning; entropy regularization as exploration.

Tables. Three families side by side (mechanism, examples, best for). Easy vs hard exploration comparison. RND two-network mechanism with intrinsic-reward formula in fenced display. Optimism-based variants. Random-exploration variants. Decision rubric (5-step). Common pitfalls. Remember-bullets.
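
For the optimism-based variants, a small UCB1 bandit sketch may help ground the table. It assumes Bernoulli arms and the standard UCB1 bonus of sqrt(2 ln t / n) from Auer (2002); the function name `ucb1`, the arm means, and the round count are illustrative assumptions, and the deep-RL variants (Bootstrapped DQN, NoisyNets) are not shown.

```python
import math
import random


def ucb1(arm_means, n_rounds, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms        # pulls per arm
    reward_sums = [0.0] * n_arms  # total reward per arm

    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            # Initialisation: pull every arm once so each has an estimate.
            arm = t - 1
        else:
            # Optimism: empirical mean plus a bonus that shrinks as the
            # arm is pulled more and grows slowly with total time.
            arm = max(
                range(n_arms),
                key=lambda a: reward_sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        reward_sums[arm] += reward
    return counts


if __name__ == "__main__":
    # Hypothetical arm means chosen for illustration.
    print(ucb1([0.2, 0.5, 0.8], n_rounds=2000))
```

Rarely pulled arms keep a large bonus, so every arm is retried occasionally, while most pulls concentrate on the arm with the highest estimated value. This is the "optimism in the face of uncertainty" principle that the deep-RL adaptations approximate.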

CS285 and L23 primary. Random exploration: Mnih DQN, Haarnoja SAC. Optimism: Auer UCB, Jaksch UCRL2, Osband Bootstrapped DQN, Fortunato NoisyNets, Osband RLSVI. Intrinsic motivation: Pathak ICM, Burda RND, Tang count-based, Bellemare density-counts, Burda large-scale curiosity. Benchmarks: Bellemare ALE, Salimans Montezuma demonstration. Robotics: Kalashnikov QT-Opt, Nair demonstration-bootstrapped. Survey: Amin et al. 2021.

  • Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0), Greek letters in prose spelled out (beta, epsilon as relevant; symbols only in fenced display blocks), placeholder comments present on brief.
  • §6 watch-zone: technical exploration content; no policy or vendor-advocacy framing. The LLM-RLHF and recommender-systems contextualization is factual product-class reference.
  • Vendor naming: DeepMind (Atari benchmark, NoisyNets), OpenAI (RND, ICM-related), Google Robotics (QT-Opt) named only as paper-author affiliations; positive citations; A1 verbatim discipline n/a.
  • Lesson 2384
  • Practice 1320
  • Summary 615
  • Cheatsheet 690
  • References 685
  • Brief 875

Total ≈ 6569 words across 6 artifacts.

  • Component placeholders (�J0�, �J1�) as MDX comments. �J2� for CS285+L23 “Exploration”.
  • Practice uses real �J0� + �J1� component imports.
  • L7 DQN prereq path: lessons/deep-reinforcement-learning/dqn. L8 PPO: lessons/deep-reinforcement-learning/ppo. L13 RLHF: lessons/deep-reinforcement-learning/rlhf. L14/L15 offline: lessons/deep-reinforcement-learning/offline-rl-problem and .../offline-rl-algorithms.
  • Lesson body uses fenced display block for the intrinsic-reward sum (r_total formula) and the RND distillation-error formula. Greek symbols stay in those fenced blocks; prose spells beta / epsilon.
  • L16 pivots Phase 3 from offline-RL to data-efficiency questions (exploration here, multi-task and meta in L17).