Brief: Exploration
What you will learn
You will name the three exploration families (random, optimism-based, intrinsic-motivation), describe the mechanism of each (epsilon-greedy and entropy regularization; UCB and its deep-RL adaptations such as Bootstrapped DQN and NoisyNets; ICM curiosity and RND novelty), and apply the easy-vs-hard-exploration distinction as the dominant decision criterion for picking among them. You will recognize why random exploration provably fails on hard-exploration environments (the probability of producing one specific long action sequence by chance shrinks exponentially with the sequence length), understand the RND mechanism (fixed-random target plus trainable predictor) and why it was the breakthrough on Montezuma’s Revenge, and leave with a working rubric for the exploration choice on any new environment.
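The exponential-shrinkage argument can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a uniform-random policy and a task that requires one exact action sequence; the function name and the choice of 18 actions (the size of the full Atari action set) are illustrative, not taken from the lesson.

```python
def prob_random_hits_sequence(num_actions: int, seq_len: int) -> float:
    """Chance that uniform-random actions reproduce one specific sequence."""
    return (1.0 / num_actions) ** seq_len


# Illustrative values only: 18 discrete actions, increasingly long sequences.
for seq_len in (5, 20, 100):
    p = prob_random_hits_sequence(num_actions=18, seq_len=seq_len)
    print(f"length {seq_len:>3}: p = {p:.3e}, expected attempts = {1 / p:.3e}")
```

Each extra required step multiplies the success probability by one over the number of actions, which is why a longer horizon cannot be compensated by a modest increase in samples.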
Where this fits
This is lesson 16 of Track 18 (Deep Reinforcement Learning), lesson 4 of Phase 3 (rl-frontiers). It pivots from the offline-RL pair (L14, L15) to the opposite problem: when the agent CAN act but the reward is sparse, how should it explore? It builds on L7 DQN (which uses epsilon-greedy as its default exploration), L8 PPO (which uses entropy regularization), and L13 RLHF (which contextualizes exploration as token-level sampling under a KL-regularized policy).
Source
Berkeley CS285 (Sergey Levine, Fall 2023), lectures on Exploration. Canonical URL http://rail.eecs.berkeley.edu/deeprlcourse/. Primary papers: Auer (2002) UCB; Jaksch et al. (2010) UCRL2; Osband et al. (2016) Bootstrapped DQN; Fortunato et al. (2018) NoisyNets; Pathak et al. (2017) ICM; Burda et al. (2018) RND.
Phase advance
Phase 3 lesson 4 (phase_order: 4). L16 is followed by L17 (Multi-task and meta-RL) and then L18 (Open problems), which closes Phase 3 and Track 18.
Lesson body (lesson.mdx)
- Hook: L14/L15 covered offline; L16 returns to online but with a sharper question.
- Three families: random, optimism, intrinsic motivation.
- Random exploration: epsilon-greedy, Boltzmann, entropy regularization. Asymptotic guarantees, fails on hard exploration (Montezuma’s Revenge as the worked counter-example).
- Optimism-based: UCB, UCRL, Bootstrapped DQN, NoisyNets, RLSVI. Principled, theoretically clean, competitive on easy-medium exploration.
- Intrinsic motivation: count-based, ICM, RND. RND mechanism explained (fixed-random target plus trainable predictor); the Montezuma’s Revenge breakthrough.
- Hard vs easy exploration distinction, table-form, with examples.
- Where exploration fits in modern RL: LLM RLHF (token sampling), robotics (demonstrations plus intrinsic), recommender systems (bandits).
- Why-this-matters: exploration choice constrains data coverage; weak exploration produces systems that fail on unexplored edges.
- Common pitfalls (5): epsilon-greedy as automatic; conflating regimes; under-tuning beta; forgetting intrinsic decay; entropy-as-exploration.
- 5 remember-bullets.
- L17 setup.
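The RND mechanism named in the outline (fixed-random target plus trainable predictor) can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the lesson's or the paper's implementation: the class name `RNDSketch`, the one-hot states, the linear predictor, and the learning rate are all hypothetical choices made to keep the example small.

```python
import numpy as np


class RNDSketch:
    """Toy RND: a frozen random target and a trainable linear predictor."""

    def __init__(self, obs_dim: int, feat_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Target weights are drawn once and never updated.
        self.w_target = rng.normal(size=(obs_dim, feat_dim))
        # Predictor starts at zero and is the only trained part.
        self.w_pred = np.zeros((obs_dim, feat_dim))

    def target(self, obs: np.ndarray) -> np.ndarray:
        return np.tanh(obs @ self.w_target)

    def intrinsic_reward(self, obs: np.ndarray) -> np.ndarray:
        """Per-state mean squared prediction error (the novelty bonus)."""
        err = obs @ self.w_pred - self.target(obs)
        return np.mean(err ** 2, axis=-1)

    def update(self, obs: np.ndarray, lr: float = 0.5) -> None:
        """One gradient step on the squared error; constants folded into lr."""
        err = obs @ self.w_pred - self.target(obs)
        self.w_pred -= lr * (obs.T @ err) / len(obs)


# Eight one-hot "states": the agent only ever visits the first four.
states = np.eye(8)
visited, novel = states[:4], states[4:]

model = RNDSketch(obs_dim=8, feat_dim=16)
reward_before = model.intrinsic_reward(visited)
for _ in range(200):
    model.update(visited)
reward_visited = model.intrinsic_reward(visited)
reward_novel = model.intrinsic_reward(novel)

print("visited, before training:", reward_before.round(3))
print("visited, after training: ", reward_visited.round(3))
print("novel,   after training: ", reward_novel.round(3))
```

The predictor is trained only on states the agent has seen, so its error against the frozen target decays there and stays high elsewhere. That residual error is what gets used as the novelty bonus.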
Practice (practice.mdx)
Two exercises plus six flashcards.
- Classify the exploration hardness (5 environments): CartPole, Montezuma’s Revenge, MuJoCo Hopper, terminal-reward maze, robotic pick-and-place. Classify each as easy/moderate/hard and justify in one sentence.
- Pick the strategy for each Exercise 1 environment. Justify in two sentences with any additional advice.
Six flashcards: three families; why random fails on hard exploration; RND mechanism and breakthrough; easy-vs-hard distinction; intrinsic-reward weight tuning; entropy regularization as exploration.
Cheatsheet (cheatsheet.mdx)
Tables. Three families side by side (mechanism, examples, best for). Easy vs hard exploration comparison. RND two-network mechanism with intrinsic-reward formula in fenced display. Optimism-based variants. Random-exploration variants. Decision rubric (5-step). Common pitfalls. Remember-bullets.
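One possible rendering of the intrinsic-reward display for the cheatsheet is sketched below. The notation is an assumption, not copied from the lesson files: r_ext is the environment reward, r_int the novelty bonus, beta the intrinsic-reward weight, f the fixed random target network, and f-hat the trainable predictor with parameters theta.

```latex
\begin{aligned}
r_{\text{total}} &= r_{\text{ext}} + \beta \, r_{\text{int}} \\
r_{\text{int}}(s) &= \bigl\lVert \hat{f}_{\theta}(s) - f(s) \bigr\rVert^{2}
\end{aligned}
```

The second line is the distillation error: it is large on states where the predictor has had little training and shrinks as those states are revisited.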
References (references.mdx)
CS285 and L23 primary. Random exploration: Mnih DQN, Haarnoja SAC. Optimism: Auer UCB, Jaksch UCRL2, Osband Bootstrapped DQN, Fortunato NoisyNets, Osband RLSVI. Intrinsic motivation: Pathak ICM, Burda RND, Tang count-based, Bellemare density-counts, Burda large-scale curiosity. Benchmarks: Bellemare ALE, Salimans Montezuma demonstration. Robotics: Kalashnikov QT-Opt, Nair demonstration-bootstrapped. Survey: Amin et al. 2021.
Editorial discipline
- Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0), Greek letters in prose spelled out (beta, epsilon as relevant; symbols only in fenced display blocks), placeholder comments present on brief.
- §6 watch-zone: technical exploration content; no policy or vendor-advocacy framing. The LLM-RLHF and recommender-systems contextualization is factual product-class reference.
- Vendor naming: DeepMind (Atari benchmark, NoisyNets), OpenAI (RND, ICM-related), Google Robotics (QT-Opt) named only as paper-author affiliations; positive citations; A1 verbatim discipline n/a.
Word counts
- Lesson 2384
- Practice 1320
- Summary 615
- Cheatsheet 690
- References 685
- Brief 875
Total ≈ 6569 words across 6 artifacts.
Notes for promotion
- Component placeholders (�J0�, �J1�) as MDX comments. �J2� for CS285+L23 “Exploration”.
- Practice uses real �J0� + �J1� component imports.
- L7 DQN prereq path: lessons/deep-reinforcement-learning/dqn. L8 PPO: lessons/deep-reinforcement-learning/ppo. L13 RLHF: lessons/deep-reinforcement-learning/rlhf. L14/L15 offline: lessons/deep-reinforcement-learning/offline-rl-problem and .../offline-rl-algorithms.
- Lesson body uses fenced display block for the intrinsic-reward sum (r_total formula) and the RND distillation-error formula. Greek symbols stay in those fenced blocks; prose spells beta / epsilon.
- L16 pivots Phase 3 from offline-RL to data-efficiency questions (exploration here, multi-task and meta in L17).