Summary: Exploration
The one-paragraph version
When an agent can act in an environment but the reward is sparse or hard to find, exploration becomes the central problem. Three families address it. Random exploration (epsilon-greedy, Boltzmann sampling, policy-entropy regularization) is simple and asymptotically correct in tabular settings but provably fails on hard exploration, where the reward is reached only by long specific action sequences. Optimism-based exploration (UCB in bandits, UCRL in MDPs, deep-RL adaptations like Bootstrapped DQN, NoisyNets, RLSVI) is principled: maintain uncertainty estimates over Q-values and act on the upper confidence bound. It is theoretically clean and practically competitive on easy-to-medium exploration. Intrinsic motivation (count-based bonuses, ICM curiosity, RND) augments the extrinsic reward with a novelty signal that drives the agent into unexplored states. RND-augmented PPO was the breakthrough on Montezuma’s Revenge, going from the near-zero scores of epsilon-greedy DQN to super-human scores. The dominant decision criterion is easy versus hard exploration: easy environments are well served by random or optimism-based exploration; hard environments require intrinsic motivation.
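To make the contrast between the random and optimism families concrete, here is a minimal bandit-level sketch, assuming the textbook UCB1 index (mean plus `sqrt(2 ln t / n)`). The function names `epsilon_greedy` and `ucb1` and the example numbers are illustrative choices, not taken from any particular library.

```python
import math
import random


def epsilon_greedy(mean_rewards, epsilon, rng=random):
    """Random family: explore uniformly with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(mean_rewards))
    return max(range(len(mean_rewards)), key=lambda arm: mean_rewards[arm])


def ucb1(mean_rewards, pull_counts):
    """Optimism family: pick the arm with the highest upper confidence bound."""
    # Try every arm once before trusting any estimate.
    for arm, count in enumerate(pull_counts):
        if count == 0:
            return arm
    total_pulls = sum(pull_counts)

    def upper_bound(arm):
        # The bonus shrinks as an arm is pulled more often.
        bonus = math.sqrt(2.0 * math.log(total_pulls) / pull_counts[arm])
        return mean_rewards[arm] + bonus

    return max(range(len(mean_rewards)), key=upper_bound)


# Arm 0 looks slightly better but arm 1 has barely been tried (made-up numbers).
means = [0.5, 0.4]
counts = [100, 1]
print("greedy choice:", epsilon_greedy(means, epsilon=0.0))
print("UCB1 choice:  ", ucb1(means, counts))
```

With `epsilon=0.0` the first function is purely greedy and ignores how uncertain each estimate is, while UCB1 directs its next pull at the under-sampled arm. Neither rule, on its own, addresses the long-action-sequence case described above.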
Five things to remember
- Three exploration families. Random (epsilon-greedy, entropy), optimism (UCB-derived, Bootstrapped DQN), intrinsic motivation (RND, ICM, count-based).
- The hard-vs-easy distinction is the dominant choice criterion. Easy means random exploration reaches the reward within training budget; hard means it provably does not.
- RND was the breakthrough. Random Network Distillation added to PPO took Montezuma’s Revenge from near-zero to super-human scores. The clever trick is a fixed random target network plus a trainable predictor, with the predictor's error serving as the novelty bonus.
- Intrinsic motivation diminishes as the agent learns. ICM prediction error decays; RND distillation error decays. By design, not a failure.
- Entropy regularization is not real exploration on hard environments. It keeps the policy stochastic at the action level but does not push toward unvisited states.
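The fixed-target-plus-predictor idea and the decaying bonus can be sketched in a few lines. This is a toy version under loud assumptions: linear maps stand in for the neural networks, states are one-hot vectors, and the class name `LinearRND`, its parameters, and the learning rate are all hypothetical choices for illustration rather than the published RND implementation.

```python
import random


class LinearRND:
    """Toy Random Network Distillation with linear maps instead of networks."""

    def __init__(self, state_dim, feat_dim, lr=0.1, seed=0):
        rng = random.Random(seed)
        # Fixed, randomly initialised target: never trained.
        self.target = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)]
                       for _ in range(feat_dim)]
        # Trainable predictor, starting from zero.
        self.predictor = [[0.0] * state_dim for _ in range(feat_dim)]
        self.lr = lr

    @staticmethod
    def _apply(weights, state):
        return [sum(w * s for w, s in zip(row, state)) for row in weights]

    def bonus(self, state):
        """Intrinsic reward: squared error between predictor and target."""
        target_out = self._apply(self.target, state)
        pred_out = self._apply(self.predictor, state)
        return sum((p - t) ** 2 for p, t in zip(pred_out, target_out))

    def update(self, state):
        """One gradient step pulling the predictor toward the target on `state`."""
        target_out = self._apply(self.target, state)
        pred_out = self._apply(self.predictor, state)
        for row, p, t in zip(self.predictor, pred_out, target_out):
            err = p - t
            for j, s in enumerate(state):
                row[j] -= self.lr * 2.0 * err * s


rnd = LinearRND(state_dim=4, feat_dim=8)
visited = [1.0, 0.0, 0.0, 0.0]  # a state the agent keeps returning to
novel = [0.0, 1.0, 0.0, 0.0]    # a state it has never seen
for _ in range(50):
    rnd.update(visited)
print("bonus on visited state:", rnd.bonus(visited))
print("bonus on novel state:  ", rnd.bonus(novel))
```

After repeated visits the bonus on the familiar state shrinks toward zero while the unseen state keeps its full bonus, which is the by-design decay described in the list above. With one-hot inputs this reduces to something close to a count-based bonus; real RND relies on networks that generalize across similar observations.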
Why this matters
Exploration choice constrains what an RL-trained system has seen. Weak exploration on a complex task produces a system that looks competent on the training distribution and fails on edge cases it never visited. This is the structural reason RLHF-tuned language models sometimes get stuck in local response shapes: post-training explored at the token-sampling level, not at the level of whole responses, so unexplored response shapes never receive any training signal. Understanding exploration as the data-coverage shaper, not a tuning knob, changes how you read RL-trained system claims.
Worked check (memory anchor)
| Environment | Hardness | Recommended family | Why |
|---|---|---|---|
| CartPole | Easy | Random (epsilon-greedy or entropy) | Dense per-step reward; random action keeps signal flowing |
| Montezuma’s Revenge | Hard | RND (intrinsic motivation) | Long action sequences for first reward; random exploration provably fails |
| MuJoCo Hopper | Easy-medium | Entropy regularization in PPO/SAC | Continuous control with per-step velocity reward; PPO defaults work |
| Maze (terminal-only reward) | Hard | Count-based or RND | Sparse signal; novelty bonuses drive coverage |
| Robot pick-and-place | Hard | Demonstrations + intrinsic motivation | Sparse multi-step success signal; demonstrations bootstrap, intrinsic motivation refines |
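The "Hard" rows in the table can be given a rough number. The sketch below assumes a deliberately simplified model: the first reward requires one specific action at each of `sequence_length` consecutive steps, any wrong action forfeits the episode, and the agent picks uniformly at random. The function names and the example sizes are hypothetical and do not describe any real environment.

```python
def success_probability(num_actions: int, sequence_length: int) -> float:
    """Chance that uniform random actions hit one exact action sequence."""
    return (1.0 / num_actions) ** sequence_length


def expected_episodes(num_actions: int, sequence_length: int) -> int:
    """Mean number of episodes until the first success (geometric distribution)."""
    return num_actions ** sequence_length


# Illustrative sizes only: the cost grows exponentially with sequence length.
for length in (5, 10, 20):
    print(length, success_probability(4, length), expected_episodes(4, length))
```

Under these assumptions the expected number of episodes is `num_actions ** sequence_length`, so doubling the required sequence length squares the cost. That growth is what separates the easy rows, where random actions still receive reward signal, from the hard rows, where a novelty bonus or demonstrations are needed to reach the first reward at all.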
Where this fits in the broader curriculum
- L7 DQN introduced epsilon-greedy as the default exploration; L16 names where that choice fails.
- L8 PPO added entropy regularization to keep the policy stochastic; L16 contextualizes it as mild exploration insufficient for hard tasks.
- L13 RLHF explores at the token-sampling level; the overall policy shape is set by the SFT initialization. Local-minimum behaviors in fine-tuned models often trace to this.
- L17 next is multi-task and meta-RL: when the agent has to learn many related tasks, can the structure across tasks help with both exploration and sample efficiency?