
Summary: Exploration

When an agent can act in an environment but the reward is sparse or hard to find, exploration becomes the central problem. Three families address it. Random exploration (epsilon-greedy, Boltzmann sampling, policy-entropy regularization) is simple and asymptotically correct in tabular settings, but it fails on hard exploration problems where the reward is reached only by a long, specific action sequence: the chance of producing such a sequence by random action choice shrinks exponentially with its length. Optimism-based exploration (UCB in bandits, UCRL in MDPs, and deep-RL adaptations such as Bootstrapped DQN, NoisyNets, and RLSVI) is principled: maintain uncertainty estimates over Q-values and act on the upper confidence bound, or on an approximation of it. It is theoretically clean and practically competitive on easy-to-medium exploration. Intrinsic motivation (count-based bonuses, ICM curiosity, RND) augments the extrinsic reward with a novelty signal that drives the agent into unexplored states. RND-augmented PPO was the breakthrough on Montezuma’s Revenge, going from the near-zero scores of epsilon-greedy DQN to super-human scores. The dominant decision criterion is easy versus hard exploration: easy environments are well served by random or optimism-based exploration; hard environments require intrinsic motivation.
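The action-selection rules behind the first two families can be sketched in a few lines. This is a minimal illustration for a bandit-style setting, not a reference implementation: the function names, the `c` constant, and the example numbers are assumptions chosen for clarity, and the UCB rule shown is the classic UCB1 form rather than any of the deep-RL variants named above.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random family: explore uniformly with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature):
    """Random family: softmax over Q-values; higher temperature means flatter sampling."""
    top = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def ucb1(q_values, counts, t, c=2.0):
    """Optimism family: pick the action with the largest upper confidence bound."""
    # An untried action has unbounded uncertainty, so try it first.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


# Hypothetical numbers: action 0 looks slightly better but has been tried 100 times,
# action 1 only twice, so its confidence bonus dominates and UCB1 picks it.
chosen = ucb1([0.5, 0.45], counts=[100, 2], t=102)
```

Note the difference in what drives exploration: the two random rules ignore how often an action has been tried, while the UCB bonus shrinks as the count for that action grows.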

  1. Three exploration families. Random (epsilon-greedy, entropy), optimism (UCB-derived, Bootstrapped DQN), intrinsic motivation (RND, ICM, count-based).
  2. The hard-vs-easy distinction is the dominant choice criterion. Easy means random exploration reaches the reward within training budget; hard means it provably does not.
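  3. RND was the breakthrough. Random Network Distillation added to PPO took Montezuma’s Revenge from near-zero scores to super-human ones. The clever trick is pairing a fixed, randomly initialized target network with a trainable predictor: the predictor's error on a state serves as the novelty bonus.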
  4. Intrinsic motivation diminishes as the agent learns. ICM prediction error decays; RND distillation error decays. By design, not a failure.
  5. Entropy regularization is not real exploration on hard environments. It keeps the policy stochastic at the action level but does not push toward unvisited states.
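Points 3 and 4 can be illustrated with a toy version of the RND mechanism. The sketch below is an assumption-laden simplification: the target is a single random `tanh` layer and the predictor is a plain linear map trained by gradient descent, whereas the actual method uses neural networks for both. The class name `RND`, the dimensions, and the learning rate are hypothetical.

```python
import numpy as np


class RND:
    """Toy Random Network Distillation: fixed random target, trainable predictor."""

    def __init__(self, obs_dim, feat_dim, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W_target = rng.normal(size=(feat_dim, obs_dim))  # fixed, never trained
        self.W_pred = np.zeros((feat_dim, obs_dim))           # trainable predictor
        self.lr = lr

    def _error(self, obs):
        # Predictor output minus the target network's output for this observation.
        return self.W_pred @ obs - np.tanh(self.W_target @ obs)

    def bonus(self, obs):
        """Intrinsic reward: mean squared prediction error on this observation."""
        return float(np.mean(self._error(obs) ** 2))

    def update(self, obs):
        """One gradient step on 0.5 * ||error||^2 with respect to the predictor."""
        err = self._error(obs)
        self.W_pred -= self.lr * np.outer(err, obs)


rnd = RND(obs_dim=4, feat_dim=8)
visited = np.array([1.0, 0.0, 0.0, 0.0])
novel = np.array([0.0, 1.0, 0.0, 0.0])

before = rnd.bonus(visited)
for _ in range(200):          # the agent keeps seeing the same state
    rnd.update(visited)
after = rnd.bonus(visited)    # the bonus for the familiar state has decayed
still_novel = rnd.bonus(novel)  # the unseen state keeps its full bonus
```

The decay of `after` relative to `before` is the "by design" behavior from point 4: once the predictor has fit a state, that state stops paying out, and the bonus concentrates on states the agent has not yet visited.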

The choice of exploration method constrains what an RL-trained system has seen. Weak exploration on a complex task produces a system that looks competent on the training distribution and fails on edge cases it never visited. This is the structural reason RLHF-tuned language models sometimes get stuck in local response shapes: post-training exploration happened at the token-sampling level, not at the level of the response space, so unexplored response shapes never receive any training signal. Understanding exploration as the shaper of data coverage, not a tuning knob, changes how you read claims about RL-trained systems.

| Environment | Hardness | Recommended family | Why |
| --- | --- | --- | --- |
| CartPole | Easy | Random (epsilon-greedy or entropy) | Dense per-step reward; random action keeps signal flowing |
| Montezuma’s Revenge | Hard | RND (intrinsic motivation) | Long action sequences for first reward; random exploration provably fails |
| MuJoCo Hopper | Easy-medium | Entropy regularization in PPO/SAC | Continuous control with per-step velocity reward; PPO defaults work |
| Maze (terminal-only reward) | Hard | Count-based or RND | Sparse signal; novelty bonuses drive coverage |
| Robot pick-and-place | Hard | Demonstrations + intrinsic motivation | Sparse multi-step success signal; demonstrations bootstrap, intrinsic motivation refines |

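Two of the table's "Why" entries reduce to simple arithmetic, sketched here under stated assumptions. The first function gives the chance that a uniformly random policy emits one specific action sequence, which is the core of the hard-exploration argument. The second is a basic count-based bonus of the form beta / sqrt(N(s)) for a small maze with hashable states. The names `random_hit_probability` and `CountBonus`, and the numbers used, are illustrative only.

```python
import math
from collections import defaultdict


def random_hit_probability(num_actions, sequence_length):
    """Chance that a uniform-random policy emits one specific action sequence."""
    return (1.0 / num_actions) ** sequence_length


class CountBonus:
    """Count-based intrinsic reward: beta / sqrt(N(s)), for hashable states."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1  # count this visit before computing the bonus
        return self.beta / math.sqrt(self.counts[state])


# Hypothetical maze: 4 actions, reward only after one specific 10-step path.
p = random_hit_probability(num_actions=4, sequence_length=10)  # 4**-10, about 9.5e-7

bonus = CountBonus(beta=1.0)
first_visit = bonus((0, 0))    # 1.0 on the first visit to cell (0, 0)
bonus((0, 0))
bonus((0, 0))
fourth_visit = bonus((0, 0))   # 0.5 on the fourth visit
```

Even at 10 steps the random policy succeeds less than once in a million episodes, while the count bonus gives a dense signal from the first step by rewarding cells in inverse proportion to how often they have been seen.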
  • L7 DQN introduced epsilon-greedy as the default exploration; L16 names where that choice fails.
  • L8 PPO added entropy regularization to keep the policy stochastic; L16 contextualizes it as mild exploration insufficient for hard tasks.
  • L13 RLHF explores at the token-sampling level; the policy-shape exploration is at the SFT initialization. Local-minimum behaviors in fine-tuned models often trace to this.
  • L17 next is multi-task and meta-RL: when the agent has to learn many related tasks, can the structure across tasks help with both exploration and sample efficiency?