Cheatsheet: Exploration
Three exploration families
| Family | Mechanism | Examples | Best for |
|---|---|---|---|
| Random | Take a random action with some probability, or maintain policy stochasticity | Epsilon-greedy, Boltzmann sampling, policy-entropy regularization | Easy exploration (dense or moderately sparse reward) |
| Optimism | Act as if uncertain Q-values sat at their upper confidence bound, or sample from an approximate posterior over values | UCB (bandits), UCRL (tabular MDPs), Bootstrapped DQN, NoisyNets, RLSVI | Easy-medium exploration; theoretically principled |
| Intrinsic motivation | Augment extrinsic reward with novelty/curiosity bonus | Count-based, ICM, RND | Hard exploration (sparse or long-sequence rewards) |
Easy vs hard exploration
| Property | Easy exploration | Hard exploration |
|---|---|---|
| Reward density | Dense or moderately sparse | Reached only by long specific action sequences |
| Random exploration finds reward? | Yes, within training budget | No; the chance shrinks exponentially with the length of the required action sequence |
| Examples | CartPole, Pong, Breakout, MuJoCo locomotion | Montezuma’s Revenge, Pitfall, mazes with terminal-only reward, sparse-reward robot manipulation |
| Recommended family | Random or optimism | Intrinsic motivation (sometimes plus demonstrations) |
| Typical strong baseline | Epsilon-greedy DQN or PPO | RND-augmented PPO |
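
To make "exponentially unlikely" concrete, here is a rough sketch under a strong simplifying assumption: exactly one fixed-length action sequence reaches the reward, and the agent picks actions uniformly at random. The function names are illustrative and not taken from any library.

```python
def prob_random_success(num_actions: int, sequence_length: int) -> float:
    """Chance that one uniform-random episode executes one specific action sequence."""
    return (1.0 / num_actions) ** sequence_length


def expected_episodes(num_actions: int, sequence_length: int) -> float:
    """Expected episodes until the first success (mean of a geometric distribution)."""
    return 1.0 / prob_random_success(num_actions, sequence_length)


if __name__ == "__main__":
    # With 4 actions, each extra required step multiplies the cost by 4.
    for length in (5, 20, 50):
        print(length, prob_random_success(4, length), expected_episodes(4, length))
```

Real environments have many rewarding paths and partial progress, so treat this as an order-of-magnitude intuition rather than a prediction.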
RND (the breakthrough)
Two networks:
- Target network: fixed random initialization, never trained
- Predictor network: trained to match the target on agent-visited states
Intrinsic reward at state s:
r_intrinsic(s) = || predictor(s) - target(s) ||²
Novel states have high prediction error (predictor never trained on them); visited states have low error after training. The fixed-random target is the trick: the predictor can only match via training, which only happens at visited states.
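
The mechanism can be sketched in a few lines of pure Python. This is a toy version, not a reference implementation: both networks are assumed to be single linear layers, the class name `RND` and its methods are hypothetical, and practical implementations use deep networks and typically normalize observations and intrinsic rewards.

```python
import random


class RND:
    """Toy RND with linear 'networks' (an assumption made for brevity)."""

    def __init__(self, state_dim: int, feat_dim: int, lr: float = 0.1, seed: int = 0):
        rng = random.Random(seed)
        # Target: fixed random weights, never updated.
        self.target_w = [[rng.gauss(0, 1) for _ in range(state_dim)] for _ in range(feat_dim)]
        # Predictor: its own random init, trained toward the target on visited states.
        self.pred_w = [[rng.gauss(0, 1) for _ in range(state_dim)] for _ in range(feat_dim)]
        self.lr = lr

    @staticmethod
    def _forward(weights, state):
        return [sum(w * x for w, x in zip(row, state)) for row in weights]

    def intrinsic_reward(self, state) -> float:
        """Squared prediction error || predictor(s) - target(s) ||^2."""
        pred = self._forward(self.pred_w, state)
        targ = self._forward(self.target_w, state)
        return sum((p - t) ** 2 for p, t in zip(pred, targ))

    def update(self, state) -> None:
        """One SGD step on the predictor only, using a visited state."""
        pred = self._forward(self.pred_w, state)
        targ = self._forward(self.target_w, state)
        for row, p, t in zip(self.pred_w, pred, targ):
            for j, x in enumerate(state):
                row[j] -= self.lr * 2.0 * (p - t) * x  # gradient of (p - t)^2


rnd = RND(state_dim=3, feat_dim=4)
visited, novel = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
for _ in range(100):
    rnd.update(visited)  # the agent keeps seeing this state
print("visited:", rnd.intrinsic_reward(visited))  # shrinks toward zero
print("novel:  ", rnd.intrinsic_reward(novel))    # untouched by training
```

The novel state here is orthogonal to the visited one, so training never touches the weights that determine its error. With deep networks the separation is less clean because the predictor generalizes across similar states.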
Total reward optimized:
r_total = r_extrinsic + beta · r_intrinsic
Optimism-based mechanisms
| Approach | What it does | Setting |
|---|---|---|
| UCB1 (Auer et al. 2002) | Pick the action maximizing empirical mean + sqrt(2 · ln(n) / n_a), where n is the total number of pulls and n_a the pulls of action a | Bandits, provably near-optimal regret |
| UCRL (Jaksch et al. 2010) | Plan against optimistic MDP given confidence intervals on transitions/rewards | Tabular MDPs, near-optimal regret bound |
| Bootstrapped DQN (Osband et al. 2016) | Ensemble of Q-heads, each trained on a bootstrapped subsample of the replay data; sample one head per episode and act greedily with it | Deep RL, approximate posterior |
| NoisyNets (Fortunato et al. 2018) | Learnable parameter-level noise on the network | Deep RL, parameter-noise exploration |
| RLSVI (Osband et al. 2014) | Randomized least-squares value iteration; Thompson-sampling-style | Tabular and approximate settings |
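
As a concrete instance of the optimism principle, here is a minimal UCB1 sketch for a Bernoulli bandit. The helper names (`ucb1_select`, `update_arm`, `run_bandit`) and the simulated arms are assumptions for illustration; the bonus uses the standard sqrt(2 · ln(n) / n_a) form.

```python
import math
import random


def ucb1_select(counts, means, c=2.0):
    """Return the arm maximizing mean + sqrt(c * ln(n) / n_a)."""
    # Pull every arm once first so each count is nonzero.
    for arm, n_a in enumerate(counts):
        if n_a == 0:
            return arm
    n = sum(counts)
    scores = [m + math.sqrt(c * math.log(n) / n_a) for m, n_a in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def update_arm(counts, means, arm, reward):
    """Incremental update of the empirical mean for the pulled arm."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]


def run_bandit(probs, steps, seed=0):
    """Simulate UCB1 on Bernoulli arms with success probabilities `probs`."""
    rng = random.Random(seed)
    counts, means = [0] * len(probs), [0.0] * len(probs)
    for _ in range(steps):
        arm = ucb1_select(counts, means)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        update_arm(counts, means, arm, reward)
    return counts, means


if __name__ == "__main__":
    counts, means = run_bandit([0.2, 0.5, 0.8], steps=2000)
    print(counts, [round(m, 2) for m in means])
```

The bonus shrinks as an arm is pulled more often, so rarely tried arms keep getting revisited until their estimates are trustworthy.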
Random exploration variants
| Variant | Formula | Notes |
|---|---|---|
| Epsilon-greedy | With prob epsilon, uniform random action; else greedy | DQN default; anneal epsilon |
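| Boltzmann/softmax | P(a) proportional to exp(Q(s, a) / T) | Temperature T tunes concentration: low T is near-greedy, high T is near-uniform |
| Entropy regularization | Training loss adds -beta · H(policy), where beta is the entropy coefficient (separate from the intrinsic-reward weight beta) | Standard in PPO; SAC builds the entropy term into its objective; mild exploration |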
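
The three variants in the table can be sketched as small helpers. The function names are illustrative, and `q_values` is assumed to be a plain list of action values for the current state.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def boltzmann_probs(q_values, temperature):
    """Softmax over Q / T; subtracting the max keeps exp() numerically stable."""
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_sample(q_values, temperature, rng=random):
    """Draw one action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]


def entropy(probs):
    """Shannon entropy H(policy) in nats; the quantity an entropy bonus rewards."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5]
    for temperature in (0.1, 1.0, 10.0):
        probs = boltzmann_probs(q, temperature)
        print(temperature, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Printing the entropy alongside the probabilities shows how temperature and entropy move together: a higher T gives a flatter distribution and a larger H.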
Decision rubric
- Is the reward dense? Yes → random exploration is enough.
- Is the reward moderately sparse? Random first; try Bootstrapped DQN or NoisyNets if random underperforms.
- Is the reward extremely sparse / long-sequence? Jump to RND or ICM. Consider supplementing with demonstrations.
- Continuous high-dimensional actions? RND with PPO is a common choice.
- Tabular small state space? Optimism-based methods (UCB, UCRL) are theoretically clean.
Common pitfalls
- Treating epsilon-greedy as automatic (fails on hard exploration)
- Conflating exploration regimes (random and intrinsic motivation are not interchangeable)
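- Under-tuning the intrinsic-reward weight beta (both extremes fail: too small and the bonus is ignored, too large and it drowns out the extrinsic reward)
- Forgetting that the intrinsic bonus diminishes as states become familiar (by design, not a failure)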
- Treating entropy regularization as real exploration on hard environments
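
The beta and diminishing-bonus pitfalls are easiest to see with a count-based bonus. The sketch below assumes a 1 / sqrt(N(s)) bonus, which is one common choice rather than the only one, and hashable states; the class name `CountBonus` is hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Count-based novelty bonus combined as r_total = r_extrinsic + beta * r_intrinsic."""

    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = Counter()

    def total_reward(self, state, r_extrinsic: float) -> float:
        self.counts[state] += 1
        # Assumed bonus form: shrinks as the visit count N(s) grows.
        r_intrinsic = 1.0 / math.sqrt(self.counts[state])
        return r_extrinsic + self.beta * r_intrinsic


if __name__ == "__main__":
    bonus = CountBonus(beta=0.5)
    # Same state visited repeatedly with zero extrinsic reward: the bonus decays.
    print([round(bonus.total_reward("room_1", 0.0), 3) for _ in range(5)])
    # A new state starts at the full bonus again.
    print(bonus.total_reward("room_2", 0.0))
```

With a very small beta the decaying term barely changes the total; with a very large one it dominates any extrinsic reward until counts grow, which is why both extremes tend to fail.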
What you should remember
- Three families: random, optimism, intrinsic motivation.
- Easy vs hard exploration is the dominant decision criterion.
- RND was the Montezuma’s Revenge breakthrough.
- Intrinsic motivation diminishes as the agent learns.
- Entropy regularization is mild exploration, not the answer for hard environments.