Cheatsheet: Exploration

| Family | Mechanism | Examples | Best for |
| --- | --- | --- | --- |
| Random | Take a random action with some probability, or maintain policy stochasticity | Epsilon-greedy, Boltzmann sampling, policy-entropy regularization | Easy exploration (dense or moderately sparse reward) |
| Optimism | Act as if uncertain Q-values were at upper confidence bound | UCB (bandits), UCRL (tabular MDPs), Bootstrapped DQN, NoisyNets, RLSVI | Easy-medium exploration; theoretically principled |
| Intrinsic motivation | Augment extrinsic reward with novelty/curiosity bonus | Count-based, ICM, RND | Hard exploration (sparse or long-sequence rewards) |
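
As one illustration of a novelty bonus, here is a minimal sketch of a count-based bonus of the common form beta / sqrt(N(s)), where N(s) is the visit count of state s. The class name `CountBonus`, the string state keys, and the value of `beta` are assumptions made for this example, not part of the cheatsheet.

```python
import math
from collections import defaultdict


class CountBonus:
    """Count-based novelty bonus: beta / sqrt(N(s)) (hypothetical helper)."""

    def __init__(self, beta=1.0):
        self.beta = beta                  # bonus scale, an illustrative value
        self.counts = defaultdict(int)    # N(s), keyed by a hashable state

    def __call__(self, state):
        # Count the visit first, so the first visit yields exactly beta.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])


bonus = CountBonus(beta=1.0)
rewards = [bonus("room_A") for _ in range(4)]   # bonus shrinks on each revisit
first_visit_elsewhere = bonus("room_B")         # a new state gets the full bonus
```

The bonus for a revisited state shrinks while an unseen state still receives the full bonus, which is the behaviour the intrinsic-motivation family relies on.
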

Easy vs hard exploration:

| Property | Easy exploration | Hard exploration |
| --- | --- | --- |
| Reward density | Dense or moderately sparse | Reached only by long specific action sequences |
| Random-exploration finds reward? | Yes, within training budget | No, exponentially unlikely |
| Examples | CartPole, Pong, Breakout, MuJoCo locomotion | Montezuma’s Revenge, Pitfall, maze with terminal reward, robot manipulation |
| Recommended family | Random or optimism | Intrinsic motivation (sometimes plus demonstrations) |
| Best benchmark result | epsilon-greedy DQN or PPO | RND-augmented PPO |

Random Network Distillation (RND) uses two networks:

  • Target network: fixed random initialization, never trained
  • Predictor network: trained to match the target on agent-visited states

Intrinsic reward at state s:

r_intrinsic(s) = || predictor(s) - target(s) ||²

Novel states have high prediction error (predictor never trained on them); visited states have low error after training. The fixed-random target is the trick: the predictor can only match via training, which only happens at visited states.

Total reward optimized:

r_total = r_extrinsic + beta · r_intrinsic
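
The two formulas above can be sketched in NumPy. This is a toy illustration under stated assumptions, not the setup from the RND paper: the target is a small fixed random tanh network, the predictor is a linear model trained by plain gradient descent (RND typically uses neural networks for both), and the state distributions, `beta`, and `r_extrinsic` are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN_DIM, OUT_DIM = 4, 16, 3

# Target network: fixed random weights, never trained.
A = rng.normal(0.0, 0.5, size=(STATE_DIM, HIDDEN_DIM))
B = rng.normal(0.0, 0.25, size=(HIDDEN_DIM, OUT_DIM))


def target(states):
    return np.tanh(states @ A) @ B


class Predictor:
    """Linear predictor (a simplification chosen to keep the sketch short)."""

    def __init__(self):
        self.W = np.zeros((STATE_DIM, OUT_DIM))
        self.c = np.zeros(OUT_DIM)

    def __call__(self, states):
        return states @ self.W + self.c

    def train_step(self, states, lr=0.5):
        # One gradient step on 0.5 * mean squared error against the target.
        err = self(states) - target(states)
        self.W -= lr * (states.T @ err) / len(states)
        self.c -= lr * err.mean(axis=0)


def intrinsic_reward(predictor, states):
    # r_intrinsic(s) = || predictor(s) - target(s) ||^2, one value per state.
    diff = predictor(states) - target(states)
    return np.sum(diff ** 2, axis=1)


visited = rng.normal(0.0, 0.5, size=(512, STATE_DIM))  # states the agent has seen
novel = rng.normal(3.0, 0.5, size=(512, STATE_DIM))    # states it has never seen

predictor = Predictor()
bonus_before = intrinsic_reward(predictor, visited).mean()
for _ in range(500):
    predictor.train_step(visited)                      # train on visited states only
bonus_visited = intrinsic_reward(predictor, visited).mean()
bonus_novel = intrinsic_reward(predictor, novel).mean()

beta = 0.1          # illustrative intrinsic-reward weight
r_extrinsic = 0.0   # illustrative sparse-reward step
r_total = r_extrinsic + beta * intrinsic_reward(predictor, novel[:1])[0]
```

In this sketch, training lowers the bonus on visited states, while states far from the training data keep a larger prediction error and therefore a larger bonus.
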

Optimism-based approaches:

| Approach | What it does | Setting |
| --- | --- | --- |
| UCB (Auer et al. 2002) | Pick the action maximizing empirical mean + sqrt(2 log(n) / n_a) confidence bonus (the UCB1 form; n is total pulls, n_a is pulls of action a) | Bandits, provably near-optimal regret |
| UCRL (Jaksch et al. 2010) | Plan against an optimistic MDP given confidence intervals on transitions/rewards | Tabular MDPs, near-optimal regret bound |
| Bootstrapped DQN (Osband et al. 2016) | Ensemble of Q-networks (heads) trained on different bootstrapped samples of the replay data; follow one sampled member per episode | Deep RL, approximate posterior |
| NoisyNets (Fortunato et al. 2018) | Learnable parameter-level noise on the network | Deep RL, parameter-noise exploration |
| RLSVI (Osband et al. 2014) | Randomized least-squares value iteration; Thompson-sampling-style | Tabular and approximate settings |
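
The UCB row can be made concrete with a short sketch of UCB1 on a Bernoulli bandit. The function name `ucb1`, the arm means, the step count, and the seed are assumptions chosen for illustration; the bonus used is the UCB1 form sqrt(2 ln t / n_a).

```python
import numpy as np


def ucb1(true_means, steps, seed=0):
    """Run UCB1 on a Bernoulli bandit and return per-arm pull counts."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k, dtype=int)   # n_a: pulls of each arm
    sums = np.zeros(k)                # total reward collected per arm
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1               # pull every arm once to initialise
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))   # optimism under uncertainty
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        sums[arm] += reward
    return counts


counts = ucb1([0.2, 0.5, 0.8], steps=2000)
```

Rarely pulled arms get a large bonus and are retried, so every arm keeps being sampled, but most pulls should end up on the arm with the highest mean.
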

Random exploration variants:

| Variant | Formula | Notes |
| --- | --- | --- |
| Epsilon-greedy | With prob epsilon, uniform random action; else greedy | DQN default; anneal epsilon |
| Boltzmann/softmax | Probability proportional to exp(Q / T) | Tunable concentration via T |
| Entropy regularization | Training loss adds -beta · H(policy) | PPO, SAC default; mild exploration |
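
The three variants in the table can be sketched in a few lines of NumPy. The function names, the example Q-values, and the temperatures are hypothetical; `regularized_loss` only shows where the -beta · H(policy) term enters and is not a full PPO or SAC loss.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon pick a uniform random action, else the greedy one.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def boltzmann_probs(q_values, temperature):
    # Probabilities proportional to exp(Q / T); subtract the max for stability.
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()


def entropy(probs):
    # Shannon entropy H(policy) in nats, ignoring zero-probability actions.
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))


def regularized_loss(policy_loss, probs, beta):
    # Entropy regularization: the training loss adds -beta * H(policy).
    return policy_loss - beta * entropy(probs)


rng = np.random.default_rng(0)
q = [1.0, 2.0, 0.5]
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)   # epsilon = 0 is purely greedy
sharp = boltzmann_probs(q, temperature=0.1)               # low T: near-greedy
flat = boltzmann_probs(q, temperature=10.0)               # high T: near-uniform
```

Lowering T concentrates probability on the highest-valued action, while raising it approaches uniform sampling, which is the "tunable concentration" noted in the table.
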

Choosing a method:

  1. Is the reward dense? Yes → random exploration is enough.
  2. Is the reward moderately sparse? Random first; try Bootstrapped DQN or NoisyNets if random underperforms.
  3. Is the reward extremely sparse / long-sequence? Jump to RND or ICM. Consider supplementing with demonstrations.
  4. Continuous high-dimensional actions? RND with PPO is a common starting point.
  5. Tabular small state space? Optimism-based methods (UCB, UCRL) are theoretically clean.
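
The checklist above can be written down as a small lookup helper. This is only a sketch of one way to encode it: the function name `recommend_exploration`, its argument names, and the choice to return every matching suggestion (rather than stopping at the first) are assumptions made for this example.

```python
def recommend_exploration(reward_density,
                          continuous_high_dim_actions=False,
                          small_tabular=False):
    """Map the checklist questions to suggested exploration methods.

    reward_density: 'dense', 'moderately_sparse', or 'extremely_sparse'.
    """
    by_density = {
        "dense": "random exploration",
        "moderately_sparse": "random first, then Bootstrapped DQN or NoisyNets",
        "extremely_sparse": "RND or ICM, possibly with demonstrations",
    }
    if reward_density not in by_density:
        raise ValueError(f"unknown reward density: {reward_density!r}")

    suggestions = [by_density[reward_density]]
    if continuous_high_dim_actions:
        suggestions.append("RND with PPO")
    if small_tabular:
        suggestions.append("optimism-based methods (UCB, UCRL)")
    return suggestions


hard_case = recommend_exploration("extremely_sparse",
                                  continuous_high_dim_actions=True)
```
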

Common pitfalls:

  • Treating epsilon-greedy as the automatic default (it fails on hard exploration)
  • Conflating exploration regimes (random and intrinsic motivation are not interchangeable)
  • Under-tuning intrinsic-reward weight beta (both extremes fail)
  • Forgetting that the intrinsic bonus shrinks as states become familiar (this is by design, not a failure)
  • Treating entropy regularization as real exploration on hard environments

Key takeaways:

  • Three families: random, optimism, intrinsic motivation.
  • Easy vs hard exploration is the dominant decision criterion.
  • RND was the Montezuma’s Revenge breakthrough.
  • Intrinsic motivation diminishes as the agent learns.
  • Entropy regularization is mild exploration, not the answer for hard environments.