Exploration strategies: cheatsheet

Three exploration families

Family	Mechanism	Examples	Best for
Random	Take a random action with some probability, or maintain policy stochasticity	Epsilon-greedy, Boltzmann sampling, policy-entropy regularization	Easy exploration (dense or moderately sparse reward)
Optimism	Act as if uncertain Q-values were at their upper confidence bound, or sample them from an approximate posterior	UCB (bandits), UCRL (tabular MDPs), Bootstrapped DQN, NoisyNets, RLSVI	Easy-medium exploration; theoretically principled
Intrinsic motivation	Augment extrinsic reward with a novelty/curiosity bonus	Count-based bonuses, ICM (Intrinsic Curiosity Module), RND (Random Network Distillation)	Hard exploration (sparse or long-sequence rewards)

Easy vs hard exploration

Property	Easy exploration	Hard exploration
Reward density	Dense or moderately sparse	Reached only by long specific action sequences
Random exploration finds reward?	Yes, within training budget	No; success probability shrinks exponentially with the required sequence length
Examples	CartPole, Pong, Breakout, MuJoCo locomotion	Montezuma’s Revenge, Pitfall, mazes with terminal-only reward, sparse-reward robot manipulation
Recommended family	Random or optimism	Intrinsic motivation (sometimes plus demonstrations)
Representative strong method	epsilon-greedy DQN or PPO	RND-augmented PPO
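
Why random exploration fails on hard tasks can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, a uniform random policy that must emit one specific action sequence to reach the reward; the function name and the numbers (4 actions, lengths 5 to 50) are hypothetical.

```python
def prob_random_success(num_actions: int, seq_len: int) -> float:
    """Chance that a uniform random policy emits one specific action sequence."""
    return (1.0 / num_actions) ** seq_len


# The probability drops by a factor of num_actions for every extra required step.
for seq_len in (5, 20, 50):
    p = prob_random_success(num_actions=4, seq_len=seq_len)
    print(f"required length {seq_len:>2}: p = {p:.3e}")
```

Real environments are more forgiving than this worst case (several sequences may reach the reward), but the exponential trend is what separates the two columns.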

RND, Random Network Distillation (the breakthrough)

Two networks:

Target network: fixed random initialization, never trained
Predictor network: trained to match the target on agent-visited states

Intrinsic reward at state s:

r_intrinsic(s) = || predictor(s) - target(s) ||²

Novel states have high prediction error because the predictor was never trained on them; visited states have low error after training. The fixed random target is the trick: the predictor can match it only through training, and training happens only on visited states.

Total reward optimized:

r_total = r_extrinsic + beta · r_intrinsic
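
The two formulas above can be sketched in a few lines of NumPy. This is a toy illustration under loud assumptions, not the published implementation: the class name `ToyRND`, the layer sizes, the single affine predictor layer, the learning rate, and the Gaussian "visited" and "novel" states are all hypothetical, and practical RND setups typically also normalize observations and intrinsic rewards.

```python
import numpy as np


class ToyRND:
    """Toy RND: frozen random target, trainable predictor (assumed sizes)."""

    def __init__(self, obs_dim=4, hidden=16, embed=8, seed=0):
        rng = np.random.default_rng(seed)
        # Target network: random weights, never updated.
        self.A = rng.normal(size=(obs_dim, hidden))
        self.B = rng.normal(size=(hidden, embed)) / np.sqrt(hidden)
        # Predictor network: one affine layer (a simplification for this sketch).
        self.W = np.zeros((obs_dim, embed))
        self.b = np.zeros(embed)

    def target(self, s):
        return np.tanh(s @ self.A) @ self.B

    def predictor(self, s):
        return s @ self.W + self.b

    def intrinsic_reward(self, s):
        """r_intrinsic(s) = || predictor(s) - target(s) ||^2, one value per state."""
        return np.sum((self.predictor(s) - self.target(s)) ** 2, axis=-1)

    def update(self, states, lr=0.05):
        """One gradient step on the mean squared prediction error."""
        err = self.predictor(states) - self.target(states)  # (n, embed)
        self.W -= lr * 2.0 * states.T @ err / len(states)
        self.b -= lr * 2.0 * err.mean(axis=0)


rng = np.random.default_rng(1)
visited = rng.normal(loc=1.0, scale=0.3, size=(256, 4))   # states the agent has seen
novel = rng.normal(loc=-1.0, scale=0.3, size=(256, 4))    # states it has never seen

rnd = ToyRND()
for _ in range(3000):
    rnd.update(visited)  # the predictor is trained only on visited states

r_visited = rnd.intrinsic_reward(visited).mean()
r_novel = rnd.intrinsic_reward(novel).mean()
print(f"mean bonus on visited states: {r_visited:.4f}")
print(f"mean bonus on novel states:   {r_novel:.4f}")

# Total reward for one transition, with an assumed weight beta.
beta, r_extrinsic = 0.1, 0.0
r_total = r_extrinsic + beta * rnd.intrinsic_reward(novel[:1])[0]
```

The bonus on the novel states should come out well above the bonus on the visited ones, which is the signal the policy is rewarded for chasing.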

Optimism-based mechanisms

Approach	What it does	Setting
UCB1 (Auer et al. 2002)	Pick the action maximizing empirical mean + sqrt(2 · log(n) / n_a), where n is the total number of pulls and n_a the number of pulls of action a	Bandits, provably near-optimal regret
UCRL (Jaksch et al. 2010)	Plan against optimistic MDP given confidence intervals on transitions/rewards	Tabular MDPs, near-optimal regret bound
Bootstrapped DQN (Osband et al. 2016)	Ensemble of Q-heads, each trained on its own bootstrapped subset of the replay data; sample one head per episode and act greedily on it	Deep RL, approximate posterior
NoisyNets (Fortunato et al. 2018)	Learnable parameter-level noise on the network	Deep RL, parameter-noise exploration
RLSVI (Osband et al. 2014)	Randomized least-squares value iteration; Thompson-sampling-style	Tabular and approximate settings
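
As a concrete instance of the optimism rows above, here is a minimal UCB1 sketch on a Bernoulli bandit. The arm probabilities, horizon, and function names are hypothetical, chosen only for illustration; the bonus is the standard UCB1 term sqrt(2 log t / n_a).

```python
import math
import random


def ucb1_select(counts, means, t):
    """Return the arm with the highest upper confidence bound at round t."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # play every arm once before using the bound
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run_bandit(true_probs, horizon=5000, seed=0):
    """Run UCB1 on a Bernoulli bandit and return per-arm pull counts."""
    rng = random.Random(seed)
    k = len(true_probs)
    counts, means = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts


counts = run_bandit([0.2, 0.5, 0.8])
print("pulls per arm:", counts)
```

The bonus shrinks as an arm is pulled more often, so rarely tried arms keep getting revisited until their estimates are trustworthy, and most pulls end up on the best arm.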

Random exploration variants

Variant	Formula	Notes
Epsilon-greedy	With prob epsilon, uniform random action; else greedy	DQN default; anneal epsilon
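Boltzmann/softmax	P(a) proportional to exp(Q(s, a) / T)	Temperature T sets concentration: high T approaches uniform, low T approaches greedy
Entropy regularization	Training loss adds -c · H(policy), with entropy coefficient c > 0 (distinct from the intrinsic-reward weight beta)	Default in PPO and SAC; mild exploration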
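
The variants in the table can be sketched with NumPy as follows. The function names and the example Q-values are hypothetical; the entropy helper only computes H(policy) for a discrete distribution and is not tied to any particular library's loss.

```python
import numpy as np


def epsilon_greedy(q, epsilon, rng):
    """With probability epsilon take a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))


def boltzmann_probs(q, temperature):
    """Action probabilities proportional to exp(Q / T)."""
    q = np.asarray(q, dtype=float)
    z = (q - q.max()) / temperature  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def entropy(p):
    """Shannon entropy H(p) in nats, skipping zero-probability actions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


rng = np.random.default_rng(0)
q = np.array([1.0, 2.0, 0.5])

greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)  # epsilon = 0 is purely greedy
p_hot = boltzmann_probs(q, temperature=10.0)             # close to uniform
p_cold = boltzmann_probs(q, temperature=0.1)             # close to greedy
print(greedy_action, p_hot.round(3), p_cold.round(3))
print("entropy hot vs cold:", entropy(p_hot), entropy(p_cold))
```

An entropy-regularized loss would subtract a small multiple of `entropy(policy_probs)` from the policy loss, which nudges the policy toward the high-temperature end of this spectrum.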

Decision rubric

Is the reward dense? Yes → random exploration is enough.
Is the reward moderately sparse? Random first; try Bootstrapped DQN or NoisyNets if random underperforms.
Is the reward extremely sparse / long-sequence? Jump to RND or ICM. Consider supplementing with demonstrations.
Continuous high-dimensional actions? RND with PPO is a common choice, since the RND bonus depends only on states and not on the action space.
Tabular small state space? Optimism-based methods (UCB, UCRL) are theoretically clean.

Common pitfalls

Treating epsilon-greedy as a safe default (it fails on hard exploration)
Conflating exploration regimes (random and intrinsic motivation are not interchangeable)
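Under-tuning the intrinsic-reward weight beta (too small and the bonus is ignored; too large and the agent chases novelty instead of extrinsic reward)
Forgetting that the intrinsic bonus diminishes as states become familiar (by design, not a failure)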
Treating entropy regularization as real exploration on hard environments

What you should remember

Three families: random, optimism, intrinsic motivation.
Easy vs hard exploration is the dominant decision criterion.
RND was the Montezuma’s Revenge breakthrough.
Intrinsic motivation diminishes as the agent learns.
Entropy regularization is mild exploration, not the answer for hard environments.