Skip to content

Lesson: DQN (replay buffer, target network, double Q-learning)

What you’ll be able to do after this lesson

Section titled “What you’ll be able to do after this lesson”

Lesson 6 named the deadly triad: function approximation + bootstrapping + off-policy data can diverge when all three combine, even though any two are safe. DQN is the engineering recipe that turned deep Q-learning from “diverges on Atari” into “human-level play across 49 games” (Mnih et al., Nature 2015). The recipe has three patches, one per triad leg.

By the end of this lesson you can:

  • Name each DQN patch and the triad leg it weakens.
  • Explain why the replay buffer lets a non-i.i.d. data stream behave closer to i.i.d. for the gradient update.
  • Explain why the target network stops the bootstrapped TD target from chasing itself.
  • Compute the max-overestimation bias on a small example (three actions, unit-variance noise, true Q equal to 0) and verify the closed-form result: the expected maximum is about 0.846, which equals three divided by two times the square root of pi.
  • Show how double Q-learning decouples action selection from action evaluation and eliminates the bias when the two networks are independent.
  • Recognize when a value-based system needs each patch (and when one is enough).

From Lesson 6, the three things that combine into divergence are:

  1. Function approximation (FA). A neural network Q-theta generalizes. One gradient update for one state-action pair shifts Q-theta at many other state-action pairs.
  2. Bootstrapping (BS). The bootstrapped target, the reward plus gamma times the maximum over next actions of Q-theta, uses Q-theta itself. The target moves as theta updates.
  3. Off-policy data (OP). Transitions come from a different distribution than the greedy policy’s distribution. In Q-learning this is unavoidable: the target uses the max over next actions regardless of how the data was collected.

The three DQN engineering tricks attack one leg each. Look at them in order.

The architecture from Mnih et al. (2015): input is a stack of four consecutive 84 by 84 grayscale Atari frames (the four-frame stack makes the input Markov, per the discussion in Lesson 3). Three convolution layers (32 filters 8 by 8 stride 4; 64 filters 4 by 4 stride 2; 64 filters 3 by 3 stride 1), then a fully connected layer of 512 units, then a linear output layer with one Q-value per discrete action (typically 4 to 18 actions per game).

The objective is the squared TD error from Lesson 6:

L(θ) = E_{(s, a, r, s') ~ D} [ ( r + γ · max_{a'} Q_{θ⁻}(s', a') - Q_θ(s, a) )² ]

with two annotations that turn this into a working algorithm:

  • D is a replay buffer of past transitions, not the most recent episode.
  • The target uses Q-theta-minus, a target network, with theta-minus periodically copied from theta but otherwise frozen.

Take each in turn.

Trick 1: replay buffer (patches the OP leg)

Section titled “Trick 1: replay buffer (patches the OP leg)”

When you collect transitions from one episode, consecutive samples are wildly correlated: the agent is in similar states, taking similar actions, seeing similar rewards. Gradient updates on highly correlated mini-batches are unstable for the same reason batch SGD on a tiny sorted dataset is unstable. You overfit the recent slice and forget the rest.

The replay buffer is a large circular buffer (the 2015 DQN used D = 1,000,000 transitions). At each training step:

  1. Take one environment step. Push the new transition (the state, action, reward, and next state) into the buffer.
  2. Sample a mini-batch of 32 transitions uniformly from the buffer.
  3. Compute the TD-error loss on the mini-batch and take one gradient step.

Why it helps the off-policy leg: the buffer holds transitions from many past policies (the network’s theta at the time was different). Sampling uniformly from this mixture is closer to sampling from a stationary distribution than sampling from the most recent on-policy trajectory. The transitions are also approximately decorrelated, because a uniform draw from a million transitions is unlikely to pick two consecutive frames.

Two costs: memory (a million transitions of 84 by 84 by 4 byte arrays is roughly 7 GB at byte precision) and sample efficiency (early transitions are stale, the policy that generated them no longer matches the policy being learned).

Later refinements: prioritized replay (Schaul et al., 2016) samples high-TD-error transitions more often; n-step replay stores multi-step transitions, the state and action followed by the next n rewards and the state n steps later, for lower-bias bootstrapping. Both are in the “Rainbow” combined-improvements paper.

Trick 2: target network (patches the BS leg)

Section titled “Trick 2: target network (patches the BS leg)”

If the bootstrapped target, the reward plus gamma times the maximum over next actions of Q-theta, uses the same theta you are updating, the loss surface itself moves under your feet. Imagine the cartoon failure mode: you nudge Q-theta up at one state-action pair; that nudge increases Q-theta at neighboring states because the network generalizes; the new target for the next batch is higher, so the next update pushes Q-theta higher still; and so on, into divergence.

The fix is to evaluate the target with a separate, frozen copy of the network. Maintain theta-minus, a snapshot of theta taken at the most recent target update step. Compute the target with theta-minus. Update theta against this target. Every C gradient steps (DQN 2015 used C equal to 10,000), copy theta into theta-minus.

For C steps the target is fixed, so the regression is back to a standard supervised problem (constant targets). Then the target shifts in one discrete update. The bootstrapping is still there, but the runaway feedback loop is broken.

Later refinements: Polyak averaging (the target network is nudged a small fraction of the way toward theta on every step, with a small averaging coefficient tau, used in DDPG and SAC) updates the target network gradually instead of in jumps. Both work; the discrete-copy version was DQN’s choice.

Trick 3: double Q-learning (patches the max overestimation bias)

Section titled “Trick 3: double Q-learning (patches the max overestimation bias)”

The third trick fixes a different problem that the first two do not. The max operator in the TD target is biased upward when the Q estimate is noisy. Even if Q-theta is unbiased for each individual state-action pair, taking the max over actions systematically overestimates Q-star.

Consider a single state with three actions and true Q-star equal to 0 for all three. The network estimate is noisy: the estimate Q-hat equals 0 plus a per-action noise term epsilon-a, with each epsilon-a drawn independently from a standard normal.

The single-network max target is:

max_{a} Q̂(s, a) = max(ε_1, ε_2, ε_3)

This is the maximum of three independent standard normals, an order statistic with a known closed form:

E[max_{a} Q̂(s, a)] = E[max of 3 iid N(0,1)] = 3 / (2√π) ≈ 0.8463

So even with zero true value and zero-mean noise, the max returns about +0.85 in expectation. With gamma = 0.99 this bias propagates: the next target is also too high, the next-next target is too high, and Q-theta drifts up. The 2015 DQN saw this empirically; van Hasselt et al. (2016) measured it on Atari and showed many games where DQN’s learned Q values exceed the maximum possible discounted return for the game.

Use two value functions, Q-theta (online network) and Q-theta-minus (target network). Decouple action selection from action evaluation:

  • Original DQN target: the reward plus gamma times the target network’s value at the next state, with the target network both picking and evaluating the best next action.
  • Double DQN target: the reward plus gamma times the target network’s value at the next state, but the best next action is picked by the online network and only evaluated by the target network.

In the idealized case where the two networks have independent noise (call the noise epsilon-theta and epsilon-theta-minus):

a* = argmax_a Q̂_θ(s', a) = argmax_a ε_{θ,a}

This selected action is determined by epsilon-theta alone, independent of epsilon-theta-minus. So the evaluated value is:

Q̂_{θ⁻}(s', a*) = 0 + ε_{θ⁻, a*}

Because the selected action is chosen independently of epsilon-theta-minus, the expected target-network noise at that action is zero. The bias is gone.

In practice the two networks share weights (target is a lagged copy), so independence is an idealization, not reality. The bias is reduced, not zeroed. Van Hasselt et al. (2016) showed this improvement is substantial on Atari: the same network architecture and training procedure as DQN, only the target rule changes, and the median normalized score across the 49-game benchmark goes up materially. (The paper’s claim is “approximately 17% improvement” by their normalization; the exact number depends on the metric.)

The DQN training loop:

Initialize Q_θ (random), target network Q_{θ⁻} ← Q_θ, replay buffer D (empty)
For each episode:
s ← environment.reset()
For each step:
With probability ε, a ← random action; else a ← argmax_a Q_θ(s, a)
s', r, done ← environment.step(a)
Push (s, a, r, s', done) into D
Sample mini-batch of B transitions from D
Compute target:
y = r + γ · (1 - done) · Q_{θ⁻}(s', argmax_{a'} Q_θ(s', a')) [double DQN]
Compute loss: L(θ) = (Q_θ(s, a) - y)²
Take one gradient step on L(θ)
Every C steps: θ⁻ ← θ
s ← s'

Each line of the inner loop has a job. The replay buffer mixes old and new transitions to approximate i.i.d. (weakens OP harm). The target network freezes the bootstrapped target so the regression has a fixed objective for C steps (patches BS). The split between online-network argmax and target-network evaluation removes the max overestimation (patches the related bias).

Function approximation (the FA leg of the triad) is the one ingredient you cannot patch away; it is the whole point of using a neural network. The other two patches are what make FA safe in this setting.

Before DQN, deep RL was a research curiosity that worked on toy problems. The 2015 DQN paper trained a single neural-network architecture (no per-game tuning) on 49 Atari 2600 games from raw pixels using the Atari Learning Environment (Bellemare et al., 2013). The result: median performance across the 49 games at roughly the level of a professional human game tester, with super-human performance on several games (Breakout, Pong, Boxing).

The hyperparameters were the same across all 49 games:

  • 1M-transition replay buffer
  • Target network copy every 10,000 steps
  • Mini-batch 32, RMSProp optimizer
  • Discount gamma = 0.99
  • epsilon linearly annealed from 1.0 to 0.1 over the first 1M frames, then fixed
  • 4-frame stack as input; 4-frame action repeat (decisions every 4 frames)
  • Reward clipping to the range minus 1 to plus 1 (the same network output range works across games with vastly different reward scales)
  • 50M frames of training per game (about 38 days of game time, hours of wall-clock on a GPU)

The reward-clipping detail matters and is often skipped in summaries: it forces all games into a common scale so the value function does not have to learn 1000-fold magnitude differences between, say, Pong (plus or minus 1 per point) and Q-bert (+25, +100, +200 per kill). A real production system would not clip; you would scale rewards adaptively or use a distributional Q-network (the C51 / IQN family).

Rainbow (Hessel et al., 2018) combined six refinements on top of DQN (double Q, dueling networks, prioritized replay, multi-step returns, distributional Q, noisy nets) and produced the strongest single-agent Atari numbers from a value-based method for several years.

  • Skipping the target network and expecting it to “just work.” The runaway feedback loop in the bootstrapped target is the second-most-common cause of DQN divergence (after forgetting the replay buffer). Modern open-source implementations include both by default; if you write your own from scratch, do not omit them.
  • Setting C too small. A target update frequency of C = 100 is too aggressive on most problems; the target hardly stays fixed long enough to provide a stable regression objective. The 2015 DQN’s C = 10,000 is in the right ballpark for Atari; on simpler problems you can drop to a few hundred, but always profile.
  • Confusing double Q-learning with double DQN. Hado van Hasselt’s original “double Q-learning” (NeurIPS 2010) used two independent online networks updated on alternating batches. Double DQN (van Hasselt, Guez, Silver, 2016) is the practical variant for DQN that reuses the existing target network (online net picks the action, target net evaluates). Most modern systems mean double DQN when they say “double Q-learning.”
  • Believing the max overestimation is small. With 18 actions (Atari max) and unit-variance noise, the max-of-18 bias is roughly E[max of 18 iid N(0,1)] ≈ 1.82. That is a large bias that compounds across bootstrapping. The closed-form numbers explain why double DQN matters even when individual estimates look “close to right.”
  • Assuming DQN’s tricks transfer to all off-policy actor-critic methods. They mostly do (DDPG uses replay buffer + target network; SAC adds entropy regularization on top). But the specific Atari hyperparameters do not transfer: continuous-control DDPG typically uses a smaller buffer and a softer target update (Polyak averaging instead of discrete copies).

DQN was the existence proof. Before 2015, the dominant view was that deep neural networks and reinforcement learning could not be combined safely at scale. Mnih et al. shipped a recipe (replay + target + double-Q, plus the engineering details around frame stacking and reward clipping) that worked. AlphaGo’s value network (2016) used the same DQN engineering ethos: replay buffer, target network, off-policy training from games against earlier versions of itself. AlphaZero (2017) and AlphaStar (2019) inherited the engineering, even though their algorithm was different (Monte Carlo tree search plus learned policy/value).

Modern policy-gradient methods like PPO (Lesson 8) do not need DQN’s full stack because they keep their data near on-policy: the on-policy constraint weakens the OP leg of the triad enough that you do not need a replay buffer or a target network. The trade-off is sample efficiency: PPO needs to keep collecting fresh trajectories, while DQN can re-train on a buffer for many gradient steps per environment step.

The recipe also explains why off-policy methods dominate when data is precious. RLHF training of large language models (Lesson 13) draws on the off-policy lineage: the reward model is trained on labeled comparisons collected once; the policy is then optimized against that fixed reward model, which is a form of off-policy improvement (the reward model came from comparisons of past policies). The DQN engineering does not apply directly to RLHF (the action space is the entire vocabulary, the argmax is infeasible), but the off-policy mindset is the same.

  • DQN = Q-learning + three engineering patches, one per problem identified in Lesson 6.
  • Replay buffer: a circular buffer of past transitions. Sampling uniformly approximates i.i.d. data and decorrelates consecutive frames. Patches the OP leg of the triad.
  • Target network: a frozen copy Q-theta-minus of the online network, used only for computing TD targets, refreshed every C steps. Patches the BS leg by giving the regression a fixed objective for C steps.
  • Double Q-learning: action selection uses online network Q-theta; action evaluation uses target network Q-theta-minus. Eliminates the max overestimation bias in the idealized independent case; reduces it substantially in practice. On the three-action toy example with unit-variance noise, the bias drops from approximately 0.85 to 0.
  • The Atari benchmark: 49 games, one architecture, one hyperparameter set, median performance at professional human-tester level. This was the deep-RL existence proof.
  • What you cannot patch away: function approximation. That is the whole point of using a neural network. The other two patches are what make FA safe.

Next lesson: PPO, the workhorse of modern RL. Same goals (stable updates with function approximation) achieved by staying near-on-policy and clipping how much the policy can change per step. The contrast with DQN is the right way to understand why both algorithms exist.