Offline RL, the problem

Every algorithm in this track so far has assumed the agent can act in the environment. REINFORCE samples trajectories; PPO collects new rollouts after every policy update; DQN gathers experience in a replay buffer while it learns; model-based RL plans against a learned simulator it can query for free. The agent acts, the environment responds, and the data the policy learns from is generated by the policy itself or by its near-relatives.

In a large class of real-world settings, this assumption is broken. A hospital has years of treatment records but cannot run a randomized policy on patients to collect a better dataset. An industrial plant has process logs but cannot let an experimental controller perturb production. A recommender system has billions of user interactions but most product teams will not let a new policy run online without offline validation first. A robot platform has demonstration data but live data collection on the physical robot is expensive and slow. In all of these, you have a fixed dataset of past interactions and no way to gather more. The question of this lesson and the next is whether you can still extract a useful policy from that dataset, and if so, how.

The setting where the answer is yes (carefully) is called offline reinforcement learning (sometimes batch RL). The naive answer, “just run Q-learning on the dataset”, turns out to be catastrophically wrong. This lesson names the failure mode. The next lesson covers the algorithms (BCQ, CQL, IQL) that fix it.

The offline RL setting, precisely

Suppose some past policy, call it the behavior policy, was deployed and produced a dataset of transitions:

D = { (s_i, a_i, r_i, s'_i) }  for i = 1, ..., N

Each tuple says: at some logged state, the behavior policy chose an action, and the environment returned a reward and a next state. The behavior policy itself may not be known explicitly (the data may come from a mix of human operators, an older controller, a logged web service). What is fixed is the dataset.

The task: produce a policy that achieves higher expected return than the behavior policy, using only the dataset, with no further environment interaction.
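As a minimal sketch of this setting, assuming a small discrete problem, the dataset can be held as a fixed list of transition tuples that the learner may only sample from. The names Transition and sample_batch, and the use of None to mark a terminal next state, are assumptions of this sketch rather than part of any particular library.

```python
import random
from typing import List, NamedTuple, Optional


class Transition(NamedTuple):
    state: int
    action: int
    reward: float
    next_state: Optional[int]  # None marks a terminal transition (assumption of this sketch)


def sample_batch(dataset: List[Transition], batch_size: int, rng: random.Random) -> List[Transition]:
    """Draw a minibatch with replacement from the fixed dataset.

    This is the learner's only access to the environment: there is no step() call.
    """
    return [rng.choice(dataset) for _ in range(batch_size)]


# A tiny hypothetical log produced by some unknown behavior policy.
dataset = [
    Transition(0, 0, 0.0, 1),
    Transition(0, 1, 0.0, 0),
    Transition(1, 0, 1.0, None),
]
rng = random.Random(0)
batch = sample_batch(dataset, 8, rng)
```

Everything an offline algorithm does has to be expressed in terms of calls like sample_batch; nothing in the interface lets it ask what would have happened under a different action.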

This is a strictly harder problem than standard off-policy RL. Off-policy RL (DQN is the textbook example) generates data from one policy (often a near-current policy plus exploration noise) and trains a different target policy. The data distribution can drift, but new data arrives every step and corrects whatever errors the value function accumulates. In offline RL there is no new data. Whatever errors enter the value function stay there. And the errors entering the value function are, as you will see, not small.

Why off-policy methods seem like the obvious answer

Q-learning is off-policy: the Bellman update

Q(s, a)  <-  r + gamma · max over a' of Q(s', a')

does not assume the action a was generated by the current policy. The update propagates value information from (s', a') back to (s, a) for any sampled transition, regardless of which policy generated it. So in principle, you could run Q-learning on a fixed offline dataset, learn a Q-function, and act greedily on it.

This is the standard first try. It does not work. The reason is the max operator, applied at out-of-distribution actions.
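To make the role of the max concrete, here is a minimal tabular sketch of one Q-learning step on a single logged transition. The function name q_update, the learning rate alpha, and the None convention for terminal next states are assumptions of this sketch. The point to notice is that the max ranges over every action at s', whether or not the dataset ever recorded that action there.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, gamma=0.9, alpha=0.5):
    """Apply one Q-learning step for the logged transition (s, a, r, s_next).

    s_next=None marks a terminal transition, so there is nothing to bootstrap from.
    Returns the Bellman target that was used.
    """
    bootstrap = 0.0 if s_next is None else np.max(Q[s_next])  # max over ALL actions at s'
    target = r + gamma * bootstrap
    Q[s, a] += alpha * (target - Q[s, a])
    return target


Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]  # suppose action 1 at state 1 was never logged, yet holds the larger value
target = q_update(Q, s=0, a=0, r=0.0, s_next=1)
```

Here the target is built from the value 3.0 of the unlogged action, not from the value 1.0 of the action the data actually supports.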

The failure mode: extrapolation error

A neural-network Q-function approximates Q(s, a) for any state and any action. For states and actions that appear in the dataset, the network learns Q-values driven by the data. For state-action pairs absent from the dataset, the network produces values by extrapolation: it generalizes from nearby (state, action) inputs to outputs whose accuracy depends entirely on the network’s inductive biases. The network does not know that those values are uninformed; it just outputs a number.

In standard online Q-learning this is fine. If the network extrapolates an inflated Q-value for some (s, a), the policy will try that action, the environment will return the actual (typically much lower) reward, and the Bellman update will correct the inflated estimate at the next visit. The correction is automatic because the agent can keep acting.

In offline Q-learning the correction never arrives. The Bellman target uses the max over actions at the next state:

target = r + gamma · max over a' of Q(s', a')

The max selects the action with the highest Q-value, including out-of-distribution actions where Q is uninformed. If the network happens to assign a high Q-value to an OOD action, the max picks it. The high (uninformed) value propagates backward through the Bellman update into Q(s, a), and the next iteration of the update uses this inflated value at the previous step, and so on. Each iteration pushes the Q-function upward. Over many iterations the Q-values diverge.

This is called extrapolation error in the offline-RL literature. Three sources, all interacting:

Function approximation extrapolates without warning. A neural network does not say “I do not know this action”; it produces a value.
The max operator is biased toward overestimates. Whatever variance the network has, the max picks the upper tail. This is the same overestimation bias that motivated Double DQN, but now compounded across a fixed dataset where corrections cannot arrive.
Bellman propagation amplifies the error. Inflated values at one state become Bellman targets at the previous state, then the previous, then the previous.
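The second source can be checked numerically. The sketch below assumes ten actions that are all truly worth 0 and adds zero-mean Gaussian noise as a stand-in for approximation error; the noise model and the sizes are illustrative assumptions, not a claim about any real network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000
true_q = np.zeros(n_actions)  # every action is truly worth 0

# Each row is one set of noisy but unbiased Q estimates for the same state.
estimates = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

mean_single = estimates[:, 0].mean()        # near 0: any single estimate is unbiased
mean_of_max = estimates.max(axis=1).mean()  # clearly positive: the max selects the upper tail

print(f"mean estimate of one action: {mean_single:+.3f}")
print(f"mean of max over actions:    {mean_of_max:+.3f}")
```

Each individual estimate is unbiased, yet the maximum is biased upward by well over one noise standard deviation. A Bellman target built on that maximum inherits the bias even before any out-of-distribution action is involved.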

The end state of naive offline Q-learning is a Q-function with very large values on out-of-distribution actions and a greedy policy that prefers those actions. At deployment, the policy acts on inflated estimates and performs catastrophically worse than the behavior policy that generated the data.

A small numerical illustration

Make this concrete with a two-state, two-action MDP. States s1 and s2, actions a1 and a2. The true dynamics (which the learner does not know but we do, for ground truth):

At s1, action a1: reward 0, transition to s2 with prob 1
At s1, action a2: reward 0, transition to s1 with prob 1
At s2, action a1: reward 1, episode ends
At s2, action a2: reward -10, episode ends

The true optimal policy is “a1 at s1, a1 at s2”, and the true discounted value of s1 (gamma = 0.9) is 0.9 (one transition gets you to s2, where a1 pays 1 on the next step, discounted to 0.9).
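These ground-truth numbers can be reproduced with a few Bellman optimality backups on the true MDP. This is a sketch for checking the arithmetic only: it uses the true dynamics, which the offline learner never sees. The integer encoding (0 for s1 and a1, 1 for s2 and a2) and the None marker for episode end are assumptions of the sketch.

```python
import numpy as np

gamma = 0.9
# true_mdp[(state, action)] = (reward, next_state); None means the episode ends.
# States: 0 = s1, 1 = s2. Actions: 0 = a1, 1 = a2.
true_mdp = {
    (0, 0): (0.0, 1),
    (0, 1): (0.0, 0),
    (1, 0): (1.0, None),
    (1, 1): (-10.0, None),
}

Q = np.zeros((2, 2))
for _ in range(200):  # repeated Bellman optimality backups until the values settle
    for (s, a), (r, s_next) in true_mdp.items():
        bootstrap = 0.0 if s_next is None else Q[s_next].max()
        Q[s, a] = r + gamma * bootstrap

V = Q.max(axis=1)          # optimal state values
greedy = Q.argmax(axis=1)  # optimal action index per state
print(V, greedy)
```

The backups settle at V(s1) = 0.9 and V(s2) = 1, with a1 optimal in both states. Taking a2 at s1 is worth 0.81, since it only delays the same outcome by one step.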

The behavior policy that generated the dataset always picks a1 at s2 (it learned this part). At s1 it picks a1 80% of the time and a2 20% of the time. So the dataset contains transitions:

(s1, a1, 0, s2): about 80 of every 100 visits to s1
(s1, a2, 0, s1): about 20 of every 100 visits to s1
(s2, a1, 1, terminal): every episode that reached s2

Crucially, the dataset never contains (s2, a2, -10, terminal). Action a2 at s2 is out of distribution.

Now run Q-learning on this dataset. Initialize Q(s, a) = 0 for all (s, a). At an update that targets s1:

target = 0 + gamma · max( Q(s2, a1), Q(s2, a2) )

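If the function approximator happens to extrapolate Q(s2, a2) to a value above Q(s2, a1), which the data drives toward 1 (say it outputs 5, because of some inductive bias or because gradient steps on other pairs moved it there), the max picks a2 with value 5. The Q(s1, a1) update inflates to gamma · 5. With gamma = 0.9 that is 4.5. No logged transition ever uses (s2, a2) as its own regression target, so nothing in the dataset contradicts the inflated value, and it spreads: the update for (s1, a2) bootstraps from the max over a' of Q(s1, a') and inherits it too. With a shared network, further updates on the logged pairs can push the unconstrained output even higher. The Q-function grows.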

The greedy policy implied by this Q-function: prefer a2 at s2. At deployment, a2 at s2 returns reward -10 instead of the +1 that a1 would have returned; from s1, the discounted return is 0 + 0.9 · (-10) = -9 instead of the +0.9 a1-at-s2 would have produced. The “learned” policy is worse than the behavior policy by exactly the gap the extrapolation error opened up.
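The arithmetic above can be replayed in a few lines. This is a tabular stand-in, not a neural network: the extrapolated value is simply written into the table as Q(s2, a2) = 5, which is an assumption made to mimic what an unconstrained function approximator might output. The integer encoding of states and actions is likewise an assumption of the sketch.

```python
import numpy as np

gamma = 0.9
# Logged transitions only: (state, action, reward, next_state); None = terminal.
# States: 0 = s1, 1 = s2. Actions: 0 = a1, 1 = a2. The pair (s2, a2) never appears.
dataset = [(0, 0, 0.0, 1), (0, 1, 0.0, 0), (1, 0, 1.0, None)]

Q = np.zeros((2, 2))
Q[1, 1] = 5.0  # assumption: stands in for an uninformed extrapolated value

for _ in range(200):
    for s, a, r, s_next in dataset:
        bootstrap = 0.0 if s_next is None else Q[s_next].max()
        Q[s, a] = r + gamma * bootstrap  # full backup (learning rate 1) for clarity

greedy = Q.argmax(axis=1)

# Deployment under the TRUE rewards: the greedy action at s1 is a1, which reaches s2,
# where the greedy action then earns its true reward.
true_reward_s2 = {0: 1.0, 1: -10.0}
deployed_return = 0.0 + gamma * true_reward_s2[int(greedy[1])]
print(Q, greedy, deployed_return)
```

No sweep over the dataset ever touches Q(s2, a2), so it stays at 5 and Q(s1, a1) settles at 4.5. In this tabular stand-in the inflated value persists and propagates rather than growing; the growth described above needs a function approximator whose updates on logged pairs also move its outputs on unlogged ones.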

This is a toy. In a realistic dataset with high-dimensional state and dozens of actions, the OOD-action surface is enormous, and the extrapolation error is not contained.

Why this is harder than online off-policy RL

DQN is off-policy and uses the same Bellman max. Why does DQN work online but the same idea fail offline?

Three reasons:

In DQN the policy explores. Even an epsilon-greedy policy occasionally takes the inflated-Q-value action, the environment returns the actual reward, and the Bellman update at that transition pulls the Q-value back toward truth. The correction is mechanical.
In DQN the replay buffer keeps refreshing. Old transitions age out as new ones arrive from the current policy’s distribution. The training distribution tracks the policy.
In DQN inflation is bounded by the dynamics. Errors can grow for only a few iterations before the policy tries the action and the estimate is corrected, so persistent runaway is uncommon.

In offline RL all three of those correction channels are closed. The dataset is fixed. The policy cannot explore. Inflated Q-values for OOD actions stay inflated, and the Bellman propagation that amplifies them runs unchecked across the full training.
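The first correction channel can be sketched on the toy MDP from earlier in this lesson. The only change from the offline case is that the agent is allowed to act, so the code queries the true dynamics. The starting value Q(s2, a2) = 5 again stands in for an uninformed extrapolation; the integer encoding and the full-backup update (learning rate 1) are assumptions made to keep the sketch short.

```python
import numpy as np

gamma = 0.9
# True dynamics: (state, action) -> (reward, next_state); None means the episode ends.
# States: 0 = s1, 1 = s2. Actions: 0 = a1, 1 = a2.
true_mdp = {
    (0, 0): (0.0, 1),
    (0, 1): (0.0, 0),
    (1, 0): (1.0, None),
    (1, 1): (-10.0, None),
}

Q = np.zeros((2, 2))
Q[1, 1] = 5.0  # same inflated starting estimate as in the offline illustration

for episode in range(20):
    s = 0
    for _ in range(50):  # step cap so an episode cannot loop forever at s1
        a = int(Q[s].argmax())        # act greedily on the current estimates
        r, s_next = true_mdp[(s, a)]  # the environment answers with the true outcome
        bootstrap = 0.0 if s_next is None else Q[s_next].max()
        Q[s, a] = r + gamma * bootstrap
        if s_next is None:
            break
        s = s_next

greedy = Q.argmax(axis=1)
print(Q, greedy)
```

The very first episode tries a2 at s2 because its estimate is highest, observes the reward of -10, and overwrites the inflated value. Within a few episodes the greedy policy is a1 in both states. Offline, that first corrective step is exactly the one that can never be taken.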

The structural lesson: “off-policy” is not the same as “offline.” Off-policy methods assume the data distribution can drift but new data keeps arriving. Offline methods must operate when no new data ever arrives.

What about behavioral cloning?

Behavioral cloning, from the imitation-learning lesson, is offline by construction. It uses the logged actions as supervised labels and trains a policy that imitates the behavior policy. BC is safe in the sense that it does not invent Q-values for OOD actions because it does not learn Q-values at all. The trained policy stays inside the data distribution by construction.

The cost is the obvious one: BC cannot, in general, exceed the behavior policy’s performance. The imitation lesson established this. A behavior policy that is a sub-optimal human controller produces a BC policy that is at best a sub-optimal human controller.
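For a small discrete problem, behavioral cloning reduces to counting. The sketch below uses logged (state, action) pairs in roughly the proportions of this lesson's toy dataset; the exact counts are made up for illustration, and the names logged and bc_policy are hypothetical.

```python
from collections import Counter, defaultdict

# Logged (state, action) pairs; the counts are illustrative, not real data.
logged = [("s1", "a1")] * 80 + [("s1", "a2")] * 20 + [("s2", "a1")] * 100

counts = defaultdict(Counter)
for state, action in logged:
    counts[state][action] += 1

# Tabular behavioral cloning: the maximum-likelihood policy is the empirical
# action frequency at each state.
bc_policy = {
    state: {action: n / sum(c.values()) for action, n in c.items()}
    for state, c in counts.items()
}
print(bc_policy)
```

The cloned policy assigns no probability to a2 at s2 because that pair never appears in the log, which is the safety property. It also reproduces the 20% of wasted a2 choices at s1, which is the cost.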

The promise of offline RL is to do better: extract a policy from the dataset that improves on the behavior policy, by recognizing that the dataset contains information the behavior policy did not exploit. Two transitions through state s1 might show that action a1 gave reward 0 and action a2 gave reward 0, but that a1 leads to a region where reward 10 was eventually collected, while a2 leads to a region where reward -1 was collected. The BC policy mimics whichever action the behavior took; the offline-RL policy should reason about value and pick a1.

That reasoning step is exactly where extrapolation error bites, and it is the reason offline RL is harder than BC. The next lesson covers the algorithms that find a workable balance.

Why this matters when you use AI

The offline RL setting describes most deployment realities. Healthcare datasets: years of treatment-outcome logs, no new randomization allowed. Educational platforms: years of student interactions, no A/B testing on minors. Industrial control: years of plant-operation logs, no perturbing production. Recommender systems: production logs at scale, no live policy experiments. Robotics: demonstration datasets and offline-validated policy updates before any new policy goes near hardware. Language-model post-training: massive preference datasets curated once, no further human labeling per training step.

Each of these settings is one where naive off-policy RL diverges and where the offline-RL algorithms of the next lesson are the practical answer. Understanding what fails and why determines whether you reach for BC (safe but bounded), for an offline-RL algorithm (potentially better but only if the constraint or penalty is calibrated), or for a hybrid pipeline (an offline-trained policy that is then carefully evaluated and incrementally allowed to act online).

Common pitfalls

Conflating off-policy with offline. Off-policy means the training data was generated by a different policy than the one being trained, but new data keeps arriving from some policy (often the current one with exploration noise). Offline means no new data ever arrives. Off-policy methods like DQN handle the first but not the second.

Assuming a large dataset solves the problem. Bigger datasets help but do not eliminate the failure mode. The OOD-action surface scales with the action space, not the dataset size. A large dataset with rare actions still has rare-action regions where extrapolation error operates.
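For a discrete problem, a quick coverage check makes the point about dataset size concrete. The helper name uncovered_pairs and the string labels are assumptions of this sketch; the dataset is this lesson's toy log repeated 1000 times to imitate a "large" dataset.

```python
import itertools


def uncovered_pairs(dataset, states, actions):
    """Return the (state, action) pairs that never appear in the logged transitions."""
    seen = {(s, a) for s, a, _reward, _next_state in dataset}
    return [pair for pair in itertools.product(states, actions) if pair not in seen]


# The toy log, repeated to make the dataset 1000 times larger.
dataset = [
    ("s1", "a1", 0.0, "s2"),
    ("s1", "a2", 0.0, "s1"),
    ("s2", "a1", 1.0, None),
] * 1000

missing = uncovered_pairs(dataset, ["s1", "s2"], ["a1", "a2"])
print(len(dataset), missing)
```

Three thousand transitions later, (s2, a2) is still uncovered. More data from the same behavior policy adds more samples of the pairs it already visits; it does not add the pairs it avoids.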

Treating offline RL as supervised learning with reward. It is not. The Bellman update is still active. A supervised loss on (state, action) pairs is behavioral cloning. Offline RL learns a value function and uses it for policy improvement, which is what introduces the OOD-action problem.

Underestimating extrapolation error empirically. In simple environments naive offline Q-learning sometimes appears to work for a few hundred iterations before diverging. The divergence is not random failure; it is the same compounding mechanism that always operates, just slower in simple environments. Production systems that train longer expose it.
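One cheap diagnostic, offered here as a sketch rather than a complete monitoring scheme, relies on a standard fact: if every reward satisfies |r| <= r_max, no true discounted return can exceed r_max / (1 - gamma) in magnitude. A learned Q-value beyond that bound cannot be correct. The function names and the example values below are hypothetical.

```python
import numpy as np


def q_value_bound(r_max, gamma):
    """Largest magnitude any true discounted return can have when |r| <= r_max."""
    return r_max / (1.0 - gamma)


def exceeds_bound(q_values, r_max, gamma):
    """True if any learned Q-value is larger in magnitude than any true return could be."""
    return bool(np.max(np.abs(q_values)) > q_value_bound(r_max, gamma))


bound = q_value_bound(r_max=1.0, gamma=0.9)
within = exceeds_bound(np.array([0.9, 1.0, 4.5]), r_max=1.0, gamma=0.9)
beyond = exceeds_bound(np.array([0.9, 1.0, 250.0]), r_max=1.0, gamma=0.9)
print(bound, within, beyond)
```

The check is necessary, not sufficient. The value 4.5 in the first array passes it even though, with rewards capped at 1, it may already be badly inflated for a particular state. Logging the maximum learned Q-value against this bound over training at least catches the late divergence described above.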

Treating BC as an automatic baseline. BC is a baseline, but it is also the ceiling on what naive imitation can do, which makes it the bar to clear. An offline-RL algorithm that does not exceed BC on the same dataset is not adding value.

What you should remember

Offline RL fixes the dataset and forbids new data collection. It is the deployment-realistic setting for healthcare, recommender systems, industrial control, robotics demonstration learning, and language-model post-training.
Naive off-policy methods fail catastrophically in this setting. Q-learning’s max operator selects out-of-distribution actions where the function approximator extrapolates Q-values with no constraint. Inflated values propagate backward via Bellman updates, and with no environment feedback to correct them, the Q-function and the greedy policy diverge.
Off-policy is not offline. The DQN-style correction mechanism (act, observe true reward, update) is unavailable when no acting is allowed. The same algorithm that works online diverges offline.
Behavioral cloning is the safe-but-bounded alternative. BC stays inside the data distribution and is bounded by the behavior policy’s performance. Offline RL aspires to exceed BC by reasoning about value, which is exactly where extrapolation error bites.
The fix is constraint or penalty on out-of-distribution actions. The next lesson covers BCQ (action-set constraint), CQL (conservative Q penalty on OOD actions), and IQL (implicit Q-learning, sidesteps the max). All three address the same failure mode by different mechanisms.

The next lesson is about exactly those algorithms.