Offline RL: brief

What you will learn

You will state the offline RL setting (fixed dataset, no further environment interaction) and distinguish it cleanly from off-policy RL: off-policy is about WHO generated the data, offline is about WHETHER new data ever arrives. You will trace the precise mechanism by which naive Q-learning diverges in this setting: the Bellman max selects out-of-distribution (OOD) actions, the function approximator extrapolates inflated Q-values for those actions, and Bellman propagation amplifies the error with no environment feedback to correct it. You will see the failure in a two-state worked example where the greedy policy on the diverged Q-function is about 9.878 worse in expected discounted return than the data-generating behavior policy (γ=0.9; behavior return ≈ 0.878, greedy/diverged return = -9). You will recognize when behavioral cloning (BC) is the right offline baseline (safe but bounded by the behavior policy) and when an offline-RL algorithm with explicit OOD-action handling is justified, and you will leave with the agenda for the next lesson (BCQ, CQL, IQL as three families of fixes).

Where this fits

This is lesson 14 of Track 18 (Deep Reinforcement Learning), lesson 2 of Phase 3 (rl-frontiers). It opens the offline-RL pair (L14 problem definition; L15 algorithms). It builds on L7 DQN (off-policy Q-learning), L2 imitation learning (behavioral cloning as baseline), and L13 RLHF (which uses offline preference data but escapes the divergence trap via KL regularization).

Source

Berkeley CS285 (Sergey Levine, Fall 2023), lecture on Offline RL: Introduction. Canonical URL http://rail.eecs.berkeley.edu/deeprlcourse/. Lecture video at the standard YouTube playlist. The lesson follows the canonical setup from Levine, Kumar, Tucker, and Fu (2020) tutorial review and uses the extrapolation-error framing from Fujimoto, Meger, and Precup (2019) BCQ.

Phase advance

Phase 3 lesson 2 (phase_order: 2). L14 is followed by L15 (Offline RL: algorithms, BCQ / CQL / IQL), then L16 (Exploration), L17 (Multi-task and meta-RL), and L18 (Open problems, which closes Phase 3 and Track 18).

Lesson body (lesson.mdx)

Hook: every algorithm in T18 has assumed the agent can act; healthcare / recommender systems / industrial control / robotics demonstration / language-model post-training all break that assumption.
The offline RL setting precisely: fixed dataset of (s, a, r, s’) tuples from a behavior policy, no further interaction, goal is a policy that outperforms the behavior policy.
Why off-policy methods seem applicable: Q-learning’s Bellman update is off-policy by construction.
The failure mode: extrapolation error. The Bellman max selects OOD actions where the function approximator extrapolates uninformed (often inflated) values. Bellman propagation amplifies. With no environment feedback to correct, the Q-function diverges and the greedy policy prefers OOD actions.
Three sources of extrapolation error: function approximation extrapolates silently; max operator biased toward overestimates; Bellman propagation amplifies.
Two-state worked example: dataset never observes (s2, a2). Q(s2, a2) extrapolated to 5. After convergence Q(s1, a1) approaches 0.9·5 = 4.5; the deployed greedy policy prefers a2 at s2, so its discounted return from s1 is 0.9·(-10) = -9, versus 0.9·(+1) = +0.9 for taking the in-distribution action a1 at s2.
Why DQN works online but the same algorithm diverges offline: three correction channels open online (policy explores, replay buffer refreshes, inflation bounded by feedback timing) all closed offline.
Why BC is the natural baseline: BC stays in-distribution by construction; cost is bounded by behavior policy performance.
Common pitfalls (5): conflating off-policy with offline; assuming large dataset solves it; treating offline RL as supervised learning with reward; underestimating extrapolation error empirically; treating BC as automatic when it is the explicit baseline.
“What you should remember” with 5 bullets.
L15 setup: BCQ (action constraint), CQL (Q penalty), IQL (max sidestep).
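The two-state failure described in the body items above can be sketched in a few lines of tabular Python. This is an illustrative sketch, not the lesson's reference implementation. It assumes an MDP layout that the brief only implies: a1 at s1 moves to s2 with reward 0, a2 at s1 is a zero-reward self-loop, and both actions at s2 terminate (a1 with +1, a2 with -10). The "function approximator" is stood in for by a lookup that returns a fixed value of 5 for any pair the dataset never observed. All names (dataset, q_lookup, naive_offline_q_iteration) are hypothetical.

```python
GAMMA = 0.9
TERMINAL = "end"
ACTIONS = ("a1", "a2")

# Hypothetical logged transitions (s, a, r, s_next). (s2, a2) never appears.
# The s1 self-loop under a2 is an assumption, not stated in the brief.
dataset = [
    ("s1", "a1", 0.0, "s2"),
    ("s1", "a2", 0.0, "s1"),
    ("s2", "a1", 1.0, TERMINAL),
]

# Stand-in for extrapolation: unseen pairs get an uninformed, inflated value.
EXTRAPOLATED = 5.0


def q_lookup(q, s, a):
    """Return the learned value for seen pairs, the extrapolated one otherwise."""
    return q.get((s, a), EXTRAPOLATED)


def naive_offline_q_iteration(dataset, sweeps=3):
    """Synchronous Bellman backups over the fixed dataset, with a max over ALL actions."""
    q = {(s, a): 0.0 for s, a, _, _ in dataset}
    for _ in range(sweeps):
        new_q = dict(q)
        for s, a, r, s_next in dataset:
            if s_next == TERMINAL:
                target = r
            else:
                # The max is free to pick the out-of-distribution action.
                target = r + GAMMA * max(q_lookup(q, s_next, b) for b in ACTIONS)
            new_q[(s, a)] = target
        q = new_q
    return q


q = naive_offline_q_iteration(dataset)
greedy_at_s1 = max(ACTIONS, key=lambda a: q_lookup(q, "s1", a))
greedy_at_s2 = max(ACTIONS, key=lambda a: q_lookup(q, "s2", a))

# True rewards at s2 (from the brief): +1 for a1, -10 for a2.
TRUE_REWARD_S2 = {"a1": 1.0, "a2": -10.0}
deployed_return = GAMMA * TRUE_REWARD_S2[greedy_at_s2]

print(q[("s1", "a1")], greedy_at_s1, greedy_at_s2, deployed_return)
```

Under these assumptions Q(s1, a1) settles at 4.5 after the first sweep, the greedy policy picks a2 at s2, and the deployed return from s1 is -9. Nothing in the dataset can ever revise the value 5 assigned to (s2, a2), which is the point of the example.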

Practice (practice.mdx)

Two exercises plus six flashcards.

Setting classification (5 scenarios): hospital ICU records, robot in simulator, self-driving with epsilon-greedy exploration, recommender with logged A/B test data, industrial plant with logs + uncertain simulator. Classify each as online / off-policy with online interaction / offline and name the dominant safety concern. The recommender and ICU cases anchor the lesson’s healthcare-and-deployment framing. The industrial case is intentionally mixed to surface the offline-then-online pattern.
Q-value divergence trace: walk through three Bellman iterations on the two-state MDP from the body. Compute the diverged Q-function. Compare expected discounted return of (a) behavior policy (≈ 0.878, recursive solution accounting for the a2 self-loop) vs (b) greedy policy on diverged Q (-9). Make the gap quantitative (≈ 9.878).
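The return comparison in the second exercise can be checked numerically with a short sketch. It assumes the a2 self-loop sits at s1 with zero reward and that the behavior policy picks a1 at s1 with probability 0.8. The brief does not state that probability; it is chosen here because it reproduces the quoted ≈ 0.878. The function names are hypothetical.

```python
GAMMA = 0.9
P_A1_AT_S1 = 0.8  # assumed behavior probability of a1 at s1 (not given in the brief)


def behavior_return(gamma=GAMMA, p_a1=P_A1_AT_S1, r_s2_a1=1.0):
    """Solve V(s1) = p * gamma * r + (1 - p) * gamma * V(s1) for V(s1).

    With probability p the agent moves to s2 and collects r one step later;
    otherwise it self-loops at s1 with zero reward and tries again.
    """
    return p_a1 * gamma * r_s2_a1 / (1.0 - (1.0 - p_a1) * gamma)


def greedy_return(gamma=GAMMA, r_s2_a2=-10.0):
    """Greedy policy on the diverged Q: a1 at s1, then the unseen a2 at s2."""
    return gamma * r_s2_a2


v_behavior = behavior_return()
v_greedy = greedy_return()
gap = v_behavior - v_greedy

print(round(v_behavior, 3), v_greedy, round(gap, 3))
```

With these values the behavior return is 0.72 / 0.82 ≈ 0.878, the greedy return is -9, and the gap is ≈ 9.878, matching the figures quoted in the exercise.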

Six flashcards: offline vs off-policy distinction; failure mechanism (max + extrapolation); extrapolation error definition; why DQN online works but offline diverges; BC as baseline; preview of BCQ / CQL / IQL families.

Cheatsheet (cheatsheet.mdx)

Tables. Three-settings comparison (online / off-policy online / offline). Three sources of extrapolation error. Online correction channels closed offline. Worked example numbers. BC vs naive offline Q-learning vs offline RL comparison. Three-fix preview (BCQ / CQL / IQL with mechanism + what each constrains).

References (references.mdx)

Primary source: CS285. Problem definition and survey: Levine et al. 2020 tutorial review. Extrapolation error mechanism: Fujimoto et al. 2019 BCQ paper, Kumar et al. 2019 BEAR. Benchmarks: Fu et al. 2020 D4RL, Gulcehre et al. 2020 RL Unplugged. Real-world applications: Komorowski et al. 2018 sepsis, Chen et al. 2019 YouTube recommender, Kalashnikov et al. 2018 QT-Opt manipulation. BC baseline: Pomerleau 1989 ALVINN, Ross and Bagnell 2010. RLHF connection: Ouyang et al. 2022 InstructGPT (cited for the KL-regularization parallel to BCQ’s action constraint).

Editorial discipline

Stage 2 sweep: em/en dashes (0), inline math backticks in lesson.mdx outside fenced blocks (0), Greek letters in prose spelled out (gamma, epsilon, alpha as relevant; symbols only in fenced display blocks), placeholder comments present on brief, lowercase-letter convention in practice exercise enumeration.
§6 watch-zone: this lesson abuts the L13 RLHF policy-discussion territory (the InstructGPT citation in references is for the structural KL-regularization parallel, not for vendor advocacy or policy debate). Strictly technical framing throughout.
Vendor naming: DeepMind, Google Robotics, OpenAI named as paper authors (positive citations only); MIT-affiliated authors named on the sepsis paper; no anonymization triggers.
A1 verbatim discipline: no vendor quotations in this lesson; the citations to InstructGPT and the canonical offline-RL papers are bibliographic only.

Word counts

Lesson 2415
Practice 1950
Summary 720
Cheatsheet 805
References 720
Brief 1075

Total ≈ 7685 words across 6 artifacts.

Notes for promotion

Component placeholders (�J0�, �J1�) live as MDX comments. The �J2� is configured for CS285 “Offline RL Introduction” by Sergey Levine.
Practice uses real �J0� + �J1� component imports.
Prereq path form for the L7 DQN reference (cited in lesson body and practice): lessons/deep-reinforcement-learning/dqn (the within-track resolvable form). For L2 imitation learning: lessons/deep-reinforcement-learning/imitation-learning. For L13 RLHF: lessons/deep-reinforcement-learning/rlhf.
Lesson body uses fenced display blocks for the Bellman target equation, the dataset notation, and the MDP specification; Greek symbols stay in those fenced blocks. Prose spells gamma, alpha, epsilon.
This is the first of two offline-RL lessons. L15 is the practical companion; the brief’s “What you will learn” is intentionally narrow on the problem-definition side, not the algorithms.