References: Offline RL, the problem

Primary source

Levine, S. (2023). Berkeley CS285, Deep Reinforcement Learning, lecture on Offline RL: Introduction. Course materials at http://rail.eecs.berkeley.edu/deeprlcourse/. Lecture video at https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps.

The problem definition + survey

Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643. https://arxiv.org/abs/2005.01643 The canonical survey paper. Defines the problem, names the failure mode, surveys the algorithms available at the time (BCQ, BEAR, and CQL among others), and enumerates open problems. Read sections 2 (problem setup) and 3 (challenges) for the framing this lesson uses.

Extrapolation error and the failure mechanism

Fujimoto, S., Meger, D., & Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. ICML 2019. https://arxiv.org/abs/1812.02900 The original BCQ paper. Section 4 names extrapolation error explicitly and provides the toy MDP analysis that motivates this lesson’s two-state worked example. Section 5 demonstrates divergence empirically on standard benchmarks.
Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS 2019. https://arxiv.org/abs/1906.00949 The BEAR paper. Decomposes the failure into bootstrapping error and analyzes how distributional shift compounds across Bellman updates. Useful complement to Fujimoto et al. on the mechanism.

Benchmarks and datasets

Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219. https://arxiv.org/abs/2004.07219 The standard offline-RL benchmark. Locomotion, manipulation, navigation, and Adroit hand tasks, each with several behavior-policy distributions (medium, expert, mixed). The benchmark that made cross-algorithm comparisons meaningful in the field.
Gulcehre, C., Wang, Z., Novikov, A., et al. (2020). RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning. NeurIPS 2020. https://arxiv.org/abs/2006.13888 Parallel benchmark suite from DeepMind. Covers DM Control Suite, DMLab, Atari, and the Real-World RL suite (simulated tasks that model real-world challenges).

Applications and motivating real-world settings

Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., & Faisal, A. A. (2018). The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24, 1716-1720. https://www.nature.com/articles/s41591-018-0213-5 The canonical healthcare offline-RL paper. Sepsis treatment from MIMIC-III data. Read it as the kind of high-stakes setting where the L14 failure mode would be catastrophic.
Chen, M., Beutel, A., Covington, P., et al. (2019). Top-K Off-Policy Correction for a REINFORCE Recommender System. WSDM 2019. https://arxiv.org/abs/1812.02353 The YouTube recommender offline-RL paper. Production-scale offline policy improvement with explicit off-policy correction.
Kalashnikov, D., Irpan, A., Pastor, P., et al. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. CoRL 2018. https://arxiv.org/abs/1806.10293 The Google Robotics offline-then-online manipulation paper. Offline training on a large logged dataset of real-robot grasp attempts, then bounded online fine-tuning. Cited as the canonical hybrid pipeline.

Behavioral cloning as baseline

Pomerleau, D. A. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. Advances in Neural Information Processing Systems 1 (NIPS 1988). https://proceedings.neurips.cc/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf The original behavioral-cloning paper, which demonstrated supervised imitation of driving from logged data. Reference for the BC baseline this lesson compares against offline RL.
Ross, S., & Bagnell, J. A. (2010). Efficient Reductions for Imitation Learning. AISTATS 2010. https://proceedings.mlr.press/v9/ross10a.html The O(εT²) bound for behavioral cloning, used in the L2 imitation lesson and cited here as the baseline that offline RL aspires to exceed.

Related: implicit constraint via KL regularization (the RLHF connection)

Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 The InstructGPT paper covered in L13 RLHF. Cited here because RLHF’s PPO step is offline-RL-adjacent (fixed preference dataset, no environment exploration on the deployment distribution), but escapes divergence via an explicit KL penalty to the SFT reference policy. The KL term plays the same role BCQ’s action constraint plays: it keeps the trained policy close to the data distribution. Worth re-reading from the L14 angle.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.
