References: Offline RL, the problem
Primary source
- Levine, S. (2023). Berkeley CS285, Deep Reinforcement Learning, lecture on Offline RL: Introduction. Course materials at http://rail.eecs.berkeley.edu/deeprlcourse/. Lecture video at https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps.
The problem definition + survey
- Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643. https://arxiv.org/abs/2005.01643 The canonical survey paper. Defines the problem, names the failure mode, surveys the algorithms available at the time (BCQ, BEAR, CQL among others), and enumerates open problems. Read sections 2 (problem setup) and 3 (challenges) for the framing this lesson uses.
Extrapolation error and the failure mechanism
- Fujimoto, S., Meger, D., & Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. ICML 2019. https://arxiv.org/abs/1812.02900 The original BCQ paper. Section 4 names extrapolation error explicitly and provides the toy MDP analysis that motivates this lesson’s two-state worked example. Section 5 demonstrates divergence empirically on standard benchmarks.
- Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS 2019. https://arxiv.org/abs/1906.00949 The BEAR paper. Decomposes the failure into bootstrapping error and analyzes how distributional shift compounds across Bellman updates. Useful complement to Fujimoto et al. on the mechanism.
Benchmarks and datasets
- Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219. https://arxiv.org/abs/2004.07219 The standard offline-RL benchmark. Locomotion, manipulation, navigation, and Adroit hand tasks, each with several behavior-policy distributions (medium, expert, mixed). The benchmark that made cross-algorithm comparisons meaningful in the field.
- Gulcehre, C., Wang, Z., Novikov, A., et al. (2020). RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning. NeurIPS 2020. https://arxiv.org/abs/2006.13888 Parallel benchmark suite from DeepMind. Covers DM Control Suite, DMLab, Atari, and Real-World RL (RWRL) suite datasets.
Applications and motivating real-world settings
- Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., & Faisal, A. A. (2018). The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24, 1716-1720. https://www.nature.com/articles/s41591-018-0213-5 The canonical healthcare offline-RL paper. Sepsis treatment from MIMIC-III data. Read it as the kind of high-stakes setting where the L14 failure mode would be catastrophic.
- Chen, M., Beutel, A., Covington, P., et al. (2019). Top-K Off-Policy Correction for a REINFORCE Recommender System. WSDM 2019. https://arxiv.org/abs/1812.02353 The YouTube recommender offline-RL paper. Production-scale offline policy improvement with explicit off-policy correction.
- Kalashnikov, D., Irpan, A., Pastor, P., et al. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. CoRL 2018. https://arxiv.org/abs/1806.10293 The Google Robotics offline-then-online manipulation paper. Offline training on a large logged dataset of real-robot grasp attempts, then bounded online refinement. Cited as the canonical hybrid pipeline.
Behavioral cloning as baseline
- Pomerleau, D. A. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. Advances in Neural Information Processing Systems 1 (NIPS 1988). https://proceedings.neurips.cc/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf The original behavioral-cloning paper, which demonstrated supervised imitation of driving from logged data. Reference for the BC baseline this lesson compares offline RL against.
- Ross, S., & Bagnell, J. A. (2010). Efficient Reductions for Imitation Learning. AISTATS 2010. https://proceedings.mlr.press/v9/ross10a.html The O(epsilon T squared) bound for behavioral cloning, used in the L2 imitation lesson and cited here as the baseline that offline RL aspires to exceed.
Related: implicit constraint via KL regularization (the RLHF connection)
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 The InstructGPT paper covered in L13 RLHF. Cited here because RLHF’s PPO step is offline-RL-adjacent (fixed preference dataset, no environment exploration on the deployment distribution), but escapes divergence via an explicit KL penalty to the SFT reference policy. The KL term plays the same role BCQ’s action constraint plays: it keeps the trained policy close to the data distribution. Worth re-reading from the L14 angle.
Source material
Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)

Clawdemy's lessons are original prose that follows the pedagogical arc of this source. We do not reproduce or transcribe it; we cite it as a recommended companion. All rights to the original material remain with its authors.