References: Offline RL, the problem

  • Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643. https://arxiv.org/abs/2005.01643 The canonical survey paper. Defines the problem, names the failure mode, surveys the algorithms (BCQ, CQL, IQL among others), and enumerates open problems. Read sections 2 (problem setup) and 3 (challenges) for the framing this lesson uses.

Extrapolation error and the failure mechanism

  • Fujimoto, S., Meger, D., & Precup, D. (2019). Off-Policy Deep Reinforcement Learning without Exploration. ICML 2019. https://arxiv.org/abs/1812.02900 The original BCQ paper. Section 4 names extrapolation error explicitly and provides the toy MDP analysis that motivates this lesson’s two-state worked example. Section 5 demonstrates divergence empirically on standard benchmarks.
  • Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS 2019. https://arxiv.org/abs/1906.00949 The BEAR paper. Decomposes the failure into bootstrapping error and analyzes how distributional shift compounds across Bellman updates. Useful complement to Fujimoto et al. on the mechanism.
  • Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219. https://arxiv.org/abs/2004.07219 The standard offline-RL benchmark. Locomotion, manipulation, navigation, and Adroit hand tasks, each with several behavior-policy distributions (medium, expert, mixed). The benchmark that made cross-algorithm comparisons meaningful in the field.
  • Gulcehre, C., Wang, Z., Novikov, A., et al. (2020). RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning. NeurIPS 2020. https://arxiv.org/abs/2006.13888 Parallel benchmark suite from DeepMind. Covers DM Control Suite, DMLab, Atari, and real-world robotics datasets.
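
The extrapolation-error mechanism named by Fujimoto et al. can be illustrated with a tiny tabular sketch. Everything in it is an assumption made for illustration: the one-state MDP, the logged dataset, the initial over-estimate of 100 for the unseen action, and the function name fit_q are hypothetical. The constrained variant is a simplified stand-in in the spirit of the batch constraint, not the BCQ algorithm from the paper.

```python
GAMMA = 0.9
ACTIONS = (0, 1)

# Hypothetical logged data: (state, action, reward, next_state).
# Only action 0 ever appears, so action 1 is out of distribution.
DATASET = [("s", 0, 1.0, "s")]


def fit_q(dataset, constrain_to_data, sweeps=2000, lr=0.5):
    """Run tabular Q-learning backups over a fixed dataset."""
    # The unseen action starts with an arbitrary over-estimate,
    # standing in for function-approximation error.
    q = {("s", 0): 0.0, ("s", 1): 100.0}

    # Record which actions the dataset contains in each state.
    seen = {}
    for s, a, _, _ in dataset:
        seen.setdefault(s, set()).add(a)

    for _ in range(sweeps):
        for s, a, r, s_next in dataset:
            # Unconstrained: max over every action, including unseen ones.
            # Constrained: max only over actions observed in s_next.
            candidates = seen.get(s_next, set()) if constrain_to_data else ACTIONS
            bootstrap = max((q[(s_next, b)] for b in candidates), default=0.0)
            target = r + GAMMA * bootstrap
            q[(s, a)] += lr * (target - q[(s, a)])
    return q


naive = fit_q(DATASET, constrain_to_data=False)
constrained = fit_q(DATASET, constrain_to_data=True)
print("unconstrained Q(s, 0):", round(naive[("s", 0)], 3))
print("constrained   Q(s, 0):", round(constrained[("s", 0)], 3))
```

Under these assumptions the unconstrained backup drives Q(s, 0) toward 91, because it bootstraps from the estimate for action 1 that no logged transition can ever correct. The data-constrained backup approaches 10, the true discounted return of repeating the logged action.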

Applications and motivating real-world settings

  • Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., & Faisal, A. A. (2018). The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24, 1716-1720. https://www.nature.com/articles/s41591-018-0213-5 The canonical healthcare offline-RL paper. Sepsis treatment from MIMIC-III data. Read it as the kind of high-stakes setting where the L14 failure mode would be catastrophic.
  • Chen, M., Beutel, A., Covington, P., et al. (2019). Top-K Off-Policy Correction for a REINFORCE Recommender System. WSDM 2019. https://arxiv.org/abs/1812.02353 The YouTube recommender offline-RL paper. Production-scale offline policy improvement with explicit off-policy correction.
  • Kalashnikov, D., Irpan, A., Pastor, P., et al. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. CoRL 2018. https://arxiv.org/abs/1806.10293 The Google Robotics offline-then-online manipulation paper. Offline pretraining on a large set of logged real-robot grasp attempts, then bounded online refinement. Cited as the canonical hybrid pipeline.

Related: implicit constraint via KL regularization (the RLHF connection)

  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 The InstructGPT paper covered in L13 RLHF. Cited here because RLHF’s PPO step is offline-RL-adjacent (fixed preference dataset, no environment exploration on the deployment distribution), but escapes divergence via an explicit KL penalty to the SFT reference policy. The KL term plays the same role BCQ’s action constraint plays: it keeps the trained policy close to the data distribution. Worth re-reading from the L14 angle.
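
The role the KL term plays can be sketched with a discrete-action stand-in. This is not InstructGPT's training code: the three-action setup, the reward values, the two candidate policies, and the function names are all hypothetical, chosen so that one action has an inflated reward estimate but almost no support under the reference policy.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_objective(policy, reference, rewards, beta):
    """Expected reward minus beta times the KL to the reference policy."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


# Hypothetical numbers: action 2 is rare under the reference policy,
# and its reward estimate is assumed to be unreliably high.
reference = [0.49, 0.49, 0.02]
rewards = [1.0, 1.0, 5.0]

# Two candidate policies: one stays at the reference, one chases action 2.
stay_close = list(reference)
chase_reward = [0.01, 0.01, 0.98]

for beta in (0.0, 2.0):
    close_score = penalized_objective(stay_close, reference, rewards, beta)
    chase_score = penalized_objective(chase_reward, reference, rewards, beta)
    print(f"beta={beta}: stay_close={close_score:.3f} chase_reward={chase_score:.3f}")
```

With beta set to 0 the policy that piles onto the rarely seen action scores higher. With beta set to 2 the penalty outweighs the inflated reward and the policy that stays at the reference wins, which is the constraint-to-the-data behavior the entry above describes.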

Source curriculum (structural mirror, cited as further study):

  • UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine). Course page: http://rail.eecs.berkeley.edu/deeprlcourse/ Lecture videos: YouTube (link-out only).

Clawdemy's lessons are original prose that follows the pedagogical arc of this source. We do not reproduce or transcribe it; we cite it as a recommended companion. All rights to the original material remain with its authors.