References: RLHF
Primary sources (load-bearing for this lesson)
The InstructGPT pipeline
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 InstructGPT. The canonical RLHF pipeline; SFT + RM + PPO + KL at scale.
- Stiennon, N., Ouyang, L., Wu, J., et al. (2020). Learning to summarize with human feedback. NeurIPS 2020. https://arxiv.org/abs/2009.01325 Pre-InstructGPT scaled RLHF on summarization. The methodological foundation.
- Christiano, P. F., Leike, J., Brown, T. B., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS 2017. https://arxiv.org/abs/1706.03741 The original deep-RL-from-preferences paper on Atari and MuJoCo.
Anthropic’s RLHF and Constitutional AI
- Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862 Anthropic’s RLHF paper.
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073 The Constitutional AI / RLAIF paper.
Variants and successors
- Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. https://arxiv.org/abs/2305.18290 DPO. The variational shortcut skipping the explicit reward model and the PPO loop.
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300. https://arxiv.org/abs/2402.03300 The origin paper for GRPO (Group Relative Policy Optimization): drop the value-network critic and use group-normalized rewards as the advantage.
- DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. https://arxiv.org/abs/2501.12948 The popularizer of GRPO via large-scale reasoning RL; DeepSeekMath introduced the method.
- Azar, M. G., Rowland, M., Piot, B., et al. (2024). A General Theoretical Paradigm to Understand Learning from Human Preferences. AISTATS 2024. https://arxiv.org/abs/2310.12036 IPO; theoretical generalization of DPO.
- Ethayarajh, K., Xu, W., Muennighoff, N., et al. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML 2024. https://arxiv.org/abs/2402.01306 KTO; a preference-optimization variant that uses a prospect-theory-inspired loss to learn from unpaired binary (desirable / undesirable) feedback rather than pairwise comparisons.
Operational instruments
- Perez, E., Ringer, S., Lukošiūtė, K., et al. (2023). Discovering Language Model Behaviors with Model-Written Evaluations. Findings of ACL 2023. https://arxiv.org/abs/2212.09251 Sycophancy and related-behaviors benchmark; the Anthropic evaluation suite.
- Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548. https://arxiv.org/abs/2310.13548 Focused empirical study of sycophancy in RLHF-tuned models.
- Mazeika, M., Phan, L., Yin, X., et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. ICML 2024. https://arxiv.org/abs/2402.04249 HarmBench; the standardized red-team benchmark.
- Hendrycks, D., Burns, C., Basart, S., et al. (2021). Measuring Massive Multitask Language Understanding. ICLR 2021. https://arxiv.org/abs/2009.03300 MMLU; the standard general-capability eval used to detect RLHF capability degradation.
- Cobbe, K., Kosaraju, V., Bavarian, M., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168. https://arxiv.org/abs/2110.14168 GSM8K; math-reasoning eval, particularly sensitive to reasoning-degradation failure modes of RLHF.
- Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374. https://arxiv.org/abs/2107.03374 HumanEval; code-generation eval used as a capability-retention proxy.
The Bradley-Terry preference model
- Bradley, R. A., & Terry, M. E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324-345. The original Bradley-Terry paper from 70 years before RLHF; still the standard parameterization for preference data.
- Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. Annals of Statistics, 32(1), 384-406. https://www.jstor.org/stable/3448512 A modern treatment of fitting Bradley-Terry models.
Open-source implementations
- Huang, S., Dossa, R. F. J., Raffin, A., et al. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track 2022. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ The PPO implementation-details deep-dive that became the de facto reference for RLHF PPO implementations.
- TRL (Transformer Reinforcement Learning): https://github.com/huggingface/trl Hugging Face’s open-source RLHF / DPO / PPO library. The reference implementation used in many academic and industry RLHF runs.
- trlx: https://github.com/CarperAI/trlx CarperAI’s RLHF library. Released after TRL, with a focus on distributed training at larger scale; complementary.
Recent surveys
- Casper, S., Davies, X., Shi, C., et al. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Transactions on Machine Learning Research. https://arxiv.org/abs/2307.15217 Comprehensive survey of RLHF’s documented limitations and open research questions. Read together with this lesson if you want the field-wide perspective.
- Kaufmann, T., Weng, P., Bengs, V., & Hüllermeier, E. (2024). A Survey of Reinforcement Learning from Human Feedback. arXiv:2312.14925. https://arxiv.org/abs/2312.14925 Another recent survey with somewhat different emphasis.
Berkeley CS285 (course source for this track)
CS285 does not have a dedicated RLHF lecture as of the 2023 syllabus. This lesson draws on the primary papers above and on the variational framework set up in CS285 L18-L19 (Lessons 11-12 of this track). For the broader landscape of RL applied to language models, see also Anthropic’s research blog, OpenAI’s GPT-4 technical report, and the DeepMind alignment team’s publications.
Source material
Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)

Clawdemy's lessons are original prose that follows the pedagogical arc of this source. We do not reproduce or transcribe it; we cite it as a recommended companion. All rights to the original material remain with its authors.