References: Introduction to deep reinforcement learning

Source curriculum (structural mirror, cited as further study):
• Berkeley CS285 (CS185), Deep Reinforcement Learning, Lecture 1: Introduction
Instructor: Sergey Levine
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos (Fall 2023 recordings, most recent at time of authoring):
https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
License: YouTube standard (link-out only, no embed, no transcript republication)
This Clawdemy lesson is an original orientation around the agent-environment loop,
the three ML regimes, and a worked return computation, following the pedagogical arc
of Levine's CS285 introduction. We cite it as the recommended full-depth companion;
we do not reproduce or transcribe the videos. All rights to the original lectures
remain with the creator.
  • CS285 Lecture 1, Introduction (Sergey Levine, Berkeley), the first lecture of the Fall 2023 recording. The first 30-40 minutes give the full agent-environment framing and the high-level case for deep RL, with animations of the headline examples. Levine is a widely followed lecturer in the field, and the lecture stays accessible while pointing forward to the mathematical depth the course (and this track) builds.

A short, durable list. Each link is a specific next step, not a generic pile.

  • Reinforcement Learning: An Introduction (Sutton and Barto, 2nd edition). The canonical textbook of the field, freely available online from one of its authors. Chapters 1-3 cover the agent-environment loop, MDPs, and the value-function ideas at the same orientation level as this lesson, with a precision that a textbook offers and a video does not. If T17 (RL foundations) has not built the MDP material for you yet, start here.

  • Spinning Up in Deep RL (Joshua Achiam, OpenAI). A free, hands-on introduction specifically to deep RL, with code, math, and pedagogical pseudocode for the algorithms this track will cover (policy gradients, actor-critic, TRPO, PPO, DDPG, SAC). Designed to be the practitioner’s companion to a course like CS285.

Where this sits in the wider curriculum.

  • Imitation learning (next lesson). The simplest approach to learning a policy: ignore the reward and just copy an expert. Lesson 2 shows where behavioral cloning works, where it breaks (distribution shift), and why this failure mode motivates everything that follows.

  • T17 (RL foundations). Classical RL, MDPs, dynamic programming, and tabular methods. T18 (this track) is the deep variant; T17 is the foundation it builds on. If you have not encountered the MDP formalism before, T17 is the prerequisite that makes the math here land.

  • T13 (Build Neural Networks from Scratch) and T11 (Neural Network Intuition). Deep RL is RL with a neural network in place of a table. T11 builds the picture of a network as a function; T13 builds gradient descent and backpropagation from scratch. Both are assumed background for this track.

  • Track 5 (AI Foundations), RLHF lessons. The RLHF pipeline (preference data → reward model → PPO fine-tuning) underpins assistants such as ChatGPT, Claude, and Gemini. Lesson 13 of this track works through it as a deep RL application; Track 5 covers the LLM side of the pipeline.
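
A pocket refresher before following the links above. The lesson these references accompany includes a worked return computation, and the sketch below shows one common form of it: the discounted return, G = r_0 + γ·r_1 + γ²·r_2 + ... The function name, the example rewards, and the discount factor are illustrative assumptions, not taken from the lesson or from CS285.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * rewards[t] over one episode.

    Accumulates from the last step backward, so each step costs
    one multiply and one add: G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step episode, chosen so the arithmetic is easy to check by hand.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With gamma set to 1.0 the function reduces to a plain sum of rewards. Smaller values weight near-term rewards more heavily, which is the trade-off that Sutton and Barto's early chapters discuss in detail.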