
References: RL fundamentals (MDPs, value functions, Bellman)

Source curriculum (structural mirror, cited as further study):
• Berkeley CS285 (CS185), Deep Reinforcement Learning, Lecture 4: RL Basics
Instructor: Sergey Levine
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos (Fall 2023 recordings, most recent at time of authoring):
https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
License: YouTube standard (link-out only, no embed, no transcript republication)
This Clawdemy lesson is an original walkthrough of the MDP formalism, the
value/Q/advantage functions, and the Bellman equation, following the
pedagogical arc of Levine's CS285 Lecture 4. We cite the lecture as the
recommended full-depth companion; we do not reproduce or transcribe the videos.
All rights to the original lectures remain with the creator.
  • CS285 Lecture 4, RL Basics (Sergey Levine, Berkeley). The lecture this lesson mirrors. Levine works the MDP definition with explicit notation, derives the Bellman equation step by step, and previews how every subsequent CS285 lecture (and every later lesson in this track) is defined against the objects introduced here. About one hour; the first 30 minutes are the most directly applicable to this lesson.
  • Reinforcement Learning: An Introduction (Sutton and Barto, 2nd edition). The canonical RL textbook, free online from one of the authors. Chapter 3 (Finite Markov Decision Processes) is the textbook treatment of the MDP tuple, the Markov property, returns, and the value functions, pitched at the same level as this lesson but with textbook precision. Chapter 4 (Dynamic Programming) covers value iteration and policy iteration, the classical algorithms that solve the Bellman optimality equation when the dynamics P are known. The equation takes its name from Richard Bellman, whose Dynamic Programming (1957) is the original work; the Bibliographical and Historical Remarks at the end of Sutton and Barto Chapter 3 trace the lineage.

  • Spinning Up in Deep RL: Key Concepts (Joshua Achiam, OpenAI). A concise reference for MDPs, value functions, and the Bellman equations, written in the deep-RL notation you will see in later papers in this track. Useful for quick lookup once the formalism is in hand.

Where this sits in the wider curriculum.

  • Policy gradients (next lesson). Lesson 4 takes the most direct route to improving a policy: parameterize π_θ as a neural network and follow the gradient of the expected return. The derivation uses the value-function vocabulary built here, and the REINFORCE algorithm that results is the foundation of every policy-gradient method in the track.

  • Value-based RL (lessons 6 and 7). The Bellman optimality equation Q*(s, a) = R(s, a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s', a') is the equation that Q-learning trains a neural network Q_θ to satisfy. Lessons 6 and 7 build that from scratch.

  • Actor-critic and advanced PG (lessons 5 and 8). These use the value functions defined here (V^π, Q^π, and especially the advantage A^π = Q^π - V^π) as variance-reduced training signals for the policy.

  • T17 (RL Foundations, in parallel). T17 covers classical tabular RL in depth: dynamic programming, value iteration, policy iteration, temporal-difference learning, Monte Carlo methods. This T18 lesson assumes the MDP material in T17 Chapters 3-4 (or Sutton-and-Barto equivalents). If the linear-system Bellman equation here felt too compressed, T17 (or Sutton-and-Barto Chapter 4) is the slower walkthrough.

  • T11 (Neural Network Intuition) and T13 (Build Neural Networks from Scratch). The “deep” in deep RL means the value or policy is a neural network rather than a table. Those tracks build the network side of the picture; this lesson and the rest of T18 build the RL side.
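
To make the objects named in the list above concrete, here is a minimal tabular sketch of the Bellman optimality equation being solved by repeated backups (Q-value iteration), followed by the advantage computed as Q minus V. It is an illustration under stated assumptions, not the method of any later lesson: the two-state, two-action MDP, its rewards R, its transition table P, the discount of 0.9, and every function name are hypothetical and chosen only so the arithmetic is easy to check. The dynamics are assumed known, which is the classical dynamic-programming setting rather than the deep-RL one.

```python
# Hypothetical 2-state, 2-action MDP. All numbers are made up for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# R[s][a]: expected immediate reward for taking action a in state s.
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]

# P[s][a][s2]: probability of landing in state s2 after taking a in s.
P = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.0, 1.0], [0.9, 0.1]],
]


def bellman_optimality_backup(Q):
    """Apply Q(s,a) <- R(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a') once."""
    return [
        [
            R[s][a] + GAMMA * sum(P[s][a][s2] * max(Q[s2]) for s2 in STATES)
            for a in ACTIONS
        ]
        for s in STATES
    ]


def q_value_iteration(tol=1e-10, max_iters=10_000):
    """Repeat the backup until the largest change in any entry falls below tol."""
    Q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(max_iters):
        Q_new = bellman_optimality_backup(Q)
        delta = max(abs(Q_new[s][a] - Q[s][a]) for s in STATES for a in ACTIONS)
        Q = Q_new
        if delta < tol:
            break
    return Q


Q_star = q_value_iteration()

# For the greedy (optimal) policy, V*(s) = max_a Q*(s, a).
V_star = [max(Q_star[s]) for s in STATES]

# Advantage of each action relative to the greedy policy: A(s, a) = Q(s, a) - V(s).
A_star = [[Q_star[s][a] - V_star[s] for a in ACTIONS] for s in STATES]

# Greedy policy: pick the action with the largest Q-value in each state.
greedy_policy = [max(ACTIONS, key=lambda a: Q_star[s][a]) for s in STATES]

if __name__ == "__main__":
    print("Q*:", Q_star)
    print("V*:", V_star)
    print("A*:", A_star)
    print("greedy policy:", greedy_policy)
```

In this toy MDP, state 1 can collect a reward of 2 forever by repeating action 0, so V*(1) converges to 2 / (1 - 0.9) = 20, and the greedy action in state 0 is the one that usually moves to state 1. The advantages are zero for the greedy action and negative otherwise, which is why A is a natural "better or worse than usual" signal. Q-learning, as described in the list above, targets the same fixed point but from sampled transitions and with a network Q_θ in place of the table.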