References: Actor-critic methods

Source curriculum (structural mirror, cited as further study):
• Berkeley CS285 (CS185), Deep Reinforcement Learning, Lecture 6: Actor Critic
Instructor: Sergey Levine
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos (Fall 2023 recordings, most recent at time of authoring):
https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
License: YouTube standard (link-out only, no embed, no transcript republication)
This Clawdemy lesson is an original walkthrough of actor-critic with a
quantified variance-reduction worked example reusing the L4 sigmoid bandit,
following the pedagogical arc of CS285 Lecture 6. We cite the lecture as the
recommended full-depth companion; we do not reproduce or transcribe the videos.
All rights to the original lectures remain with the creator.
  • CS285 Lecture 6, Actor Critic (Sergey Levine, Berkeley). The lecture this lesson mirrors. Levine derives the variance-reduction motivation, presents the actor-critic skeleton, works through the bias-variance tradeoff with explicit notation, and previews generalized advantage estimation (GAE) ahead of PPO. Watching the bias appear and disappear as the critic improves is the clearest way to feel why the tradeoff matters in practice.

Where this sits in the wider curriculum.

  • REINFORCE (previous lesson). Actor-critic is REINFORCE with a learned baseline V_φ in place of the missing or hand-chosen baseline of the bare estimator. The L4 sigmoid bandit returns here as the worked example that quantifies the variance reduction: Var(g_REINFORCE) = 0.0625 drops to Var(g_AC) = 0 when the baseline is the optimal V*. The bias-variance tradeoff this lesson introduces is the design dial every later policy-gradient method tunes.

  • Advanced policy gradients: TRPO and PPO (lesson 8). PPO is actor-critic with GAE and a clipped trust-region objective. The actor-critic skeleton in this lesson is what PPO builds on; the clipping and trust-region material is what lesson 8 adds.

  • Value-based RL (lessons 6 and 7, opening Phase 2). The next phase takes the other branch of the L3 dispatch table: learn Q_θ(s, a) directly and act greedily, with no explicit policy network. Lesson 6 derives Q-learning from the Bellman optimality equation; lesson 7 covers the practical stabilizers (replay buffer, target network) that turn it into DQN.

  • RL for large language models (lesson 13, RLHF). The PPO used to post-train LLMs is an actor-critic algorithm in this lesson’s sense. The actor is the LM; the critic is a value-network head added during fine-tuning. The advantage is computed with GAE.
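
The variance figures quoted in the REINFORCE entry above can be checked by exact enumeration. The sketch below is a minimal illustration, not the lesson's own code: it assumes a two-armed bandit with π(a=1) = sigmoid(θ), θ = 0, and rewards r(1) = 1, r(0) = 0. That setup is an assumption chosen because it reproduces the quoted numbers; the exact L4 bandit is not restated in this section. The names `gradient_stats`, `THETA`, and `REWARDS` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradient_stats(theta, rewards, baseline):
    """Exact mean and variance of the single-sample gradient estimate
    g = d/dtheta log pi(a) * (r(a) - baseline)
    for a two-armed bandit with pi(a=1) = sigmoid(theta)."""
    p = sigmoid(theta)
    probs = {0: 1.0 - p, 1: p}
    # Score function: d/dtheta log sigmoid(theta) = 1 - p,
    # and d/dtheta log(1 - sigmoid(theta)) = -p.
    scores = {0: -p, 1: 1.0 - p}
    samples = {a: scores[a] * (rewards[a] - baseline) for a in (0, 1)}
    mean = sum(probs[a] * samples[a] for a in (0, 1))
    var = sum(probs[a] * (samples[a] - mean) ** 2 for a in (0, 1))
    return mean, var


# Assumed setup (not taken from the lesson text): theta = 0, rewards 0 and 1.
THETA = 0.0
REWARDS = {0: 0.0, 1: 1.0}

# With a single state, the optimal value baseline V* is the expected reward.
p1 = sigmoid(THETA)
v_star = (1.0 - p1) * REWARDS[0] + p1 * REWARDS[1]

mean_rf, var_rf = gradient_stats(THETA, REWARDS, baseline=0.0)     # bare REINFORCE
mean_ac, var_ac = gradient_stats(THETA, REWARDS, baseline=v_star)  # V* as baseline

print("REINFORCE:   mean", mean_rf, "variance", var_rf)
print("V* baseline: mean", mean_ac, "variance", var_ac)
```

Under these assumptions both estimators have the same mean, so the baseline introduces no bias, while the variance goes from 0.0625 to 0. A learned V_φ that only approximates V* would land somewhere between the two.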