References: Actor-critic methods

Source material

Source curriculum (structural mirror, cited as further study):
• Berkeley CS285 (CS185), Deep Reinforcement Learning, Lecture 6: Actor Critic
  Instructor: Sergey Levine
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos (Fall 2023 recordings, most recent at time of authoring):
    https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
  License: YouTube standard (link-out only, no embed, no transcript republication)
This Clawdemy lesson is an original walkthrough of actor-critic with a
quantified variance-reduction worked example reusing the lesson 4 sigmoid bandit,
following the pedagogical arc of CS285 Lecture 6. We cite the lecture as the
recommended full-depth companion; we do not reproduce or transcribe the videos.
All rights to the original lectures remain with the creator.

Watch this next

CS285 Lecture 6, Actor Critic (Sergey Levine, Berkeley). The lecture this lesson mirrors. Levine derives the variance-reduction motivation, presents the actor-critic skeleton, works through the bias-variance tradeoff with explicit notation, and previews GAE before PPO. Watching the bias appear and disappear as the critic improves is the clearest way to feel why the tradeoff matters in practice.

Going deeper (foundational papers)

Asynchronous Methods for Deep Reinforcement Learning (Mnih et al., ICML 2016). The A3C paper. Introduces asynchronous parallel actor-critic, which became a workhorse deep-RL algorithm. Its synchronous counterpart, A2C, is described in OpenAI Baselines and is the variant most people now use; both share the actor-critic skeleton in this lesson.
High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016). The GAE paper (also cited in lesson 4). The full derivation of λ-weighted advantage estimation, with the bias-variance analysis quantifying what λ buys. GAE is the standard advantage estimator in PPO and most modern policy-gradient implementations.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (Haarnoja, Zhou, Abbeel, Levine, ICML 2018). The SAC paper. Actor-critic with a Q_φ critic, entropy regularization in the policy objective, and twin Q-networks for stability. A widely used algorithm for continuous-control robotics, and the canonical example of Q-critic actor-critic.
Proximal Policy Optimization Algorithms (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017). The PPO paper. Actor-critic with a clipped trust-region objective. Cited as a forward reference, since lesson 8 covers it; PPO is the algorithm used in the RLHF post-training step of modern LLMs (lesson 13).

Going deeper (textbooks and tutorials)

Reinforcement Learning: An Introduction (Sutton and Barto, 2nd edition), Section 13.5: Actor-Critic Methods. The textbook treatment, with the policy-gradient theorem extended to the actor-critic case and the bias-variance analysis in full. Chapter 12, on eligibility traces, supplies the underlying math for n-step returns and GAE.
Spinning Up in Deep RL: VPG, PPO, SAC. Achiam’s pedagogical implementations with pseudocode, derivations, and working code for the actor-critic family. Useful as a practical companion when you implement these for the first time.

Adjacent topics

Where this sits in the wider curriculum.

REINFORCE (previous lesson). Actor-critic is REINFORCE with a learned baseline V_φ in place of no baseline or a hand-chosen one. The lesson 4 sigmoid bandit returns here as the quantified variance-reduction worked example: Var(g_REINFORCE) = 0.0625 becomes Var(g_AC) = 0 with the optimal baseline V*. The bias-variance tradeoff this lesson introduces is the design dial every later policy-gradient method tunes.
Advanced policy gradients: TRPO and PPO (lesson 8). PPO is actor-critic with GAE and a clipped trust-region objective. The actor-critic skeleton in this lesson is what PPO builds on; the clipping and trust-region material is what lesson 8 adds.
Value-based RL (lessons 6 and 7, opening Phase 2). The next phase takes the other branch of the lesson 3 dispatch table: learn Q_θ(s, a) directly and act greedily, with no explicit policy network. Lesson 6 derives Q-learning from the Bellman optimality equation; lesson 7 covers the practical stabilizers (replay buffer, target network) that turn it into DQN.
RL for large language models (lesson 13, RLHF). The PPO used to post-train LLMs is an actor-critic algorithm in this lesson’s sense. The actor is the LM; the critic is a value-network head added during fine-tuning. The advantage is computed with GAE.
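
A quick check of the variance figures quoted in the REINFORCE entry above. This page does not restate the bandit, so the setup here is an assumption: a two-armed bandit with π(a=1) = σ(θ) evaluated at θ = 0, reward 1 for arm 1 and reward 0 for arm 0. That choice reproduces the quoted numbers, but it may not match the lesson's definition in every detail. The function name grad_estimator_stats is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grad_estimator_stats(theta, rewards, baseline):
    """Exact mean and variance of g = (r(a) - baseline) * d/dtheta log pi(a).

    Enumerates both arms instead of sampling, so the result is exact.
    """
    p = sigmoid(theta)                 # pi(a=1)
    probs = {1: p, 0: 1.0 - p}         # action probabilities
    score = {1: 1.0 - p, 0: -p}        # d/dtheta log pi(a) for a sigmoid policy
    samples = [(probs[a], (rewards[a] - baseline) * score[a]) for a in (0, 1)]
    mean = sum(w * g for w, g in samples)
    var = sum(w * (g - mean) ** 2 for w, g in samples)
    return mean, var


# Assumed bandit: arm 1 pays 1, arm 0 pays 0, policy parameter theta = 0.
rewards = {0: 0.0, 1: 1.0}
theta = 0.0

# With a single state, the optimal baseline V* is the expected reward.
p1 = sigmoid(theta)
v_star = p1 * rewards[1] + (1.0 - p1) * rewards[0]

mean_rf, var_rf = grad_estimator_stats(theta, rewards, baseline=0.0)
mean_ac, var_ac = grad_estimator_stats(theta, rewards, baseline=v_star)

print(f"REINFORCE:    mean={mean_rf}, var={var_rf}")
print(f"Actor-critic: mean={mean_ac}, var={var_ac}")
```

Under these assumptions both estimators have the same mean (the baseline adds no bias), while the variance drops from 0.0625 to 0 once V* is subtracted. A learned V_φ only approximates V*, so in practice the variance shrinks without vanishing.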