Actor-critic methods: brief

What you’ll learn

The last lesson ended with REINFORCE working but high-variance: signal-to-noise ratio of exactly 1 on a sigmoid bandit at θ = 0. The natural fix is the one the baseline trick already pointed at: subtract a function of the state that cancels the predictable part of the return. Actor-critic methods make that baseline a learned object, the critic, trained alongside the policy (the actor). The single capability this lesson builds: state the actor-critic skeleton, quantify the variance reduction with the optimal baseline, and recognize the bias-variance tradeoff that the actor-critic family lives or dies on.

You will meet:

The two-network split: an actor π_θ(a|s) and a critic V_φ(s) or Q_φ(s, a), trained jointly
The MC actor-critic advantage A = G_t - V_φ(s_t): an unbiased baseline, but it inherits the high variance of G_t
The TD actor-critic advantage A = r_t + γ V_φ(s_(t+1)) - V_φ(s_t): a bootstrapped target with low variance, and high bias when V_φ is wrong
The n-step return and GAE(λ) as smooth interpolations between the two (λ = 0 is TD, λ = 1 is MC; PPO defaults to λ ≈ 0.95)
The training loop: the critic regresses to the return target, and the actor uses the critic’s advantage in a REINFORCE-style update
A worked, quantified variance reduction on the L4 bandit (REINFORCE: Var = 0.0625, SNR = 1; actor-critic with the optimal baseline V* = 0.5: both single-sample gradients are exactly 0.25, so Var = 0 and SNR = ∞)
The bias that the variance reduction costs: a real V_φ is learned and is not equal to V^π
The rule that valid baselines are state-only: subtracting V_φ(s) is fine, while subtracting Q_φ(s, a) as a baseline would bias the gradient
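The lesson itself needs no code, but the claim that GAE(λ) interpolates between the TD and MC advantages is easy to check numerically. The following is a minimal sketch, not a reference implementation: the function name gae_advantages, the three-step episode, and the critic values are all made up for illustration, and the episode is assumed to end in a terminal state (bootstrap value 0).

```python
def gae_advantages(rewards, values, gamma, lam):
    """GAE(lambda) advantages for one episode.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value
             (0.0 here, assuming s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum from the next step.
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical toy episode (numbers invented for this sketch).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]
gamma = 0.9

td_adv = gae_advantages(rewards, values, gamma, lam=0.0)   # pure TD residuals
mc_adv = gae_advantages(rewards, values, gamma, lam=1.0)   # G_t - V(s_t)
mid_adv = gae_advantages(rewards, values, gamma, lam=0.95) # in between
```

With λ = 0 each advantage is just the one-step TD residual, and with λ = 1 it equals the discounted return minus V(s_t); intermediate λ values weight the longer returns geometrically.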

Where this fits

This is lesson 5 of Phase 1 (RL foundations) and closes Phase 1. Phase 1 has built imitation learning (L2), the MDP formalism (L3), REINFORCE (L4), and now actor-critic (L5), which completes the policy-gradient picture. Phase 2 opens on the other branch of the L3 dispatch table: value-based RL. Lesson 6 derives Q-learning from the Bellman optimality equation and shows what changes when the network learns Q_θ(s, a) directly, with no explicit policy network. The actor-critic skeleton built here returns in lesson 8 (PPO is actor-critic + GAE + clipped trust region) and lesson 13 (RLHF is PPO applied to language models).

Before you start

Prerequisite (within this track): lesson 4, Policy gradients (REINFORCE), since actor-critic is REINFORCE with a learned baseline. The worked example reuses the L4 sigmoid bandit and extends the variance computation from there. You also lean on the value functions from lesson 3 (the critic’s V_φ is a learned estimate of V^π). No coding, nothing installed; the practice is pen and paper with a calculator for the variance arithmetic.

By the end, you’ll be able to

State the actor-critic skeleton (actor π_θ + critic V_φ trained jointly) and explain why the critic supplies lower-variance gradient signal for the actor
Distinguish the MC, TD, n-step, and GAE advantage estimators by their bias and variance properties
Quantify the variance reduction with the optimal baseline on a sigmoid bandit (Var goes from 0.0625 to 0; SNR from 1 to infinity) and explain why the expected reward under the policy, a state-only quantity, is the right baseline
Identify which baselines are valid (state-only, like V_φ(s)) and which bias the gradient (action-dependent, like Q_φ(s, a))
Connect actor-critic to the modern algorithm family (A2C, A3C, SAC, PPO) and the RLHF post-training of LLMs
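The variance and valid-baseline outcomes above can be checked by exact enumeration rather than by hand. This is an optional sketch under stated assumptions: the bandit is taken to have π(a=1) = sigmoid(θ) with reward 1 for a = 1 and reward 0 for a = 0, which reproduces the quoted numbers but is an assumption about the L4 setup, and the helper name gradient_stats is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradient_stats(theta, rewards, baseline):
    """Exact mean and variance of the single-sample policy-gradient estimate.

    rewards, baseline: dicts keyed by action (0 or 1). A state-only baseline
    uses the same value for both actions; an action-dependent one does not.
    """
    p1 = sigmoid(theta)
    probs = {1: p1, 0: 1.0 - p1}
    # Score function d/dtheta log pi(a) for a sigmoid policy.
    scores = {1: 1.0 - p1, 0: -p1}
    samples = {a: scores[a] * (rewards[a] - baseline[a]) for a in (0, 1)}
    mean = sum(probs[a] * samples[a] for a in (0, 1))
    var = sum(probs[a] * (samples[a] - mean) ** 2 for a in (0, 1))
    return mean, var


rewards = {1: 1.0, 0: 0.0}  # assumed reward structure, see lead-in

# Plain REINFORCE: no baseline.
plain_mean, plain_var = gradient_stats(0.0, rewards, {1: 0.0, 0: 0.0})
# State-only baseline equal to the expected reward under the policy (0.5).
base_mean, base_var = gradient_stats(0.0, rewards, {1: 0.5, 0: 0.5})
# Action-dependent baseline (here the per-action reward, playing the role of Q).
q_mean, q_var = gradient_stats(0.0, rewards, {1: 1.0, 0: 0.0})
```

Under these assumptions the plain estimate has mean 0.25 and variance 0.0625 (SNR = 0.25 / 0.25 = 1), the state-only baseline keeps the mean at 0.25 while the variance drops to 0, and the action-dependent baseline changes the mean to 0, which is the bias the state-only rule guards against.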

Time and difficulty

Read time: about 13 minutes
Practice time: about 14 minutes (a variance computation on a stochastic-reward bandit showing 43% variance reduction, a valid-baseline classification drill, and flashcards)
Difficulty: standard