
Cheatsheet: Actor-critic methods

| Network | Symbol | Role |
| --- | --- | --- |
| Actor | `π_θ(a\|s)` | policy |
| Critic | V_φ(s) or Q_φ(s, a) | learned value-function estimate |

Trained jointly: the actor uses the critic for a lower-variance gradient signal; the critic regresses to the returns observed in the actor's rollouts.

| Variant | Advantage A(s_t, a_t) | Bias | Variance |
| --- | --- | --- | --- |
| MC actor-critic | G_t - V_φ(s_t) | none (unbiased for any state-only V_φ; baseline identity) | high |
| TD actor-critic (TD(0)) | r_t + γ V_φ(s_(t+1)) - V_φ(s_t) | high (V_φ(s_(t+1)) appears in the bootstrapped target) | low |
| n-step | Σ_(k=0)^(n-1) γ^k r_(t+k) + γ^n V_φ(s_(t+n)) - V_φ(s_t) | interpolates (bias enters via the bootstrapped V_φ(s_(t+n)) term) | interpolates |
| GAE(λ) | geometrically λ-weighted average of the n-step advantages | λ = 0: TD (most bias); λ = 1: MC (no bias) | tunable |

PPO default: GAE with λ ≈ 0.95 (closer to MC: V_φ acts mostly as a baseline, with only a small weight on bootstrapped targets).
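The GAE(λ) row can be made concrete with the usual backward recursion over TD residuals. The following is a minimal sketch, not a reference implementation: the function name `gae_advantages` and the convention that `values` carries one extra bootstrap entry are assumptions made for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for a single trajectory.

    rewards: r_0 .. r_(T-1)
    values:  V(s_0) .. V(s_T); the last entry is the bootstrap value
             (assumed convention: pass 0.0 if s_T is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_(t+1)) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Geometric (gamma * lam) weighting of all later residuals
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # terminal state, so the bootstrap value is 0
td_adv = gae_advantages(rewards, values, gamma=0.9, lam=0.0)  # TD(0) residuals
mc_adv = gae_advantages(rewards, values, gamma=0.9, lam=1.0)  # G_t - V(s_t)
```

With λ = 0 the recursion returns the one-step TD advantages, and with λ = 1 (and a terminal bootstrap value of 0) it returns the Monte Carlo advantages G_t - V(s_t), matching the two ends of the table.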

    for each iteration:
        collect rollouts by running π_θ
        compute G_t for every t
        critic update: φ ← φ - β · ∇_φ (G_t - V_φ(s_t))²   (regression)
        actor update:  θ ← θ + α · (G_t - V_φ(s_t)) · ∇_θ log π_θ(a_t|s_t)

For TD: replace G_t with r_t + γ V_φ(s_(t+1)) in both lines.
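As a rough illustration of the MC loop above, here is a minimal sketch on the two-armed sigmoid bandit used later in this cheatsheet (R(a=1) = 1, R(a=2) = 0). Everything beyond the two update rules is an assumption for illustration: the function name `train_bandit_actor_critic`, the scalar critic `v` for the single state, the step sizes, and the choice to compute the advantage once before either update.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_bandit_actor_critic(steps=2000, alpha=0.5, beta=0.1, seed=0):
    """MC actor-critic on a one-state, two-action bandit (illustrative)."""
    rng = random.Random(seed)
    theta = 0.0  # actor parameter: pi(a=1) = sigmoid(theta)
    v = 0.0      # critic: scalar value estimate for the single state
    for _ in range(steps):
        p1 = sigmoid(theta)
        took_a1 = rng.random() < p1
        g = 1.0 if took_a1 else 0.0  # return G_t of the one-step episode
        # d/dtheta log pi(a): (1 - p1) for a=1, -p1 for a=2
        grad_log_pi = (1.0 - p1) if took_a1 else -p1
        advantage = g - v  # computed once, with the pre-update critic
        # Critic: gradient descent on (G - V)^2, whose gradient is -2 (G - V)
        v += beta * 2.0 * advantage
        # Actor: policy-gradient ascent weighted by the advantage
        theta += alpha * advantage * grad_log_pi
    return theta, v


theta, v = train_bandit_actor_critic()
print(f"pi(a=1) = {sigmoid(theta):.3f}, V = {v:.3f}")
```

The two updates are kept as separate lines with separate step sizes to mirror the two-loss structure of the pseudocode; in a deep-RL setting they would be two optimizers or two loss terms.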

Quantified variance reduction (L4 sigmoid bandit, θ=0)

Setup: R(a=1) = 1, R(a=2) = 0, π_θ(a=1) = σ(θ), σ(0) = 0.5
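Optimal baseline V* = E_π[R] = 0.5

| Estimator | a=1 sample g | a=2 sample g | E[g] | Var(g) | std(g) | SNR |
| --- | --- | --- | --- | --- | --- | --- |
| REINFORCE (no baseline) | 1·(1-0.5) = 0.5 | 0·(-0.5) = 0 | 0.25 | 0.0625 | 0.25 | 1.0 |
| Actor-critic (V*) | (1-0.5)·(1-0.5) = 0.25 | (0-0.5)·(-0.5) = 0.25 | 0.25 | 0 | 0 | ∞ |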

The optimal baseline collapses variance to zero on deterministic-reward problems. Real V_φ is approximate, so Var(g_AC) > 0 in practice, but typically much less than Var(g_REINFORCE).
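The numbers in this section can be checked by enumerating both actions exactly rather than sampling. This is a small sketch under the stated setup; the helper name `gradient_stats` is hypothetical, and SNR is taken to be |E[g]| / std(g), which is the definition that yields the values quoted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradient_stats(theta, baseline):
    """Exact mean, variance, std and SNR of g = (R - b) * d/dtheta log pi(a)."""
    p1 = sigmoid(theta)
    # (probability, reward, score) for a=1 and a=2
    outcomes = [(p1, 1.0, 1.0 - p1), (1.0 - p1, 0.0, -p1)]
    samples = [(p, (r - baseline) * score) for p, r, score in outcomes]
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    std = math.sqrt(var)
    snr = abs(mean) / std if std > 0 else math.inf
    return mean, var, std, snr


print(gradient_stats(0.0, baseline=0.0))  # (0.25, 0.0625, 0.25, 1.0)
print(gradient_stats(0.0, baseline=0.5))  # (0.25, 0.0, 0.0, inf)
```

Both estimators have the same mean, so the baseline changes only the spread of the per-sample gradient, not the direction of the expected update.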

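| Choice | Status |
| --- | --- |
| V_φ(s) (state only) | valid: `E_(a~π)[V_φ(s)·∇log π(a\|s)] = 0` |
| Any b(s) not depending on a | valid, same reason |
| Q_φ(s, a) (depends on a) | invalid as a baseline: does not factor out of the expectation; biases the gradient |
| Constant c | valid (a special case of b(s)); cannot track differences in return between states, so it usually reduces variance less than V_φ(s) |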
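The validity column can be checked numerically for a single state. The sketch below assumes a softmax policy over three actions with arbitrary illustrative logits, and the helper name `expected_baseline_term` is hypothetical. It computes E_(a~π)[b · ∇log π(a|s)] exactly, which is the term a baseline adds to the expected gradient.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_baseline_term(logits, baseline_per_action):
    """E_(a~pi)[ b(a) * grad_logits log pi(a) ] for a softmax policy."""
    pi = softmax(logits)
    n = len(logits)
    expectation = [0.0] * n
    for a in range(n):
        for j in range(n):
            # d/d logit_j of log pi(a) is 1[j == a] - pi_j
            score_j = (1.0 if j == a else 0.0) - pi[j]
            expectation[j] += pi[a] * baseline_per_action[a] * score_j
    return expectation


logits = [0.0, 1.0, -0.5]  # arbitrary illustrative values
state_only = expected_baseline_term(logits, [0.7, 0.7, 0.7])  # b(s): same for all a
action_dep = expected_baseline_term(logits, [1.0, 0.0, 2.0])  # depends on a
print(state_only)  # zero up to floating-point rounding
print(action_dep)  # clearly non-zero: this is the bias that would be introduced
```

A state-only baseline factors out of the expectation and multiplies ∇ Σ_a π(a|s) = 0, while an action-dependent one stays inside the sum and leaves a non-zero remainder.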

MC: unbiased (for any state-only V_φ; the baseline identity holds regardless of critic error), high variance. TD: high bias, low variance. Bias enters when V_φ appears in the target (bootstrap), not when it appears as a baseline. GAE(λ) is the dial between the two: λ = 1 is unbiased MC, λ = 0 is biased TD. PPO defaults to λ ≈ 0.95, a moderate trade-off that sits closer to MC. Choosing λ is the main practical tuning decision for the whole family.

| Algorithm | Critic | Notes |
| --- | --- | --- |
| A2C / A3C (Mnih et al. 2016) | V_φ | synchronous / asynchronous, the original deep-RL AC |
| PPO (Schulman et al. 2017) | V_φ + GAE | clipped surrogate update (approximate trust region); RLHF backbone (L8 + L13) |
| SAC (Haarnoja et al. 2018) | Q_φ (twin) | entropy-regularized, continuous control workhorse |
| DDPG / TD3 | Q_φ | deterministic policy, off-policy continuous control |

  • Forgetting the critic is learned. V_φ has its own approximation error; early in training, bootstrapped advantages are biased and baseline-only advantages are still noisy.
  • Using Q_φ(s, a) as a baseline. It depends on the action, so it biases the gradient. V_φ(s) is the valid baseline; Q_φ can replace G_t in some variants.
  • TD(0) too early. Pure bootstrap with a wrong V_φ propagates error. Default to GAE with λ ≈ 0.95.
  • Conflating actor and critic objectives. Critic loss is MSE regression to target. Actor loss is policy gradient with advantage. Two networks, two losses, two gradient flows.

Actor-critic trains a policy π_θ and a value-function critic V_φ jointly. Used as a baseline, the critic reduces the variance of the policy gradient without adding bias (on the L4 bandit at θ=0 with the optimal baseline: Var from 0.0625 to 0, SNR from 1 to ∞); used as a bootstrapped target, it reduces variance further at the cost of some bias.