
References: Policy gradient and the path to modern RL

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 7:
Policy Gradient Methods
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the policy-gradient material in
Silver's Lecture 7 (the policy-gradient theorem at intuition level, REINFORCE,
actor-critic blueprint) and adds the closing bridge to the modern landscape
(PPO as the workhorse, RLHF on LLMs as the named applied case, and a
cross-reference to the Track 5 rlhf-and-dpo lesson for the alignment side).
The one-step worked examples on 2-action and 3-action softmax policies (where
pi(rewarded action) climbs 0.50 -> 0.55 and 0.333 -> 0.403 respectively) are
Clawdemy framing designed to make the REINFORCE intuition tangible. PPO's
clipping mechanism is named but not
derived (engineering-level); SAC's entropy regularization likewise. Exact
per-lecture URLs are verified at promotion.
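
For readers who want to check the one-step softmax figures quoted in the scope note, here is a minimal sketch of a single REINFORCE update on softmax logits. It assumes uniform initial logits and an effective step (learning rate times return) of 0.2 for the 2-action case and 0.3 for the 3-action case. Those step values are assumptions chosen because they reproduce the quoted probabilities; the lesson's own learning rate and reward are not stated on this page. The function names are illustrative, not taken from the lesson.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, step):
    """One REINFORCE update on softmax logits.

    `step` stands for alpha * G (learning rate times return); its value is
    an assumption here. For a softmax policy, the gradient of
    log pi(action) with respect to the logits is onehot(action) - pi.
    """
    pi = softmax(logits)
    return [
        z + step * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, pi))
    ]


# Uniform start, action 0 was rewarded (assumed setup).
two = softmax(reinforce_step([0.0, 0.0], action=0, step=0.2))
three = softmax(reinforce_step([0.0, 0.0, 0.0], action=0, step=0.3))

print(round(two[0], 3), round(three[0], 3))  # 0.55 0.403
```

Under these assumptions the rewarded action's probability moves from 0.50 to about 0.55 with two actions and from 0.333 to about 0.403 with three, while the unrewarded actions lose probability mass so that each distribution still sums to 1.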

A short, durable list. All free.

  • Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 13 (Policy Gradient Methods). The textbook treatment, including the policy-gradient theorem proof, REINFORCE with a baseline, and actor-critic methods.
  • Schulman et al., “Proximal Policy Optimization Algorithms” (2017), the PPO paper. It introduces the clipped objective that made policy-gradient training stable enough to become the modern workhorse. Freely available on arXiv.
  • Clawdemy, Track 5 (AI Foundations), rlhf-and-dpo lesson. The other side of the RLHF picture: the LM-alignment perspective. This track teaches the RL mechanics RLHF assumes; T5 teaches what RLHF is doing at the alignment level. Reading both closes the loop.

Where this leads beyond this track.

  • Function approximation and deep RL. The previous lesson. The move made there, approximating V and Q with a parameterized function, is repeated here for the policy itself (pi_theta represented by a neural network).
  • Q-learning. Lesson 8. The value-based alternative for control. Modern systems often blend value-based and policy-based methods (actor-critic is the canonical hybrid).
  • Track 5 (AI Foundations), rlhf-and-dpo. Where the policy-gradient bridge in this lesson lands on the alignment side: RLHF for LLMs as an alignment technique, with the DPO refinement that avoids the explicit PPO step.
  • Out of scope for this track but natural next steps. Model-based RL (Dyna, MuZero), exploration in depth (intrinsic motivation, UCB, Thompson sampling), partial observability (POMDPs, recurrent policies), multi-agent RL, imitation learning, offline RL.