References: Policy gradient and the path to modern RL
Source material
Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 7: Policy Gradient Methods
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this course. We do not embed, reproduce, or transcribe Silver's slides or video lectures; we link out to the relevant lecture as recommended further study. The non-commercial clause aligns with Clawdemy's free, zero-revenue posture. All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors the policy-gradient material in Silver's Lecture 7 (the policy-gradient theorem at intuition level, REINFORCE, actor-critic blueprint) and adds the closing bridge to the modern landscape (PPO as the workhorse, RLHF on LLMs as the named applied case, T5 cross-ref to rlhf-and-dpo for the alignment side). The 2-action and 3-action softmax worked one-step examples (with explicit pi(rewarded action) climbing 0.50 -> 0.55 and 0.333 -> 0.403 respectively) are Clawdemy framing designed to make the REINFORCE intuition tangible. PPO's clipping mechanism is named but not derived (engineering-level); SAC's entropy regularization likewise. Exact per-lecture URLs are verified at promotion.
Read this next
- David Silver, UCL RL course, Lecture 7: Policy Gradient Methods. The lecture this lesson mirrors, with the policy-gradient theorem developed alongside actor-critic and explicit treatment of REINFORCE with and without a baseline. CC BY-NC 4.0, freely available.
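The source-scope note cites two one-step softmax examples, with pi(rewarded action) moving 0.50 -> 0.55 and 0.333 -> 0.403. As a minimal sketch of how such a step can be computed, the snippet below applies a single REINFORCE update to softmax logits. It rests on assumptions not confirmed by the lesson: logits start at zero (a uniform policy), the reward is 1, there is no baseline, and the step sizes 0.2 and 0.3 were back-solved to match the quoted figures. The function names are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, alpha):
    """One REINFORCE update (no baseline) for a single sampled action.

    For a softmax policy, d/d(logit_k) log pi(action) = 1[k == action] - pi(k).
    """
    probs = softmax(logits)
    return [
        z + alpha * reward * ((1.0 if k == action else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]


def prob_after_one_step(n_actions, alpha, reward=1.0, action=0):
    """Probability of the rewarded action after one update from a uniform policy."""
    logits = [0.0] * n_actions  # assumption: uniform starting policy
    new_logits = reinforce_step(logits, action, reward, alpha)
    return softmax(new_logits)[action]


if __name__ == "__main__":
    # Step sizes are assumptions chosen to match the figures quoted in the note.
    print(round(prob_after_one_step(2, alpha=0.2), 2))  # 2-action case
    print(round(prob_after_one_step(3, alpha=0.3), 3))  # 3-action case
```

Under these assumed settings the two printed values round to 0.55 and 0.403. The lesson itself may use a different reward and learning-rate pairing that yields the same product.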
Going deeper
A short, durable list. All free.
- Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 13 (Policy Gradient Methods). The textbook treatment, including the policy-gradient theorem proof, REINFORCE with a baseline, and actor-critic methods.
- Schulman et al., “Proximal Policy Optimization Algorithms” (2017), the PPO paper. Introduces the clipping mechanism that made policy gradient stable enough to be the modern workhorse. Freely available on arXiv.
- Clawdemy, Track 5 (AI Foundations), rlhf-and-dpo lesson. The other side of the RLHF picture: the LM-alignment perspective. This track teaches the RL mechanics RLHF assumes; T5 teaches what RLHF is doing at the alignment level. Reading both closes the loop.
Adjacent topics
Where this leads beyond this track.
- Function approximation and deep RL. The previous lesson. The function-approximation move done there for V/Q is paralleled here for the policy itself (pi_theta with a neural network).
- Q-learning. Lesson 8. The value-based control alternative. Modern systems often blend value-based and policy-based methods (actor-critic is the canonical hybrid).
- Track 5 (AI Foundations), rlhf-and-dpo. Where the policy-gradient bridge in this lesson lands on the alignment side: RLHF for LLMs as an alignment technique, with the DPO refinement that avoids the explicit PPO step.
- Out of scope for this track but natural next steps. Model-based RL (Dyna, MuZero), exploration in depth (intrinsic motivation, UCB, Thompson sampling), partial observability (POMDPs, recurrent policies), multi-agent RL, imitation learning, offline RL.