References: Policy gradient and the path to modern RL

Source material

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 7:
  Policy Gradient Methods
  Author: David Silver
  Course page: https://davidstarsilver.wordpress.com/teaching/
  License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause is consistent with Clawdemy's own CC BY-NC-SA 4.0 license: both forbid commercial use without permission. Commercial use of Clawdemy material is licensed separately at [/legal/licensing](/legal/licensing/).
All rights to the original materials remain with the author and UCL.

Source-scope note: this lesson mirrors the policy-gradient material in
Silver's Lecture 7 (the policy-gradient theorem at intuition level, REINFORCE,
and the actor-critic blueprint) and adds a closing bridge to the modern
landscape: PPO as the workhorse, RLHF on LLMs as the named applied case, and a
cross-reference to the Track 5 (T5) rlhf-and-dpo lesson for the alignment side.
The 2-action and 3-action softmax one-step worked examples, in which
pi(rewarded action) climbs 0.50 -> 0.55 and 0.333 -> 0.403 respectively, are
Clawdemy's own framing, designed to make the REINFORCE intuition tangible.
PPO's clipping mechanism and SAC's entropy regularization are named but not
derived (engineering-level treatment). Exact per-lecture URLs are verified at
promotion.
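
For readers who want to check the one-step numbers quoted in the scope note, here is a minimal sketch of a single REINFORCE update on a softmax policy. The zero-initialized logits, the reward of 1, and the step sizes 0.2 and 0.3 are assumptions chosen because they reproduce 0.50 -> 0.55 and 0.333 -> 0.403; the lesson itself may use a different setup, and the function names are illustrative only.

```python
import math

def softmax(logits):
    """Turn raw logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, step_size):
    """One REINFORCE update: logits += step_size * reward * grad log pi(action)."""
    probs = softmax(logits)
    # For a softmax over logits, d log pi(action) / d logit_k = 1[k == action] - pi(k).
    return [
        z + step_size * reward * ((1.0 if k == action else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]

# Assumed setup: action 0 was sampled and earned reward 1.
two = softmax(reinforce_step([0.0, 0.0], action=0, reward=1.0, step_size=0.2))
three = softmax(reinforce_step([0.0, 0.0, 0.0], action=0, reward=1.0, step_size=0.3))

print(round(two[0], 3), round(three[0], 3))  # 0.55 0.403
```

The rewarded action's probability rises and the other actions' probabilities fall by the same total amount, which is the whole REINFORCE intuition in one step.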

Going deeper

A short, durable list. All free.

Sutton and Barto, “Reinforcement Learning: An Introduction” (2nd edition), Chapter 13 (Policy Gradient Methods). The textbook treatment, including the policy-gradient theorem proof, REINFORCE with a baseline, and actor-critic methods.
Schulman et al., “Proximal Policy Optimization Algorithms” (2017). The PPO paper, which introduced the clipping mechanism that made policy gradient stable enough to become the modern workhorse. Freely available on arXiv (1707.06347).
Clawdemy, Track 5 (AI Foundations), rlhf-and-dpo lesson. The other side of the RLHF picture: the LM-alignment perspective. This track teaches the RL mechanics RLHF assumes; T5 teaches what RLHF is doing at the alignment level. Reading both closes the loop.

Adjacent topics

Where this lesson connects, inside this track and beyond it.

Function approximation and deep RL. The previous lesson. The function-approximation step applied there to V and Q is applied here to the policy itself (pi_theta, parameterized by a neural network).
Q-learning. Lesson 8. The value-based alternative for control. Modern systems often blend value-based and policy-based methods (actor-critic is the canonical hybrid).
Track 5 (AI Foundations), rlhf-and-dpo. Where the policy-gradient bridge in this lesson lands on the alignment side: RLHF for LLMs as an alignment technique, with the DPO refinement that avoids the explicit PPO step.
Out of scope for this track but natural next steps. Model-based RL (Dyna, MuZero), exploration in depth (intrinsic motivation, UCB, Thompson sampling), partial observability (POMDPs, recurrent policies), multi-agent RL, imitation learning, offline RL.
