References: What reinforcement learning actually is

Source curriculum (structural mirror, cited as further study):
• David Silver, "Reinforcement Learning" (UCL course), Lecture 1:
Introduction to Reinforcement Learning
Author: David Silver
Course page: https://davidstarsilver.wordpress.com/teaching/
License: CC BY-NC 4.0
Clawdemy's lessons are original prose that follows the pedagogical arc of this
course. We do not embed, reproduce, or transcribe Silver's slides or video
lectures; we link out to the relevant lecture as recommended further study.
The non-commercial clause aligns with Clawdemy's free, zero-revenue posture.
All rights to the original materials remain with the author and UCL.
Source-scope note: this lesson mirrors Silver's Lecture 1 (the RL paradigm
and the agent-environment setup) and restates it in Clawdemy's voice with
original framing. The three-paradigm split (supervised / unsupervised /
reinforcement), the explicit "what makes RL harder than supervised" list, the
designed-reward caveat, and the three-arm bandit walk-through used to make
exploration-vs-exploitation concrete are Clawdemy framing. The lesson does
not yet introduce MDPs or value functions; those are the next two lessons.
Exact per-lecture URLs are verified at promotion.

A short, durable list. Both are free.

  • David Silver, UCL RL course, Lecture 2: Markov Decision Processes (within the course above). The direct continuation: it formalizes the loop this lesson sketched into a Markov Decision Process, the setup for the rest of the track. The matching Clawdemy lesson is Track 17, lesson 2.
  • Richard Sutton and Andrew Barto, “Reinforcement Learning: An Introduction” (2nd edition, available freely at the authors’ page). The standard textbook the whole field references. Chapter 1 covers the same ground as this lesson at book length, with the multi-armed bandit fully worked in Chapter 2.

Where this leads inside this track and beyond.

  • Markov Decision Processes. The next lesson. It turns the loop here into a formal object (states, actions, transitions, rewards, discount), which the rest of the track relies on.
  • Value functions and the Bellman equations. Lesson 3. The mathematical heart of how RL reasons about long-run reward.
  • RLHF and DPO (AI Foundations, Track 5). A separate, more applied track. T5’s rlhf-and-dpo lesson covers the alignment side of using RL on large language models; this track teaches the RL mechanics that RLHF assumes, and lesson 10 closes the loop with an explicit bridge back to T5.